Re: UDF reflect

2014-04-02 Thread Szehon Ho
Hi, according to your description, the reflect UDF is trying to call
java.util.UUID.hashCode(uidString), which does not exist as a method on
either Java 6 or 7; UUID only has the no-argument instance method hashCode().

http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#hashCode()
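
Since reflect() resolves the target method from the Hive argument types, it
cannot reach the no-argument instance method UUID.hashCode(). One option is a
small custom UDF; below is a minimal, untested sketch that assumes Hive's
plain UDF base class (org.apache.hadoop.hive.ql.exec.UDF), with an
illustrative package and class name:

package example.udf;                       // illustrative package name

import java.util.UUID;
import org.apache.hadoop.hive.ql.exec.UDF;

// Parses the uid string and returns java.util.UUID.hashCode() of the result.
public final class UuidHashCode extends UDF {
  public Integer evaluate(String uidStr) {
    if (uidStr == null) {
      return null;                         // pass NULLs through like the built-ins do
    }
    // fromString() yields a real UUID instance, so this is UUID.hashCode(),
    // not String.hashCode().
    return UUID.fromString(uidStr).hashCode();
  }
}

After packaging it in a jar you would register it with ADD JAR and CREATE
TEMPORARY FUNCTION and call it as, say, uuid_hash(uid_str) in place of the
reflect() expression.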

Thanks
Szehon




On Wed, Apr 2, 2014 at 2:13 PM, Andy Srine  wrote:

> Hi guys,
>
>
> I am trying to use the reflect UDF for a UUID method and am getting an
> exception. I believe this method should be available in Java 1.6.0_31,
> which the system is running.
>
>
> select reflect("java.util.UUID", "hashCode", uid_str) my_uid,
>
> ...
>
>
> My suspicion is that this is because the Hive column I am calling this on
> is a string and not a UUID. So I nested the reflect calls as shown below to
> go from a string to a UUID first and then call hashCode on it.
>
>
> reflect("java.util.UUID", "hashCode", reflect("java.util.UUID",
> "fromString", uid_str)) my_uid,
>
>
> In either case, I always get the exception below, even though the row of
> data it prints has no null in the uid_str column. Any ideas?
>
>
>  at
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
>
> at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
>
> ... 8 more
>
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: UDFReflect
> getMethod
>
> at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:164)
>
> at
> org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)
>
> at
> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.getNewKey(KeyWrapperFactory.java:113)
>
> at
> org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:794)
>
> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>
> at
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>
> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>
> at
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>
> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>
> at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)
>
> ... 9 more
>
> Caused by: java.lang.NoSuchMethodException: java.util.UUID.hashCode(null)
>
> at java.lang.Class.getMethod(Class.java:1605)
>
> at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:160)
>
>
> Thanks,
>
> Andy
>
>
>


Re: Predicate pushdown optimisation not working for ORC

2014-04-02 Thread Abhay Bansal
I was able to resolve the issue by setting "hive.optimize.index.filter" to
true.

In the Hadoop logs I now see:
syslog:2014-04-03 05:44:51,204 INFO
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included column ids =
3,8,13
syslog:2014-04-03 05:44:51,204 INFO
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included columns names =
sourceipv4address,sessionid,url
syslog:2014-04-03 05:44:51,216 INFO
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: ORC pushdown predicate:
leaf-0 = (EQUALS sourceipv4address 1809657989)

I can now see the ORC pushdown predicate.
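
For completeness, a hedged sketch of applying the same fix from a Java client
over JDBC (this assumes HiveServer2 with the standard org.apache.hive.jdbc
driver; host, port and credentials are placeholders, and the filter value is
taken from the log above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrcPushdownCheck {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = conn.createStatement();
    // Without this setting the ORC reader logs "No ORC pushdown predicate".
    stmt.execute("SET hive.optimize.index.filter=true");
    ResultSet rs = stmt.executeQuery(
        "SELECT sourceipv4address, sessionid, url FROM test "
            + "WHERE sourceipv4address = 1809657989");
    while (rs.next()) {
      System.out.println(rs.getString("url"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}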

Thanks,
-Abhay


On Thu, Apr 3, 2014 at 11:14 AM, Stephen Boesch  wrote:

> Hi Abhay,
>   What is the DDL for your "test" table?
>
>
> 2014-04-02 22:36 GMT-07:00 Abhay Bansal :
>
>> I am new to Hive; apologies for asking such a basic question.
>>
>> The following exercise was done with Hive 0.12 and Hadoop 0.20.203.
>>
>> I created an ORC file from Java and pushed it into a table with the same
>> schema. I checked the conf property hive.optimize.ppd, which is set to
>> true and should enable the PPD optimisation.
>>
>> I ran a query "select sourceipv4address,sessionid,url from test where
>> sourceipv4address="dummy";"
>>
>> Just to see if the ppd optimization is working I checked the hadoop logs
>> where I found
>>
>> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
>> 05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
>> column ids = 3,8,13
>> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
>> 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
>> columns names = sourceipv4address,sessionid,url
>> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
>> 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: *No
>> ORC pushdown predicate*
>>
>>  I am not sure which part of it I missed. Any help would be appreciated.
>>
>> Thanks,
>> -Abhay
>>
>
>


Re: Predicate pushdown optimisation not working for ORC

2014-04-02 Thread Stephen Boesch
Hi Abhay,
  What is the DDL for your "test" table?


2014-04-02 22:36 GMT-07:00 Abhay Bansal :

> I am new to Hive; apologies for asking such a basic question.
>
> The following exercise was done with Hive 0.12 and Hadoop 0.20.203.
>
> I created an ORC file from Java and pushed it into a table with the same
> schema. I checked the conf property hive.optimize.ppd, which is set to
> true and should enable the PPD optimisation.
>
> I ran a query "select sourceipv4address,sessionid,url from test where
> sourceipv4address="dummy";"
>
> Just to see if the ppd optimization is working I checked the hadoop logs
> where I found
>
> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
> 05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
> column ids = 3,8,13
> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
> 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
> columns names = sourceipv4address,sessionid,url
> ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
> 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: *No
> ORC pushdown predicate*
>
>  I am not sure which part of it I missed. Any help would be appreciated.
>
> Thanks,
> -Abhay
>


Predicate pushdown optimisation not working for ORC

2014-04-02 Thread Abhay Bansal
I am new to Hive; apologies for asking such a basic question.

The following exercise was done with Hive 0.12 and Hadoop 0.20.203.

I created an ORC file from Java and pushed it into a table with the same
schema. I checked the conf property hive.optimize.ppd, which is set to true
and should enable the PPD optimisation.

I ran a query "select sourceipv4address,sessionid,url from test where
sourceipv4address="dummy";"

Just to see if the ppd optimization is working I checked the hadoop logs
where I found

./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
column ids = 3,8,13
./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
columns names = sourceipv4address,sessionid,url
./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: *No ORC
pushdown predicate*

 I am not sure which part of it I missed. Any help would be appreciated.

Thanks,
-Abhay


UDF reflect

2014-04-02 Thread Andy Srine
Hi guys,


I am trying to use the reflect UDF for a UUID method and am getting an
exception. I believe this method should be available in Java 1.6.0_31,
which the system is running.


select reflect("java.util.UUID", "hashCode", uid_str) my_uid,

...


My suspicion is that this is because the Hive column I am calling this on is
a string and not a UUID. So I nested the reflect calls as shown below to go
from a string to a UUID first and then call hashCode on it.


reflect("java.util.UUID", "hashCode", reflect("java.util.UUID",
"fromString", uid_str)) my_uid,


In either case, I always get the exception below, even though the row of data
it prints has no null in the uid_str column. Any ideas?


 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)

at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)

... 8 more

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: UDFReflect
getMethod

at
org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:164)

at
org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)

at
org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.getNewKey(KeyWrapperFactory.java:113)

at
org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:794)

at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

at
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)

at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)

at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)

... 9 more

Caused by: java.lang.NoSuchMethodException: java.util.UUID.hashCode(null)

at java.lang.Class.getMethod(Class.java:1605)

at
org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:160)


Thanks,

Andy


Re: bugs in 0.12 version

2014-04-02 Thread Juraj jiv
Regarding HIVE-6113:
Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
Duplicate entry 'default' for key 'UNIQUE_DATABASE'
It looks to me like the problem is on your side.

JV


On Tue, Apr 1, 2014 at 2:55 PM, Lior Schachter  wrote:

> Hi all,
>
> We are randomly getting two types of exceptions while inserting data into
> Hive. It seems we have encountered the HIVE-6114 and HIVE-6113 issues.
>
> Both issues are critical, but there is no patch or workaround.
> It seems we need to downgrade back to 0.11.
>
> I'd appreciate your advice,
> Lior
>


Re: Deserializing into multiple records

2014-04-02 Thread David Quigley
Makes perfect sense, thanks Petter!


On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <
petter.von.dolw...@gmail.com> wrote:

> Hi David,
>
> you can implement a custom InputFormat (extending
> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
> RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The
> RecordReader will be used to read your documents, and from there you can
> decide which units you will return as records (returned by the next()
> method). You will still probably need a SerDe that transforms your data
> into Hive data types using a 1:1 mapping.
>
> In this way you can choose to duplicate your data only while your query
> runs (and possibly in the results), avoiding JOIN operations, while the raw
> files contain no duplicate data.
>
> Something like this:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>   myfield1 STRING,
>   myfield2 INT)
>   PARTITIONED BY (your_partition_if_applicable STRING)
>   ROW FORMAT SERDE 'quigley.david.myserde'
>   STORED AS INPUTFORMAT 'quigley.david.myinputformat' OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>   LOCATION 'mylocation';
>
>
> Hope this helps.
>
> Br,
> Petter
>
>
>
>
> 2014-04-02 5:45 GMT+02:00 David Quigley :
>
>> We are currently streaming complex documents to HDFS with the hope of
>> being able to query them. Each single document logically breaks down into
>> a set of individual records. In order to use Hive, we preprocess each
>> input document into a set of discrete records, which we save on HDFS and
>> create an external table on top of.
>>
>> This approach works, but we end up duplicating a lot of data in the
>> records. It would be much more efficient to deserialize the document into a
>> set of records when a query is made. That way, we can just save the raw
>> documents on HDFS.
>>
>> I have looked into writing a custom SerDe:
>>
>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>
>> It looks like the input record => deserialized record still needs to be a
>> 1:1 relationship. Is there any way to deserialize a record into multiple
>> records?
>>
>> Thanks,
>> Dave
>>
>
>


Re: Deserializing into multiple records

2014-04-02 Thread Petter von Dolwitz (Hem)
Hi David,

you can implement a custom InputFormat (extending
org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
RecordReader (implementing org.apache.hadoop.mapred.RecordReader). The
RecordReader will be used to read your documents, and from there you can
decide which units you will return as records (returned by the next()
method); see the sketch just below. You will still probably need a SerDe
that transforms your data into Hive data types using a 1:1 mapping.
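
To make that concrete, here is a minimal, untested sketch of such an
InputFormat/RecordReader pair using the old mapred API that Hive table
definitions expect. The package, class names and the newline-based "parsing"
are purely illustrative; the real document-to-records logic goes where the
comment indicates.

package example.io;                         // illustrative package name

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Reads one whole document per split; next() hands out one logical record
// at a time, so a single raw document becomes several Hive rows.
public class MyDocInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;                           // keep each document in one split
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyDocRecordReader((FileSplit) split, job);
  }

  static class MyDocRecordReader implements RecordReader<LongWritable, Text> {
    private final Queue<String> records = new LinkedList<String>();
    private final long totalRecords;
    private long emitted = 0;

    MyDocRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      byte[] buf = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(split.getStart(), buf);
      } finally {
        in.close();
      }
      // Placeholder "parsing": one record per non-empty line. A real reader
      // would parse the document structure here and enqueue its records.
      for (String rec : new String(buf, "UTF-8").split("\n")) {
        if (rec.length() > 0) {
          records.add(rec);
        }
      }
      totalRecords = records.size();
    }

    public boolean next(LongWritable key, Text value) {
      String rec = records.poll();
      if (rec == null) {
        return false;                       // no more records in this document
      }
      key.set(emitted++);
      value.set(rec);                       // the SerDe maps this to Hive columns
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return emitted; }
    public float getProgress() {
      return totalRecords == 0 ? 1f : (float) emitted / totalRecords;
    }
    public void close() { }
  }
}

The table DDL further down then wires this class in through STORED AS
INPUTFORMAT.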

In this way you can choose to duplicate your data only while your query runs
(and possibly in the results), avoiding JOIN operations, while the raw files
contain no duplicate data.

Something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
  myfield1 STRING,
  myfield2 INT)
  PARTITIONED BY (your_partition_if_applicable STRING)
  ROW FORMAT SERDE 'quigley.david.myserde'
  STORED AS INPUTFORMAT 'quigley.david.myinputformat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'mylocation';


Hope this helps.

Br,
Petter




2014-04-02 5:45 GMT+02:00 David Quigley :

> We are currently streaming complex documents to HDFS with the hope of
> being able to query them. Each single document logically breaks down into
> a set of individual records. In order to use Hive, we preprocess each
> input document into a set of discrete records, which we save on HDFS and
> create an external table on top of.
>
> This approach works, but we end up duplicating a lot of data in the
> records. It would be much more efficient to deserialize the document into a
> set of records when a query is made. That way, we can just save the raw
> documents on HDFS.
>
> I have looked into writing a custom SerDe:
>
> Object deserialize(org.apache.hadoop.io.Writable blob)
>
> It looks like the input record => deserialized record still needs to be a
> 1:1 relationship. Is there any way to deserialize a record into multiple
> records?
>
> Thanks,
> Dave
>