Problem creating external table to a sequence file on HDFS

2013-04-08 Thread Ranjitha Chandrashekar
Hi

I am trying to create an external table over a sequence file on HDFS.

I have my own InputFormat and a SerDe, compiled into a jar and added to Hive. In spite of this, I get the following error.

Please suggest.

hive> create table seq (key STRING, value STRING)
    > ROW FORMAT SERDE 'com.org.SequenceFileKeyRecordReader'
    > STORED AS INPUTFORMAT 'com.org.SequenceFileKeyInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
    > LOCATION '/user/hiveFiles/';
FAILED: Error in metadata: Cannot validate serde: com.org.SequenceFileKeyRecordReader
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

Thanks
Ranjitha.
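(A note on the error above: com.org.SequenceFileKeyRecordReader looks, by its name, like a RecordReader rather than a SerDe, and Hive validates the class named in ROW FORMAT SERDE as a SerDe implementation. Below is a minimal sketch of the expected shape of the DDL; com.org.SequenceFileKeySerDe and the jar path are hypothetical, used only for illustration, and do not come from the original post.)

-- add the jar containing the custom classes to the session (path is illustrative)
ADD JAR /path/to/custom-serde.jar;

-- hypothetical SerDe class implementing org.apache.hadoop.hive.serde2.SerDe
CREATE EXTERNAL TABLE seq (key STRING, value STRING)
ROW FORMAT SERDE 'com.org.SequenceFileKeySerDe'
STORED AS
  INPUTFORMAT 'com.org.SequenceFileKeyInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/user/hiveFiles/';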






Where should the Hive process be installed?

2013-04-08 Thread rohithsharma
Hi

 

I am using Hive-0.9.0 + Hadoop-2.0.1 with 2 machines:

Machine-1: NameNode, SecondaryNameNode, ResourceManager and Hive

Machine-2: Proxy server, JHS (JobHistoryServer), DataNode and NodeManager.

 

Problem:

When I execute job queries, e.g. "select count(key) from src" on the table src, the job gets killed.

The exception in the application log was:

 

 

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
java.io.FileNotFoundException: /tmp/rohith/hive_2013-04-08_15-36-36_587_2956773547832982704/-mr-10001/41253f2c-9b15-49e0-a419-32ea4559e4e6 (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at java.io.FileInputStream.<init>(FileInputStream.java:79)
    at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:215)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:161)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:382)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:154)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:149)
 

 

Please let us know where exactly HiveServer should run.

 

Thanks & Regards

Rohith Sharma K S
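(One avenue worth checking, offered as an assumption rather than a confirmed diagnosis: the missing file above is the serialized map-reduce plan, which Hive writes under its scratch directory. If that directory resolves to a local /tmp path on Machine-1, tasks running on Machine-2 cannot read it. A sketch of pointing the scratch directory at a location visible to all nodes; hive.exec.scratchdir is a real Hive property, the value below is illustrative:)

SET hive.exec.scratchdir=/tmp/hive-scratch;   -- an HDFS path readable by all nodes
select count(key) from src;                   -- rerun the failing query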



Re: Hive Problems Reading Avro+Snappy Data

2013-04-08 Thread Thomas, Matthew
Thanks Chuck.

I think the problem is the job configuration on the query. I logged back into the system this morning, started a new Hive client shell, and issued a series of more complex queries against the Avro+Snappy table; they all worked fine. So I started trying to recall what could have been different between my new Hive client shell and the old one returning NULLs. I am able to reproduce the NULLs by setting hive.exec.compress.output=true. A brand new Hive client has that set to false and all the queries come back normal, but the second I set it to true, the NULLs return.

Best,

Matt
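(A workaround sketch consistent with that observation, not a verified fix; HIVE-3308 tracks the underlying bug. The idea is to enable job-output compression only around the INSERT, and turn it off again before querying the table:)

-- write path: Snappy inside the Avro container
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM identical_table_inGzip_format;

-- read path: with compression off, queries return real rows instead of NULLs
SET hive.exec.compress.output=false;
SELECT count(*) FROM mthomas_testavro;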

From: Connell, Chuck chuck.conn...@nuance.com
Reply-To: user@hive.apache.org
Date: Sunday, April 7, 2013 7:32 PM
To: user@hive.apache.org
Subject: RE: Hive Problems Reading Avro+Snappy Data
Subject: RE: Hive Problems Reading Avro+Snappy Data

When you do SELECT *, Hive does not run a real MapReduce job, so it is not a good test. Something is wrong with the SerDe or InputFormat.

Chuck


From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Sunday, April 07, 2013 5:41 PM
To: user@hive.apache.org
Subject: Hive Problems Reading Avro+Snappy Data
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems performing complex queries on Avro+Snappy data. If I do a SELECT * FROM Blah LIMIT 50, I see the data coming back as it should. But if I perform any more complex query, such as SELECT count(*) FROM Blah, I receive several rows of NULL values. My workflow for how I created the table is described below, along with some of the setup.

- I am running CDH4.2 with Avro 1.7.3

hive> select * From mthomas_testavro limit 1;
OK
Field1      Field2
03-19-2013  a
03-19-2013  b
03-19-2013  c
03-19-2013  c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
…
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
…


CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES (
'avro.schema.literal'='{
  "namespace": "hello.world",
  "name": "some_schema",
  "type": "record",
  "fields": [
    { "name": "field1", "type": "string" },
    { "name": "field2", "type": "string" }
  ]
}');

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM 
identical_table_inGzip_format;

If I cat the output file in the external table, I see Objavro.codec^Lsnappyavro.schema?{type… at the beginning, followed by the rest of the schema and the binary data, so I am assuming the Snappy compression worked. Furthermore, I also tried to query this table via Impala and both queries worked just fine.

Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308 ?

Any ideas?

Thanks.

Matt


RE: Hive Problems Reading Avro+Snappy Data

2013-04-08 Thread Bennie Schut
Just so you know, there is still at least one bug when using Avro with compression like Snappy:
https://issues.apache.org/jira/browse/HIVE-3308
There's a simple one-line patch, but unfortunately it's not committed yet.




StackOverflowError when adding jars from multiple threads concurrently

2013-04-08 Thread Wangwenli
Hi All,
Recently we found that when multiple JDBC connections concurrently add jars, HiveServer throws a StackOverflowError while serializing the MapRedWork to HDFS. I found the related issue HIVE-2666, which is similar, but I think it missed the concurrent scenario.

I found it is because the classloader is changed, which leads to an infinite loop. Has anyone met this issue? Any suggestions?


Regards
Wenli


RE: Syntax for filters on timestamp data type

2013-04-08 Thread LUTTER, Steffen
Hi Mark,

Correct, I just did some tests and the cast is the way to go. While implicit casts work for comparison operations (equal, diff, ...), this is not the case for the IN clause. I think it should work there too, as eventually IN just translates to a disjunction of comparisons, so it should behave the same.

Anyway, I have a working solution now. For the record, I paste two working example queries below.

Thanks a lot for your help!

Steffen

Example 1: SELECT * FROM table1 WHERE datecol = CAST('2009-01-17 00:00:00' AS timestamp)
Example 2: SELECT * FROM table1 WHERE datecol IN (CAST('2009-01-11 00:00:00' AS timestamp), CAST('2009-01-08 00:00:00' AS timestamp))
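(The error message quoted further down explains both behaviors: UNIX_TIMESTAMP returns a bigint, and equality implicitly converts between timestamp and bigint while IN, in this Hive version, does not. A quick way to see the two types side by side:)

SELECT UNIX_TIMESTAMP('2009-01-17 00:00:00') FROM table1 LIMIT 1;     -- bigint (seconds since epoch)
SELECT CAST('2009-01-17 00:00:00' AS timestamp) FROM table1 LIMIT 1;  -- timestamp, matching datecol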

From: Mark Grover [mailto:grover.markgro...@gmail.com]
Sent: 05 April 2013 18:43
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

Steffen,
One thing that may be different is that equals can cast operands to make the comparison work, but that may not be true for IN. FWIW, this is just me speculating; I haven't looked at the code yet.

Perhaps you could use explicit casting to get around this?
On Fri, Apr 5, 2013 at 7:36 AM, LUTTER, Steffen steffen.lut...@sap.com wrote:
Equal, not equal, less than, less or equal, greater than, greater or equal all 
work. Also the function execution in the IN clause seems to work, as the error 
message states that the result type is bigint. Following the error message, it 
expects the input as timestamp, but I couldn't find a syntax to express 
timestamps in HiveQL.

Two questions remain:


1)  How to express timestamps in HiveQL?

2)  Why doesn't the IN clause support comparisons between timestamp and 
bigint, if equal and so on does?

Thanks for any thought in this,

Steffen

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 05 April 2013 16:11
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

I am not sure the IN clause supports executing functions in the query.

Did it fail when you tried the less-than/greater-than type?

On Fri, Apr 5, 2013 at 7:36 PM, LUTTER, Steffen steffen.lut...@sap.com wrote:
Hi,

I have a question regarding filters on timestamps. The syntax seems to be UNIX_TIMESTAMP('yyyy-MM-dd hh:mm:ss'); is there another way to express a datetime type? The problem is that I get an exception when using the IN list syntax, while the equal comparison works without problems.

Example: SELECT * FROM table1 WHERE datecol IN (UNIX_TIMESTAMP('2009-01-05 00:00:00'), UNIX_TIMESTAMP('2009-01-10 00:00:00'))

Throws exception:

Caused by: java.sql.SQLException: Query returned non-zero code: 10014, cause: FAILED: SemanticException [Error 10014]: Line 5:21 Wrong arguments ''2009-01-10 00:00:00'': The arguments for IN should be the same type! Types are: {timestamp IN (bigint, bigint)}
    at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
    at org.apache.hadoop.hive.jdbc.HiveStatement.execute(HiveStatement.java:127)
    at com.sap.connectivity.cs.java.drivers.jdbc.JDBCAPI$Statement.execute(JDBCAPI.java:1648)
    at com.sap.connectivity.cs.java.drivers.jdbc.JDBCDriver.prepare(JDBCDriver.java:1760)
    ... 15 more

The following query works:

SELECT * FROM table1 WHERE datecol = UNIX_TIMESTAMP('2009-01-17 00:00:00')

Is there another syntax for datetime types? Could it be a bug in the filter IN 
list operation?

Thanks in advance,

Steffen




--
Nitin Pawar



Re: Syntax for filters on timestamp data type

2013-04-08 Thread Nitin Pawar
great

so can we assume that the equals comparison can auto-cast but not the IN / range statement?


--
Nitin Pawar


RE: Syntax for filters on timestamp data type

2013-04-08 Thread LUTTER, Steffen
Exactly

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 08 April 2013 16:06
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

great

so can we assume that the equals comparison can auto-cast but not the IN / range statement?



Re: Huge join performance issue

2013-04-08 Thread Igor Tatarinov
Did you verify that all your available mappers are running (and reducers too)? If you have a small number of partitions with huge files, you might be underutilizing mappers (check that the files are being split). Also, it might be optimal to have a single wave of reducers by setting the number of reduce tasks appropriately.

You might also consider optimizing a simpler query first:

select t1.a, count(*)
from (select a from baseTB1 where ...) t1    -- filter by partition as well
  join
(select a from baseTB2 where ...) t2         -- filter by partition as well
on t1.a = t2.a
group by t1.a

just to give you an idea how much overhead the extra columns are adding. If the columns are pretty big, they could be causing the slowdown.

igor
decide.com
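(As a sketch of those two checks, using standard Hadoop/Hive settings with illustrative values to tune per cluster:)

SET mapred.max.split.size=268435456;   -- allow large files to split into more map tasks
SET mapred.reduce.tasks=200;           -- e.g. roughly the cluster's reduce capacity, for one wave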


On Sat, Apr 6, 2013 at 2:30 PM, Gabi D gabi...@gmail.com wrote:

 Thank you for your answer Nitin.
 Does anyone have additional insight into this? It would be greatly appreciated.


 On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 You don't really need subqueries to join tables that have common columns; they are an additional overhead. The best way to filter your data and speed up processing is how you lay out your data. When you have larger tables, I would use partitioning and bucketing to trim down the data and improve join performance (a sketch follows below).

 distribute by is mainly used when you have custom map-reduce scripts and want to use the transform functionality in Hive. I have not used it a lot, so I am not sure on that part. It is also helpful to write where clauses in join statements to reduce the dataset you want to join.
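 (A sketch of the layout Nitin describes, using the thread's table and column names; the partition column, bucket count, and join flag are illustrative assumptions:)

 CREATE TABLE baseTB1_bucketed (a STRING, b STRING, c STRING)
 PARTITIONED BY (dt STRING)            -- prune partitions in the WHERE clause
 CLUSTERED BY (a) INTO 64 BUCKETS;     -- bucket on the join key

 SET hive.optimize.bucketmapjoin=true; -- with both tables bucketed on a, Hive can use a bucket map join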



 On Thu, Apr 4, 2013 at 5:53 PM, Gabi D gabi...@gmail.com wrote:

 Hi all,
 I have two tables I need to join and then summarize. They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5T of intermediate data.

 The current query looks like this:

 select t1.b, t1.c, t2.d, t2.e, count(*)
 from (select a,b,c from baseTB1 where ...) t1    -- filter by partition as well
   join
 (select a,d,e from baseTB2 where ...) t2         -- filter by partition as well
 on t1.a = t2.a
 group by t1.b, t1.c, t2.d, t2.e


 Two questions:
 1. Would joining baseTB1 and baseTB2 directly (instead of via subqueries) be better in any way?
    (I know subqueries cause a lot of writes of the intermediate data, but we also understand it's best to filter down the data being joined; which is more correct?)
 2. Can I use 'distribute by' and/or 'sort by' in some way that would help? My understanding at the moment is that the problem lies in the fact that the reduces are keyed on column a while the group by is on column b ...

 Any thoughts would be appreciated.




 --
 Nitin Pawar





Should the SQL schema for Hive be identical to the SQL schema for a relational database?

2013-04-08 Thread Matthieu Labour
Hi

It would be terrific to get some advice on migrating a schema from an RDBMS to Hive.

Should the SQL schema for Hive be identical to the SQL schema for a PostgreSQL/MySQL database?

Specifically:

I have an application that generates events that look like the following:

{"ts": N+1, "userId": 123, "event": "location", "payload": {"verticalAccuracy": 10, "longitude": -73.99718090313884, "latitude": 40.72473278788106, "altitude": 27.79653739929199, "horizontalAccuracy": 65}}
{"ts": N+2, "userId": 123, "event": "addProduct", "payload": ["cart", "osprey-kestrel-48"]}
...

Events are being written to persistent storage (AWS S3). A 'worker' wakes up periodically, reads the newly received events, and inserts them into a PostgreSQL database. The database has user, product, user_product (user_id, product_id, action:(viewed|wishlist...), timestamp), location, etc. tables.

We are migrating to Hive.

Should we also create user, product, user_product, location, etc. as Hive tables and have a MapReduce job process the event files to populate them? Or should/can we implement a different schema that lets external Hive tables map directly to the generated event files? Or a mix of both?

Thank you for your help!

-matt


RE: Should the SQL schema for Hive be identical to the SQL schema for a relational database?

2013-04-08 Thread Mike Liddell
Some general guidance would be to aim for minimal JOINs in your regular queries. Thus the biggest change from a normal-form RDBMS schema is to denormalize, such that joins do not come into play until a query has already performed data reduction via filtering or aggregation. This implies a star schema comprising one primary fact table with sufficient data in it to sensibly partition it and support direct filtering and aggregation. Supplementing the main table will be dimension tables that provide additional data to flesh out result sets.

It looks like your data will be highly amenable to this (see the sketch below).

-mike.
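(A sketch of that direction, with hypothetical names throughout; the S3 path, columns, and partition scheme are assumptions for illustration. Raw events land in an external table, a denormalized fact table is populated from it, and small dimension tables supplement it:)

-- external table mapped directly onto the event files (hypothetical path)
CREATE EXTERNAL TABLE events_raw (json STRING)
LOCATION 's3n://my-bucket/events/';

-- denormalized fact table, partitioned for direct filtering
CREATE TABLE user_events_fact (
  ts BIGINT,
  user_id STRING,
  event STRING,
  product_id STRING
)
PARTITIONED BY (dt STRING);

-- small dimension table, joined in only after filtering/aggregation
CREATE TABLE product_dim (product_id STRING, title STRING);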
