RE: Hive Problems Reading Avro+Snappy Data

2013-04-08 Thread Bennie Schut
Just so you know, there is still at least one bug when using Avro with compression codecs such as Snappy:
https://issues.apache.org/jira/browse/HIVE-3308
There is a simple one-line patch, but unfortunately it has not been committed yet.
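Until that patch is committed, here is a workaround sketch based on what this thread reports (the table name is the test table from the messages below; treat the exact setting combination as an assumption drawn from the thread, not a verified fix):

-- Leave MapReduce job-output compression off so Avro reads return real rows
SET hive.exec.compress.output=false;
-- Compress the data inside the Avro container itself instead
SET avro.output.codec=snappy;
-- Sanity check: should return a real count, not NULL rows
SELECT count(*) FROM mthomas_testavro;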




Re: Hive Problems Reading Avro+Snappy Data

2013-04-08 Thread Thomas, Matthew
Thanks Chuck.

I think the problem is the job configuration on the query.  I logged back into 
the system this morning and started a new Hive client shell and issued a series 
of more complex queries against the Avro+Snappy table and they all worked fine. 
 So I started trying to recall what could have been different between my new 
Hive client shell and the old one returning NULLs.  I am able to reproduce the 
NULLs being returned by setting "SET hive.exec.compress.output=true;".  A brand 
new Hive client has that set to false and all the queries come back normal, but 
the second I set it to true the NULLs return.
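For anyone who wants to reproduce it, a minimal sketch from a fresh shell (same table as in my original post, quoted further down the thread; the behavior flips on that one setting alone):

hive> SET hive.exec.compress.output=false;
hive> SELECT count(*) FROM mthomas_testavro;   -- correct count returned
hive> SET hive.exec.compress.output=true;
hive> SELECT count(*) FROM mthomas_testavro;   -- NULL rows returned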

Best,

Matt



RE: Hive Problems Reading Avro+Snappy Data

2013-04-07 Thread Connell, Chuck
When you do SELECT *, Hive does not run a real MapReduce job, so it is not a 
good test. Something is wrong with SerDe or InputFormat.
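Any aggregate or filter will force a real job; for example (a sketch only, the table name is a placeholder):

SELECT count(*) FROM your_avro_table;                  -- aggregate runs MapReduce
SELECT field1 FROM your_avro_table WHERE field2 <> ''; -- a filter does too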

Chuck


From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Sunday, April 07, 2013 5:41 PM
To: user@hive.apache.org
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems running "complex" queries on Avro+Snappy data.  If I do a "SELECT * FROM Blah LIMIT 50", the data comes back as it should.  But if I run any more complex query, such as "SELECT count(*) FROM Blah", I receive several rows of NULL values.  My workflow for creating the table is described below, along with some of the setup.

- I am running CDH4.2 with Avro 1.7.3

hive> select * From mthomas_testavro limit 1;
OK
Field1 Field2
03-19-2013 a
03-19-2013 b
03-19-2013 c
03-19-2013 c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
...
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
...


CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "hello.world",
"name": "some_schema",
"type": "record",
"fields": [
{ "name":"field1","type":"string"},
{ "name":"field2","type":"string"}
]
}')
;

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM identical_table_inGzip_format;

If I cat the output file in the external table, I see 
"Objavro.codec^Lsnappyavro.schema?{"type"..." at the beginning followed by the 
rest of the schema and binary data.  So I am assuming the snappy compression 
worked.  Furthermore, I also tried to query this table via Impala and both 
queries worked just fine.
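
A less fragile check than cat-ing the binary file is the avro-tools getmeta command, which prints the avro.codec and avro.schema headers (a sketch; the jar version and the part-file name below are assumptions):

hadoop fs -get /tmp/testavro/000000_0 local.avro
java -jar avro-tools-1.7.3.jar getmeta local.avro   # shows avro.codec=snappy if compression worked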

Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308?

Any ideas?

Thanks.

Matt
"This message (including any attachments) is intended only for the use of the 
individual or entity to which it is addressed, and may contain information that 
is non-public, proprietary, privileged, confidential and exempt from disclosure 
under applicable law or may be constituted as attorney work product. If you are 
not the intended recipient, you are hereby notified that any use, 
dissemination, distribution, or copying of this communication is strictly 
prohibited. If you have received this message in error, notify sender 
immediately and delete this message immediately."