RE: Hive Problems Reading Avro+Snappy Data
Just so you know, there is still at least one open bug when using Avro with compression such as Snappy: https://issues.apache.org/jira/browse/HIVE-3308. There is a simple one-line patch, but unfortunately it has not been committed yet.

From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Monday, April 08, 2013 1:59 PM
To: user@hive.apache.org
Subject: Re: Hive Problems Reading Avro+Snappy Data

Thanks Chuck. I think the problem is the job configuration on the query. I logged back into the system this morning, started a new Hive client shell, and issued a series of more complex queries against the Avro+Snappy table; they all worked fine. So I tried to recall what could have been different between my new Hive client shell and the old one that returned NULLs. I can reproduce the NULLs by issuing "SET hive.exec.compress.output=true;". A brand-new Hive client has that set to false and all queries come back normal, but the second I set it to true, the NULLs return.

Best,

Matt

From: Chuck [mailto:chuck.conn...@nuance.com]
Reply-To: user@hive.apache.org
Date: Sunday, April 7, 2013 7:32 PM
To: user@hive.apache.org
Subject: RE: Hive Problems Reading Avro+Snappy Data

When you do SELECT *, Hive does not run a real MapReduce job, so it is not a good test. Something is wrong with the SerDe or InputFormat.

Chuck

From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Sunday, April 07, 2013 5:41 PM
To: user@hive.apache.org
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems performing "complex" queries on Avro+Snappy data. If I do a "SELECT * FROM Blah LIMIT 50", I see the data coming back as it should. But if I perform any more complex query, such as "SELECT count(*) FROM Blah", I receive several rows of NULL values.
My workflow for creating the table is described below, along with some of the setup. I am running CDH4.2 with Avro 1.7.3.

hive> select * From mthomas_testavro limit 1;
OK
Field1      Field2
03-19-2013  a
03-19-2013  b
03-19-2013  c
03-19-2013  c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
...
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
...

CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "hello.world",
  "name": "some_schema",
  "type": "record",
  "fields": [
    { "name":"field1","type":"string"},
    { "name":"field2","type":"string"}
  ]
}');

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro
SELECT * FROM identical_table_inGzip_format;

If I cat the output file in the external table, I see "Objavro.codec^Lsnappyavro.schema?{"type"..." at the beginning, followed by the rest of the schema and binary data, so I am assuming the Snappy compression worked. Furthermore, I also tried to query this table via Impala, and both queries worked just fine. Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308?

Any ideas? Thanks.

Matt

"This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed, and may contain information that is non-public, proprietary, privileged, confidential and exempt from disclosure under applicable law or may be constituted as attorney work product.
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this message in error, notify sender immediately and delete this message immediately."
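[Editor's note on the header check above: the "Objavro.codec^Lsnappy" prefix Matt sees is the Avro object-container magic ("Obj" plus a non-printing version byte 0x01) followed by the file-metadata map; "^L" is 0x0C, the zig-zag varint encoding of 6, the byte length of "snappy". A minimal stand-alone sketch that decodes the header and reports the codec — not from the thread, and the function names are illustrative:]

```python
import io

def _read_long(f):
    """Decode one Avro zig-zag base-128 varint long from a binary stream."""
    n, shift = 0, 0
    while True:
        b = f.read(1)[0]
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)       # undo zig-zag encoding

def avro_codec(src):
    """Return the 'avro.codec' entry from an Avro container file header
    ('null' when the file is uncompressed and the entry is absent)."""
    f = io.BytesIO(src) if isinstance(src, bytes) else open(src, "rb")
    try:
        if f.read(4) != b"Obj\x01":
            raise ValueError("not an Avro object container file")
        meta = {}
        while True:
            count = _read_long(f)    # metadata map comes in blocks
            if count == 0:
                break                # zero block count terminates the map
            if count < 0:            # negative count: a byte size follows
                _read_long(f)
                count = -count
            for _ in range(count):
                key = f.read(_read_long(f)).decode()
                val = f.read(_read_long(f))
                meta[key] = val
        return meta.get("avro.codec", b"null").decode()
    finally:
        f.close()
```

Against the file described above this should report "snappy"; against an uncompressed Avro file it reports "null". Note it only confirms the Avro-internal codec set by avro.output.codec, which is separate from the Hive-level hive.exec.compress.output setting that triggers the NULLs.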