On 02/27/2015 12:59 PM, java8964 wrote:
I just joined the list today, and it is so quiet here that I began to doubt whether I had
joined at all.
Anyway, I'll give it a try with a question that is currently blocking me.
Most datasets on our production Hadoop cluster are currently stored in Avro +
Snappy format. I have heard lots of good things about Parquet and want to give it a
try.
I followed this web page,
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/,
to change one of our ETL jobs to generate Parquet files, instead of Avro, as the
output of our reducer. I used Parquet with our existing Avro schema to produce the final
output data, plus the Snappy codec. Everything works fine, so the final Parquet output
files should have the same schema as our original Avro files.
Now I am trying to create a Hive table on top of these Parquet files. The distribution we
use, IBM BigInsights 3.0, ships Hive 0.12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx'
The table was created successfully in Hive 0.12, and I can "desc table" it without any
problem.
But when I tried to query the table, like "select * from table limit 2", I got the
following error:

Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column.
My questions are the following:
1) Does Parquet support nested arrays of structs?
2) Is this problem specific to Parquet 1.3.2? Is there any workaround on Parquet 1.3.2?
3) If I have to use a later version of Parquet to fix the problem above, and Parquet 1.3.2
is still present at runtime, will that cause any issues?
4) Can I use all of Hive's features, like "explode" on nested structures, on the Parquet
data? (A rough sketch of such a query follows below.)
What we are looking for is to know whether Parquet can be used the same way we currently
use Avro, while giving us the columnar-storage benefits that Avro is missing.
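For illustration, the kind of query behind question 4 would look roughly like this; the
table name and aliases are made up, and field1 is the nested column from the DDL sketch
above:

-- rough sketch only: table name and aliases are hypothetical
SELECT t.col1, e.sub1, e.sub2, e.date_value
FROM xxx t
LATERAL VIEW explode(t.field1) ex AS e;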
Thanks
Yong
Hi Yong,
There were some issues with nested types in Hive's Parquet support until
the last release or so. We've fixed them and added requirements to the
Parquet specification to help us get all of the data models (like Hive
or Avro) compatible with one another. This includes
backward-compatibility rules to read data written with any older version.
I think the easiest way to get this working is to update Hive; according to the issue
tracking the fix, HIVE-8909 [1], this should be fixed in version 1.1.0.
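As a rough sketch of what the table could look like once you're on a newer Hive (column
names abbreviated from your DDL above; the location is a placeholder), you can also use
Hive's built-in Parquet support instead of the deprecated parquet.hive classes, since
"STORED AS PARQUET" has been available since Hive 0.13:

-- sketch only: abbreviated column list, placeholder location
CREATE EXTERNAL TABLE xxx (
  col1 BIGINT,
  col2 STRING,
  field1 ARRAY<STRUCT<sub1:STRING, sub2:STRING, date_value:BIGINT>>
)
STORED AS PARQUET
LOCATION '/path/to/parquet/output';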
rb
[1]: https://issues.apache.org/jira/browse/HIVE-8909
--
Ryan Blue
Software Engineer
Cloudera, Inc.