On 02/27/2015 12:59 PM, java8964 wrote:
I just joined the list today, and it is so quiet here that I began to doubt whether I had
joined at all.
Anyway, I'll give it a try with a question that is currently blocking me.
Most datasets on our production Hadoop cluster are currently stored in Avro +
Snappy format. I have heard lots of good things about Parquet and want to give it a
try.
I followed this web page,
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/,
to change one of our ETL jobs to generate Parquet files, instead of Avro, as the
output of our reducer. I used Parquet with our existing Avro schema to produce the final
output data, plus the Snappy codec. Everything works fine, so the final Parquet output
files should have the same schema as our original Avro files.
Now I am trying to create a Hive table on top of these Parquet files. The distribution we
use, IBM BigInsights 3.0, ships Hive 0.12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx'
The table was created successfully in Hive 0.12, and I can "desc table" it without any
problem.
But when I tried to query the table, like "select * from table limit 2", I got the
following error:

Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column.
My questions are the following:
1) Does Parquet support nested arrays of structs?
2) Is this problem specific to Parquet 1.3.2? Is there any workaround on Parquet 1.3.2?
3) If I have to use a later version of Parquet to fix the problem above, and Parquet 1.3.2
is still present at runtime, will that cause any issues?
4) Can I use all of Hive's features, like "explode" on nested structures, on the Parquet
data? (A rough sketch of such a query follows below.)
What we are looking for is to know whether Parquet can be used the same way we currently
use Avro, while giving us the columnar-storage benefits that Avro is missing.
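For illustration, the kind of query behind question 4 would look roughly like this; the
table name and aliases are made up, and field1 is the nested column from the DDL sketch
above:

-- rough sketch only: table name and aliases are hypothetical
SELECT t.col1, e.sub1, e.sub2, e.date_value
FROM xxx t
LATERAL VIEW explode(t.field1) ex AS e;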
Thanks
Yong
Hi Yong,
There were some issues with nested types in Hive's Parquet support until
the last release or so. We've fixed them and added requirements to the
Parquet specification to help us get all of the data models (like Hive
or Avro) compatible with one another. This includes
backward-compatibility rules to read data written with any older version.
I think the easiest way to get this working is to update Hive; according to the issue
tracking the fix, HIVE-8909 [1], this should be fixed in version 1.1.0.
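As a rough sketch of what the table could look like once you're on a newer Hive (column
names abbreviated from your DDL above; the location is a placeholder), you can also use
Hive's built-in Parquet support instead of the deprecated parquet.hive classes, since
"STORED AS PARQUET" has been available since Hive 0.13:

-- sketch only: abbreviated column list, placeholder location
CREATE EXTERNAL TABLE xxx (
  col1 BIGINT,
  col2 STRING,
  field1 ARRAY<STRUCT<sub1:STRING, sub2:STRING, date_value:BIGINT>>
)
STORED AS PARQUET
LOCATION '/path/to/parquet/output';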
rb
[1]: https://issues.apache.org/jira/browse/HIVE-8909
--
Ryan Blue
Software Engineer
Cloudera, Inc.