[ https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ganesha Shreedhara updated HIVE-25494:
--------------------------------------
    Labels: schema-evolution  (was: )

> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25494
>                 URL: https://issues.apache.org/jira/browse/HIVE-25494
>             Project: Hive
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 3.1.2
>            Reporter: Ganesha Shreedhara
>            Priority: Major
>              Labels: schema-evolution
>         Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but 
> present in the table schema, and columns are accessed by name, the 
> requestedSchema sent from Hive to the Parquet storage layer includes a type 
> even for the missing field, because we always add a primitive type when a 
> field is missing from the file schema (Ref: 
> [code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
> On the Parquet side, this missing field gets pruned, and since it belongs 
> to a struct type, Parquet ends up creating a GroupColumnIO without any 
> children. This causes the query to fail with an IndexOutOfBoundsException; 
> the stack trace is given below.
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>  at java.util.ArrayList.get(ArrayList.java:433)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
>  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
>  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
> {code}
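>
> For illustration, the schema mismatch can be sketched as follows. This is an 
> assumption inferred from the table DDL in the reproduction steps below, not 
> the literal output of any tool; the exact schema of the attached 
> test-struct.parquet may differ:
> {code:java}
> // file schema: parent.extracol was never written to the file
> message hive_schema {
>   optional group parent {
>     optional binary child (UTF8);
>   }
>   optional binary toplevel (UTF8);
> }
>
> // requestedSchema sent by Hive: the missing field extracol is
> // added with a primitive type by DataWritableReadSupport
> message hive_schema {
>   optional group parent {
>     optional binary extracol (UTF8);
>   }
>   optional binary toplevel (UTF8);
> }
> {code}
> When Parquet prunes extracol (absent from the file schema) from the requested 
> projection, the parent group is left with no children, and building the 
> record reader over that empty GroupColumnIO throws the 
> IndexOutOfBoundsException above.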
>  
> Steps to reproduce:
>  
> {code:java}
> CREATE TABLE parquet_struct_test(
> `parent` struct<child:string,extracol:string> COMMENT '',
> `toplevel` string COMMENT '')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>  
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet
> {code}
>  
> Same query works fine in the following scenarios:
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive>  select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL toplevel
> {code}
>  
> 2) When VectorizedParquetRecordReader is used
> {code:java}
> hive> set hive.fetch.task.conversion=none;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
> ----------------------------------------------------------------------------------------------
> OK
> NULL toplevel
> {code}
>  
> 3) Create a copy of the same table and run the same query on the newly 
> created table. 
> {code:java}
> hive> create table parquet_struct_test_copy like parquet_struct_test;
> OK
> hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
> Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
> ----------------------------------------------------------------------------------------------
> Loading data to table default.parquet_struct_test_copy
> OK
> hive> select parent.extracol, toplevel from parquet_struct_test_copy;
> OK
> NULL toplevel
> {code}
>  
> Also, this issue doesn't occur when only the missing struct field is 
> selected, or when all the columns in the table are selected. It occurs only 
> when a missing struct field and another existing column are selected 
> together.
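>
> To summarize the failing and working selections described above:
> {code:java}
> -- fails: missing struct field selected together with an existing column
> select parent.extracol, toplevel from parquet_struct_test;
>
> -- works: only the missing struct field selected
> select parent.extracol from parquet_struct_test;
>
> -- works: all columns selected
> select * from parquet_struct_test;
> {code}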
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
