[
https://issues.apache.org/jira/browse/HIVE-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363373#comment-16363373
]
Vihang Karajgaonkar commented on HIVE-18553:
--------------------------------------------
+1 (pending tests) LGTM.
> Support schema evolution in Parquet Vectorization reader
> --------------------------------------------------------
>
> Key: HIVE-18553
> URL: https://issues.apache.org/jira/browse/HIVE-18553
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 3.0.0, 2.4.0, 2.3.2
> Reporter: Vihang Karajgaonkar
> Assignee: Ferdinand Xu
> Priority: Major
> Attachments: HIVE-18553.10.patch, HIVE-18553.11.patch,
> HIVE-18553.2.patch, HIVE-18553.3.patch, HIVE-18553.4.patch,
> HIVE-18553.5.patch, HIVE-18553.6.patch, HIVE-18553.7.patch,
> HIVE-18553.8.patch, HIVE-18553.9.patch, HIVE-18553.patch,
> test_result_based_on_HIVE-18553.xlsx
>
>
> Schema evolution covers the following cases:
> 1. Column changes
>    - column reorder
>    - column add, column delete
>    - column rename
> 2. Type conversion
>    - low precision to high precision
>    - type to String
> For the first category, the code currently does not support column
> addition. The detailed error is as follows:
> {code}
> 0: jdbc:hive2://localhost:10000/default> desc test_p;
> +-----------+------------+----------+
> | col_name | data_type | comment |
> +-----------+------------+----------+
> | t1 | tinyint | |
> | t2 | tinyint | |
> | i1 | int | |
> | i2 | int | |
> +-----------+------------+----------+
> 0: jdbc:hive2://localhost:10000/default> set hive.fetch.task.conversion=none;
> 0: jdbc:hive2://localhost:10000/default> set hive.vectorized.execution.enabled=true;
> 0: jdbc:hive2://localhost:10000/default> alter table test_p add columns (ts timestamp);
> 0: jdbc:hive2://localhost:10000/default> select * from test_p;
> Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
> {code}
> The following exception is seen in the logs:
> {code}
> Caused by: java.lang.IllegalArgumentException: [ts] BINARY is not in the store: [[i1] INT32, [i2] INT32, [t1] INT32, [t2] INT32] 3
>     at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:160) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.buildVectorizedParquetReader(VectorizedParquetRecordReader.java:479) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:432) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:393) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:345) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:88) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:167) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:52) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:229) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:142) ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) ~[hadoop-mapreduce-client-common-3.0.0-alpha3-cdh6.x-SNAPSHOT.jar:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_121]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_121]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
>     at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
> {code}
> For the second category (type conversion), the non-vectorized Parquet reader
> leverages the existing Parquet String inspector to do the conversion, while
> the vectorized path does not. To support this, this JIRA provides an
> abstraction layer that reads the underlying data and converts it to the type
> Hive requires for further computation.
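
As a rough illustration of the abstraction layer described above, the sketch below picks a converter based on the type stored in the file and the type the table schema now asks for. It is not the code from the attached patches: `TypeConversionSketch`, `FileType`, `HiveType`, `ValueConverter`, and `converterFor` are hypothetical names invented for this example, and it covers only the two conversions listed in the description (low precision to high precision, and type to String).

```java
public class TypeConversionSketch {
    // Hypothetical stand-ins: the physical type found in the Parquet file
    // and the type requested by the (evolved) Hive table schema.
    enum FileType { INT32, INT64, DOUBLE }
    enum HiveType { INT, BIGINT, DOUBLE, STRING }

    /** Hypothetical abstraction: turns a raw file value into the requested type. */
    interface ValueConverter {
        Object convert(Object raw);
    }

    /** Chooses a converter once per column, so the per-value path stays simple. */
    static ValueConverter converterFor(FileType from, HiveType to) {
        switch (to) {
            case STRING:
                // "type to String": any supported file type can be rendered as text.
                return raw -> String.valueOf(raw);
            case BIGINT:
                // Widening an integer type to 64 bits loses no information.
                if (from == FileType.INT32 || from == FileType.INT64) {
                    return raw -> ((Number) raw).longValue();
                }
                break;
            case DOUBLE:
                // Numeric value presented as a double.
                return raw -> ((Number) raw).doubleValue();
            case INT:
                // No conversion needed when the types already match.
                if (from == FileType.INT32) {
                    return raw -> raw;
                }
                break;
        }
        throw new IllegalArgumentException("Unsupported conversion: " + from + " -> " + to);
    }

    public static void main(String[] args) {
        System.out.println(converterFor(FileType.INT32, HiveType.BIGINT).convert(7));
        System.out.println(converterFor(FileType.INT32, HiveType.STRING).convert(7));
    }
}
```

Resolving the converter per column rather than per value is a design assumption of this sketch; narrowing conversions (for example INT64 to INT) are rejected here rather than guessed at.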