[ https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222774#comment-14222774 ]
Dong Chen commented on HIVE-8128:
---------------------------------
To improve Parquet vectorization, I think we need the following changes, which
should be based on PARQUET-131. These are initial thoughts; I will make them
more specific after working on the Parquet side for a while.
The changes assume that the RecordReader in Hive will receive data of type
{{ParquetVectorizedRowBatch}}.
1. The next() method of {{VectorizedParquetRecordReader}} should be
{{next(NullWritable key, ParquetVectorizedRowBatch outputBatch)}}. This will
let Hive fetch one vectorized batch of Parquet rows at a time.
2. A {{VectorizedParquetHiveSerDe}} will be added to convert
{{ParquetVectorizedRowBatch}} to the {{VectorizedRowBatch}} that Hive
recognizes. To make the conversion efficient, the Parquet vectorized API design
may need to take this into account. The more similar the two kinds of row batch
are, the better.
3. Partition support is already in trunk. Whether it works for Parquet should
be verified after the main work is done, with changes made if necessary.
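To make points 1 and 2 more concrete, here is a rough sketch of the reader and
conversion flow. It is only an illustration under stated assumptions: every
class in it ({{ParquetBatch}}, {{HiveBatch}}, {{Reader}}) is a hypothetical,
single-column stand-in, not the real Hive or Parquet type, and the key is typed
as {{Void}} in place of {{NullWritable}} so that the code has no Hadoop
dependency. The real API will depend on what PARQUET-131 ends up providing.

```java
// Hypothetical stand-ins only: none of these are the real Hive/Parquet classes.
final class ParquetVectorizationSketch {

    /** Stand-in for the proposed ParquetVectorizedRowBatch (one long column). */
    static final class ParquetBatch {
        final long[] values;
        final boolean[] isNull;
        int size; // number of valid rows currently held

        ParquetBatch(int capacity) {
            values = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Stand-in for Hive's VectorizedRowBatch, reduced to one long column. */
    static final class HiveBatch {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;
        int size;

        HiveBatch(int capacity) {
            vector = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Stand-in reader whose next() fills a whole batch per call (point 1). */
    static final class Reader {
        private final Long[] source; // null entries model SQL NULLs
        private int pos = 0;

        Reader(Long[] source) {
            this.source = source;
        }

        /** Fills outputBatch; returns false once the input is exhausted. */
        boolean next(Void key, ParquetBatch outputBatch) {
            int n = Math.min(outputBatch.values.length, source.length - pos);
            for (int i = 0; i < n; i++) {
                Long v = source[pos + i];
                outputBatch.isNull[i] = (v == null);
                outputBatch.values[i] = (v == null) ? 0L : v;
            }
            pos += n;
            outputBatch.size = n;
            return n > 0;
        }
    }

    /** Conversion step (point 2): identical layouts reduce it to bulk copies. */
    static void convert(ParquetBatch in, HiveBatch out) {
        System.arraycopy(in.values, 0, out.vector, 0, in.size);
        System.arraycopy(in.isNull, 0, out.isNull, 0, in.size);
        boolean noNulls = true;
        for (int i = 0; i < in.size; i++) {
            if (in.isNull[i]) {
                noNulls = false;
            }
        }
        out.noNulls = noNulls;
        out.size = in.size;
    }

    /** Reads every batch, converts it, and returns {batchCount, sumOfNonNulls}. */
    static long[] drain(Long[] source, int batchSize) {
        Reader reader = new Reader(source);
        ParquetBatch parquetBatch = new ParquetBatch(batchSize);
        HiveBatch hiveBatch = new HiveBatch(batchSize);
        long batches = 0;
        long sum = 0;
        while (reader.next(null, parquetBatch)) {
            convert(parquetBatch, hiveBatch);
            batches++;
            for (int i = 0; i < hiveBatch.size; i++) {
                if (hiveBatch.noNulls || !hiveBatch.isNull[i]) {
                    sum += hiveBatch.vector[i];
                }
            }
        }
        return new long[] {batches, sum};
    }

    public static void main(String[] args) {
        long[] r = drain(new Long[] {1L, 2L, null, 4L, 5L}, 2);
        System.out.println("batches=" + r[0] + " sum=" + r[1]);
    }
}
```

Because the two stand-in batches share the same column layout, the conversion
is little more than an array copy. If the real Parquet batch used a different
layout, the conversion would need a per-row loop instead, which is the cost
point 2 is trying to avoid.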
> Improve Parquet Vectorization
> -----------------------------
>
> Key: HIVE-8128
> URL: https://issues.apache.org/jira/browse/HIVE-8128
> Project: Hive
> Issue Type: Sub-task
> Reporter: Brock Noland
> Assignee: Dong Chen
>
> We'll want to finish the vectorization work (e.g. VectorizedOrcSerde), which
> was partially done in HIVE-5998.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)