[ 
https://issues.apache.org/jira/browse/HIVE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated HIVE-5998:
-------------------------------

    Status: Patch Available  (was: Open)

This fix provides vectorized execution on top of the normal 
ParquetInputFormat. No changes are required to the table declaration. 
This implementation does not cross the border between Hive and Parquet, and as 
such it uses the existing Hive Parquet record reader, which is row mode. The 
vectorized output is 'shallow': batches are built on top of the row-mode 
reader by iterating over its rows. This is not optimal for vectorized 
execution, but nonetheless this first step provides the benefits of the 
vectorized operators for the Parquet format. Going forward, a deep vectorized 
reader would be required, but such an endeavour requires changes on the 
Parquet side of the border (the parquet-mr project). Bringing Hive 
dependencies like VectorizationContext and VectorizedRowBatch into parquet-mr 
is not feasible right now imho (there are bandwidth/capacity issues from 
me/Eric/Jitendra, but also engineering issues, like circular dependencies). 
A deep vectorized reader inside parquet-mr would have to be based on a design 
that considers other possible vectorized engine consumers (hint: Pig). 
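
To illustrate what 'shallow' means here, the following is a minimal sketch 
and not the code in the patch. All names in it (ShallowVectorizedReader, 
LongBatch, nextBatch) are hypothetical stand-ins: a plain Iterator<long[]> 
plays the role of the row-mode record reader, and LongBatch plays the role of 
a column batch. It assumes every column is a long and ignores nulls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: none of these types are Hive or parquet-mr classes.
final class ShallowVectorizedReader {

    /** Stand-in for a column batch: one long[] per column plus a row count. */
    static final class LongBatch {
        final long[][] cols;   // cols[c][r] holds column c of row r
        final int capacity;    // maximum number of rows per batch
        int size;              // number of valid rows currently in the batch

        LongBatch(int numCols, int capacity) {
            this.cols = new long[numCols][capacity];
            this.capacity = capacity;
        }
    }

    // Stand-in for the existing row-mode record reader.
    private final Iterator<long[]> rows;

    ShallowVectorizedReader(Iterator<long[]> rows) {
        this.rows = rows;
    }

    /**
     * Fills the batch by pulling rows one at a time from the row-mode source
     * and copying each field into its column array.
     * Returns false once no rows remain.
     */
    boolean nextBatch(LongBatch batch) {
        batch.size = 0;
        while (batch.size < batch.capacity && rows.hasNext()) {
            long[] row = rows.next();
            for (int c = 0; c < batch.cols.length; c++) {
                batch.cols[c][batch.size] = row[c];
            }
            batch.size++;
        }
        return batch.size > 0;
    }

    public static void main(String[] args) {
        List<long[]> data = Arrays.asList(
            new long[] {1, 10}, new long[] {2, 20}, new long[] {3, 30});
        ShallowVectorizedReader reader = new ShallowVectorizedReader(data.iterator());
        LongBatch batch = new LongBatch(2, 2);
        while (reader.nextBatch(batch)) {
            System.out.println("batch with " + batch.size + " row(s)");
        }
    }
}
```

The per-row copy is the cost of staying on the Hive side of the border: 
downstream operators see column batches, but the read itself is still done 
row by row. A deep reader would fill the column arrays directly from the 
columnar file data.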

> Add vectorized reader for Parquet files
> ---------------------------------------
>
>                 Key: HIVE-5998
>                 URL: https://issues.apache.org/jira/browse/HIVE-5998
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>            Priority: Minor
>         Attachments: HIVE-5998.1.patch
>
>
> HIVE-5783 is adding native Parquet support in Hive. As Parquet is a columnar 
> format, it makes sense to provide a vectorized reader, as the RC and ORC 
> formats have, to benefit from the vectorized execution engine.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
