[ https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229555#comment-14229555 ]

Dong Chen commented on PARQUET-131:
-----------------------------------

Hi,

After digging into the code more, I have some thoughts based on the design 
proposal. 

To describe these thoughts clearly, I have uploaded a doc, 
{{Parquet-Vectorized-APIs.pdf}}.

The general ideas are:
* Parquet's internal readers currently read one row at a time. I don't think 
we have to add a parallel series of Readers for vectorization. We could keep 
the existing readers and just add methods like {{readBatch(T next, int size)}} 
(see the first sketch after this list).
* {{ColumnReader.Binding}} is responsible for binding the low-level 
{{ValuesReader}} to the customized record {{Converter}} that materializes 
records. We can add new concrete Binding classes in Parquet and new customized 
Converter classes in SQL engines like Hive and Drill. The loaded raw primitive 
data could then be materialized into records in whatever representation the 
SQL engine expects (see the second sketch after this list).
This solution decouples Parquet's iterative raw-data reading from the SQL 
engines' vectorized record materialization. Parquet would not have to organize 
the primitive data itself; it would just load the data iteratively for 
vectorized consumption, and the SQL engines could organize the data as they 
like.
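
To make the first idea concrete, here is a minimal sketch of what a 
{{readBatch}}-style method could look like. This is plain illustrative Java: 
{{IterativeIntReader}} and {{BatchReadSketch}} are hypothetical stand-ins for 
Parquet's internal readers, not real classes.

{code:java}
public class BatchReadSketch {

  /** Stand-in for an existing iterative reader (one value per call). */
  interface IterativeIntReader {
    boolean hasNext();
    int readInteger();
  }

  /**
   * Hypothetical readBatch: reuses the single-value read in a loop to fill
   * a caller-supplied vector; returns the number of values actually read.
   */
  static int readBatch(IterativeIntReader reader, int[] vector, int size) {
    int n = 0;
    while (n < size && reader.hasNext()) {
      vector[n++] = reader.readInteger();
    }
    return n;
  }

  public static void main(String[] args) {
    // Toy reader over a fixed array, just to exercise the sketch.
    IterativeIntReader toy = new IterativeIntReader() {
      private final int[] data = {1, 2, 3, 4, 5};
      private int pos = 0;
      public boolean hasNext() { return pos < data.length; }
      public int readInteger() { return data[pos++]; }
    };
    int[] vector = new int[4];
    int read;
    while ((read = readBatch(toy, vector, vector.length)) > 0) {
      System.out.println("read a batch of " + read + " values");
    }
  }
}
{code}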
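
For the second idea, here is a similarly hedged sketch of the 
Binding/Converter split: the Parquet side only pushes raw primitives through 
the binding, while the engine-side Converter decides the in-memory layout 
(here, a flat int vector). None of these names ({{IntBatchConverter}}, 
{{bind}}) are real Parquet or Hive classes.

{code:java}
import java.util.Arrays;

public class BindingSketch {

  /** Engine-side converter: owns the vectorized in-memory representation. */
  interface IntBatchConverter {
    void addInt(int value);   // called once per raw primitive value
    int[] finishBatch();      // engine retrieves the filled vector
  }

  /** Parquet-side binding: feeds raw primitives into the converter. */
  static void bind(int[] rawColumnChunk, IntBatchConverter converter) {
    for (int value : rawColumnChunk) {
      converter.addInt(value);
    }
  }

  public static void main(String[] args) {
    // A toy engine-side converter that materializes a flat int vector.
    IntBatchConverter toVector = new IntBatchConverter() {
      private int[] buf = new int[4];
      private int n = 0;
      public void addInt(int value) {
        if (n == buf.length) buf = Arrays.copyOf(buf, n * 2);
        buf[n++] = value;
      }
      public int[] finishBatch() {
        return Arrays.copyOf(buf, n);
      }
    };
    bind(new int[] {10, 20, 30, 40, 50}, toVector);
    System.out.println(Arrays.toString(toVector.finishBatch()));
  }
}
{code}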


> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
>                 Key: PARQUET-131
>                 URL: https://issues.apache.org/jira/browse/PARQUET-131
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> Vectorized Query Execution can bring a big performance improvement for SQL 
> engines like Hive, Drill, and Presto. Instead of processing one row at a 
> time, Vectorized Query Execution streamlines operations by processing a 
> batch of rows at a time. Within one batch, each column is represented as a 
> vector of a primitive data type. SQL engines can apply predicates very 
> efficiently on these vectors, avoiding having a single row go through all 
> the operators before the next row can be processed.
> As Parquet is an efficient columnar data representation, it would be nice 
> if it supported Vectorized APIs, so that all SQL engines could read vectors 
> from Parquet files and do vectorized execution over the Parquet file format.
>  
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
