[jira] [Commented] (PARQUET-131) Supporting Vectorized APIs in Parquet

Zhenxiao Luo (JIRA) Mon, 17 Nov 2014 15:57:18 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215442#comment-14215442
 ]


Zhenxiao Luo commented on PARQUET-131:
--------------------------------------

[~brocknoland] The gist is updated with a ColumnVector interface. We are still 
discussing with the Drill team whether to use primitive arrays, ByteBuffer, 
or byte[] for the setters and getters.
[~jnadeau] I just updated the gist to use a ByteBuffer, hoping that Drill, 
Hive, and Presto could all use this kind of generalized ByteBuffer. I will 
spend time reading Drill's code to see what other tricks it uses. However, 
some relevant articles seem to show that ByteBuffer is not as efficient/fast 
as primitive arrays:
http://www.evanjones.ca/software/java-bytebuffers.html
https://groups.google.com/forum/#!topic/mechanical-sympathy/9I18sXm4bvY
http://imranrashid.com/posts/profiling-bytebuffers/
I still think primitive arrays could be the most efficient option. Anyway, 
let's continue the discussion.
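
To make the trade-off concrete, here is a minimal sketch of the two backing 
stores behind one accessor interface. It is only an illustration: IntVector, 
ArrayIntVector, BufferIntVector, and VectorDemo are hypothetical names, not 
the ColumnVector interface from the gist, and the sketch makes no claim about 
which variant is faster.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical accessor interface for illustration only;
// this is NOT the ColumnVector interface from the gist.
interface IntVector {
    int size();
    int getInt(int index);
    void setInt(int index, int value);
}

// Variant 1: values live in a primitive int[].
final class ArrayIntVector implements IntVector {
    private final int[] values;

    ArrayIntVector(int size) { this.values = new int[size]; }

    public int size() { return values.length; }
    public int getInt(int index) { return values[index]; }
    public void setInt(int index, int value) { values[index] = value; }
}

// Variant 2: values live in a ByteBuffer, addressed by byte offset.
final class BufferIntVector implements IntVector {
    private final ByteBuffer buffer;
    private final int size;

    BufferIntVector(int size) {
        this.size = size;
        // Heap buffer in native byte order; a direct buffer would use allocateDirect.
        this.buffer = ByteBuffer.allocate(size * Integer.BYTES)
                                .order(ByteOrder.nativeOrder());
    }

    public int size() { return size; }
    // Absolute get/put: the buffer position is not modified.
    public int getInt(int index) { return buffer.getInt(index * Integer.BYTES); }
    public void setInt(int index, int value) { buffer.putInt(index * Integer.BYTES, value); }
}

final class VectorDemo {
    // The same consumer loop works against either backing store.
    static long sum(IntVector vector) {
        long total = 0;
        for (int i = 0; i < vector.size(); i++) {
            total += vector.getInt(i);
        }
        return total;
    }
}
```

Either variant can be handed to the same consumer code, so the choice mostly 
affects per-access cost and how easily engines can share the memory.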

> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
>                 Key: PARQUET-131
>                 URL: https://issues.apache.org/jira/browse/PARQUET-131
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> Vectorized Query Execution could bring a big performance improvement for SQL 
> engines like Hive, Drill, and Presto. Instead of processing one row at a 
> time, Vectorized Query Execution streamlines operations by processing a 
> batch of rows at a time. Within one batch, each column is represented as a 
> vector of a primitive data type. SQL engines can apply predicates very 
> efficiently on these vectors, instead of sending a single row through all 
> the operators before the next row can be processed.
> Since Parquet is an efficient columnar data representation, it would be nice 
> if it supported Vectorized APIs, so that all SQL engines could read vectors 
> from Parquet files and do vectorized execution on the Parquet file format.
>  
> Detailed proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
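
The batch-at-a-time idea in the description above can be sketched roughly as 
follows. This is a hedged illustration, not Parquet's or any engine's actual 
API: the class BatchFilter and its methods are hypothetical, and a column is 
assumed to be a plain long[] holding one batch of values.

```java
// Hypothetical helper for illustration; not part of Parquet, Hive, Drill, or Presto.
final class BatchFilter {

    // Applies the predicate "value > threshold" to a whole column vector in one
    // tight loop. Indices of matching rows are written into 'selected', and the
    // number of matches is returned.
    static int selectGreaterThan(long[] column, int batchSize, long threshold, int[] selected) {
        int count = 0;
        for (int row = 0; row < batchSize; row++) {
            if (column[row] > threshold) {
                selected[count++] = row;
            }
        }
        return count;
    }

    // A downstream operator then only visits the rows that survived the filter.
    static long sumSelected(long[] column, int[] selected, int count) {
        long total = 0;
        for (int i = 0; i < count; i++) {
            total += column[selected[i]];
        }
        return total;
    }
}
```

For a batch holding the values 5, 12, 7, 20 and a threshold of 10, the filter 
selects rows 1 and 3, and the downstream sum only touches those two rows.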



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
