[
https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232184#comment-14232184
]
Brock Noland commented on PARQUET-131:
--------------------------------------
Hello [~dongc],
Today the informal elements of the "parquet vectorization" team in the US met.
This included Zhenxiao, Daniel, and Eva for PrestoDB; Parth and Jason for
Drill; and [~spena] and myself for Hive. I of course thought to invite you, but
the rest of the team wanted to meet on-site, and I know it's very late in China...
h2. Questions
*Why does the Presto read API specify ColumnVector? Does it read one column at
a time?*
Presto has code which reads all columns in a loop, so it doesn't need the
batch API.
*The original API specified the Encoding; does the reader use the Encoding to
materialize values?*
ColumnVector will not expose the Encoding and won't materialize values until a
getter or initialize is called.
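To make that concrete, here is a rough sketch of how I read the discussion: a
vector that keeps the raw page bytes internally and only decodes on access,
plus the per-column read loop mentioned above. This is not the actual
parquet-mr API; the ColumnReader/nextVector names in the usage comment are
hypothetical:
{code:java}
// Rough sketch only -- my reading of the discussion, not the real
// parquet-mr API. The vector keeps the raw page bytes internally and
// decodes only when a getter asks for a value.
public interface ColumnVector {
  int size();                 // number of values in this batch
  boolean isNull(int index);  // derived from the definition levels
  long getLong(int index);    // materialization happens here, lazily
  // getInt, getDouble, getBinary, ... for the other primitive types
  // Deliberately no getEncoding(): the Encoding stays internal.
}

// Presto-style usage, one column at a time, so no batch-of-all-columns
// API is needed (ColumnReader and nextVector are hypothetical names):
// for (ColumnReader reader : columnReaders) {
//   ColumnVector v = reader.nextVector(batchSize);
//   // feed v into the engine's operator pipeline
// }
{code}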
*Does PrestoDB use ByteBuffers or primitive arrays (long[], etc.)?*
They use primitive arrays, like Hive. Drill uses native Buffers.
*If the API is not going to materialize values and hands back a raw Buffer, is
there a strategy for converting it to a long array without copying?*
We'll pass in an allocator which allocates the appropriate Buffer type. Presto
and Hive will allocate instances of, for example, {{LongBuffer}}, which gives
us access to the underlying primitive array.
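Roughly, the allocator idea could look like the sketch below. The
VectorBufferAllocator name is invented for illustration; only the java.nio
calls are real. A heap-allocated {{LongBuffer}} exposes its backing
{{long[]}} via {{array()}} with no copy, while a Drill-style allocator would
hand back a view of a direct (off-heap) buffer instead:
{code:java}
import java.nio.LongBuffer;

// Sketch of the allocator idea; VectorBufferAllocator is an invented
// name, only the java.nio calls are real.
interface VectorBufferAllocator {
  LongBuffer allocateLongs(int capacity);
}

public class AllocatorSketch {
  public static void main(String[] args) {
    // Hive/Presto-style allocator: heap-backed LongBuffer, so the
    // primitive long[] is reachable via array() with no copying.
    VectorBufferAllocator heap = LongBuffer::allocate;

    LongBuffer buf = heap.allocateLongs(1024);
    buf.put(0, 42L);                // the reader would decode into buf

    long[] values = buf.array();    // zero-copy view of the same storage
    System.out.println(buf.hasArray() + " " + values[0]); // true 42

    // A Drill-style allocator would hand back a view of a direct
    // (off-heap) buffer instead, where array() is not available:
    // java.nio.ByteBuffer.allocateDirect(1024 * 8).asLongBuffer();
  }
}
{code}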
h2. Next Steps
# Update the interface to remove Encoding, since the getters will materialize values
# Add an allocator interface (roughly as sketched above)
# Netflix will hack together a POC (Drill and Hive might build their POCs on top of it)
# The GSOC byte buffer patch is a prerequisite, so we should merge it soon
# Finish the implementation of the Parquet Vector* classes (part of the POC)
# Finish the Drill, Presto, and Hive implementations
[~dweeks-netflix] - in the meeting it was said that merging the GSOC buffer
patch ([PR 49?|https://github.com/apache/incubator-parquet-mr/pull/49])
depends on cutting some Parquet releases, such as parquet-mr 1.6/1.7 and
parquet-format 2.0. I chatted with [~rdblue] and he wasn't sure which releases
those would be - could you clarify?
> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
> Key: PARQUET-131
> URL: https://issues.apache.org/jira/browse/PARQUET-131
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Zhenxiao Luo
> Assignee: Zhenxiao Luo
> Attachments: Parquet-Vectorized-APIs.pdf, ParquetInPresto.pdf
>
>
> Vectorized query execution can bring big performance improvements to SQL
> engines like Hive, Drill, and Presto. Instead of processing one row at a
> time, vectorized query execution streamlines operations by processing a
> batch of rows at a time. Within one batch, each column is represented as a
> vector of a primitive data type. SQL engines can apply predicates very
> efficiently on these vectors, instead of pushing a single row through all
> the operators before the next row can be processed.
> Since Parquet is an efficient columnar data representation, it would be nice
> if it supported vectorized APIs, so that all SQL engines could read vectors
> from Parquet files and do vectorized execution over the Parquet file format.
>
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
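To illustrate the kind of per-batch predicate evaluation the description above
is after, here is a minimal sketch in the spirit of Hive's VectorizedRowBatch
layout (values plus null flags plus a selection vector); all names here are
illustrative, not part of any existing API:
{code:java}
// Minimal sketch of predicate evaluation on a column vector. The
// layout (values + null flags + a selection vector) mirrors Hive's
// VectorizedRowBatch idea, but the names are illustrative.
public final class FilterLongGreaterThan {
  /**
   * Writes the indices of rows with values[i] > threshold into
   * selected and returns the count. One tight loop per batch with no
   * per-row virtual dispatch -- this is what makes vectorized
   * execution fast.
   */
  public static int filter(long[] values, boolean[] isNull,
                           int batchSize, long threshold, int[] selected) {
    int count = 0;
    for (int i = 0; i < batchSize; i++) {
      if (!isNull[i] && values[i] > threshold) {
        selected[count++] = i;
      }
    }
    return count;
  }
}
{code}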