[
https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210025#comment-14210025
]
Jacques Nadeau commented on PARQUET-131:
----------------------------------------
Few thoughts:
- I agree with Brock's general comments about avoiding a canonical Parquet
representation of the in-memory data structure.
- For the getters/setters, we need to support both bulk transfer and primitive
transfer.
- We should avoid copies unless necessary. For example, in Drill we often
avoid copying variable-length data, instead using it as is.
- The interface should also take in a column-level filter expression
evaluator. Again, this should be a no-copy interface. While you may think that
vectorized reads make this unnecessary, we've found that it actually depends
entirely on the selectivity of the filter and on whether you are using
dictionary encoding.
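To make the dictionary-encoding point concrete, here is a minimal sketch (all names are hypothetical, not part of any proposed Parquet API) of why a column-level filter evaluator can be much cheaper on a dictionary-encoded column: the predicate runs once per distinct dictionary entry, and each row then needs only a boolean lookup by dictionary id.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: with dictionary encoding, a column-level filter is
// evaluated once per distinct dictionary value rather than once per row,
// which is why pushdown cost depends on dictionary encoding and selectivity.
public class DictionaryFilterSketch {

    // Evaluate the predicate against each distinct dictionary value once.
    static boolean[] evaluateOnDictionary(int[] dictionary, IntPredicate predicate) {
        boolean[] matches = new boolean[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            matches[i] = predicate.test(dictionary[i]);
        }
        return matches;
    }

    // Filter the encoded column by looking up the precomputed result per id;
    // returns how many rows survive the filter.
    static int countMatches(int[] dictionaryIds, boolean[] dictionaryMatches) {
        int count = 0;
        for (int id : dictionaryIds) {
            if (dictionaryMatches[id]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] dictionary = {10, 20, 30};    // 3 distinct values
        int[] ids = {0, 1, 1, 2, 0, 2, 1};  // encoded column of 7 rows
        boolean[] matches = evaluateOnDictionary(dictionary, v -> v >= 20);
        System.out.println(countMatches(ids, matches)); // prints 5
    }
}
```

If the filter is highly selective, this can skip decoding most values entirely; with low selectivity or no dictionary, a plain vectorized scan may win, which is the trade-off described above.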
I would also suggest that this be a replacement for the lower layers of the
Parquet reader rather than a secondary path; otherwise, we're always going to
have a partial implementation. We're very engaged in thinking through the
ideas here and are definitely going to be pushing this along.
One last thought: I'm not entirely convinced that this should be a
column-at-a-time interface. I've been thinking that a batch of records at a
time is more appropriate. Otherwise, too many internal concerns have to be
reimplemented, and fancy inter-column behaviors (as well as complex data
support) have to be implemented multiple times. On the flip side, I'm not sure
any other engines currently have vectorized readers for complex data, but
we're more than happy to push in that direction alone; people can fall back to
a higher-level, non-vectorized read interface for complex data.
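For illustration, a batch-of-records interface might look roughly like the following sketch (all class and method names are hypothetical): the reader hands back a batch whose columns are primitive vectors, and a filter produces a selection vector of qualifying row indices instead of copying rows, keeping inter-column behaviors in one place.

```java
// Hypothetical sketch of a batch-of-records interface: a batch of rows whose
// columns are primitive vectors, filtered via a selection vector (no copies).
public class RecordBatchSketch {

    // A minimal record batch: two columns with the same row count.
    static final class RecordBatch {
        final long[] idColumn;
        final double[] priceColumn;
        RecordBatch(long[] ids, double[] prices) {
            this.idColumn = ids;
            this.priceColumn = prices;
        }
        int rowCount() { return idColumn.length; }
    }

    // Column-level filter that fills a selection vector with the indices of
    // qualifying rows; the surviving row data is never copied.
    static int filterPriceAbove(RecordBatch batch, double threshold, int[] selection) {
        int selected = 0;
        for (int row = 0; row < batch.rowCount(); row++) {
            if (batch.priceColumn[row] > threshold) {
                selection[selected++] = row;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        RecordBatch batch = new RecordBatch(
            new long[]  {1L, 2L, 3L, 4L},
            new double[]{9.5, 20.0, 3.0, 42.0});
        int[] selection = new int[batch.rowCount()];
        int n = filterPriceAbove(batch, 10.0, selection);
        System.out.println(n);                            // prints 2
        System.out.println(batch.idColumn[selection[0]]); // prints 2
    }
}
```

Downstream operators can then read any column of the batch through the same selection vector, which is the inter-column behavior that would otherwise be reimplemented per engine in a column-at-a-time design.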
> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
> Key: PARQUET-131
> URL: https://issues.apache.org/jira/browse/PARQUET-131
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Zhenxiao Luo
> Assignee: Zhenxiao Luo
>
> Vectorized Query Execution can deliver big performance improvements for SQL
> engines like Hive, Drill, and Presto. Instead of processing one row at a
> time, Vectorized Query Execution streamlines operations by processing a
> batch of rows at a time. Within one batch, each column is represented as a
> vector of a primitive data type. SQL engines can apply predicates very
> efficiently on these vectors, avoiding the overhead of pushing a single row
> through all the operators before the next row can be processed.
> Since Parquet is an efficient columnar data representation, it would be nice
> if it supported Vectorized APIs, so that all SQL engines could read vectors
> from Parquet files and do vectorized execution over the Parquet file format.
>
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)