[
https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210025#comment-14210025
]
Jacques Nadeau commented on PARQUET-131:
----------------------------------------
Few thoughts:
- I agree with Brock's general comments about avoiding a canonical Parquet
representation of the in-memory data structure.
- For the getters/setters, we need to support both bulk transfer and primitive
transfer.
- We should avoid copies unless necessary. For example, in Drill we often
avoid copying variable-length data, instead using it as is.
- The interface should also take in a column-level filter expression
evaluator. Again, this should be a no-copy interface. While you may think that
vectorized reads make this unnecessary, we've found that it actually depends
entirely on the selectivity of the filter and on whether you are using
dictionary encoding.
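To make the dictionary-encoding point concrete, here is a minimal sketch (all names are hypothetical, not part of any proposed Parquet API) of why a column-level filter evaluator can be much cheaper on a dictionary-encoded column: the predicate runs once per distinct dictionary entry, and each row then needs only a boolean lookup by dictionary id.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: with dictionary encoding, a column-level filter is
// evaluated once per distinct dictionary value rather than once per row,
// which is why pushdown cost depends on dictionary encoding and selectivity.
public class DictionaryFilterSketch {

    // Evaluate the predicate against each distinct dictionary value once.
    static boolean[] evaluateOnDictionary(int[] dictionary, IntPredicate predicate) {
        boolean[] matches = new boolean[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            matches[i] = predicate.test(dictionary[i]);
        }
        return matches;
    }

    // Filter the encoded column by looking up the precomputed result per id;
    // returns how many rows survive the filter.
    static int countMatches(int[] dictionaryIds, boolean[] dictionaryMatches) {
        int count = 0;
        for (int id : dictionaryIds) {
            if (dictionaryMatches[id]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] dictionary = {10, 20, 30};    // 3 distinct values
        int[] ids = {0, 1, 1, 2, 0, 2, 1};  // encoded column of 7 rows
        boolean[] matches = evaluateOnDictionary(dictionary, v -> v >= 20);
        System.out.println(countMatches(ids, matches)); // prints 5
    }
}
```

If the filter is highly selective, this can skip decoding most values entirely; with low selectivity or no dictionary, a plain vectorized scan may win, which is the trade-off described above.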
I would also suggest that this be a replacement for the lower layers of the
Parquet reader rather than a secondary path; otherwise, we're always going to
have a partial implementation. We're very engaged in thinking through the
ideas here and are definitely going to be pushing this along.
One last thought: I'm not entirely convinced that this should be a
column-at-a-time interface. I've been thinking that a batch of records at a
time is more appropriate. Otherwise, too many internal concerns have to be
reimplemented, and fancy inter-column behaviors (as well as complex data
support) have to be implemented multiple times. On the flip side, I'm not sure
any other engines currently have vectorized readers for complex data, but
we're more than happy to push in that direction alone; people can fall back to
a higher-level, non-vectorized read interface for complex data.
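For illustration, a batch-of-records interface might look roughly like the following sketch (all class and method names are hypothetical): the reader hands back a batch whose columns are primitive vectors, and a filter produces a selection vector of qualifying row indices instead of copying rows, keeping inter-column behaviors in one place.

```java
// Hypothetical sketch of a batch-of-records interface: a batch of rows whose
// columns are primitive vectors, filtered via a selection vector (no copies).
public class RecordBatchSketch {

    // A minimal record batch: two columns with the same row count.
    static final class RecordBatch {
        final long[] idColumn;
        final double[] priceColumn;
        RecordBatch(long[] ids, double[] prices) {
            this.idColumn = ids;
            this.priceColumn = prices;
        }
        int rowCount() { return idColumn.length; }
    }

    // Column-level filter that fills a selection vector with the indices of
    // qualifying rows; the surviving row data is never copied.
    static int filterPriceAbove(RecordBatch batch, double threshold, int[] selection) {
        int selected = 0;
        for (int row = 0; row < batch.rowCount(); row++) {
            if (batch.priceColumn[row] > threshold) {
                selection[selected++] = row;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        RecordBatch batch = new RecordBatch(
            new long[]  {1L, 2L, 3L, 4L},
            new double[]{9.5, 20.0, 3.0, 42.0});
        int[] selection = new int[batch.rowCount()];
        int n = filterPriceAbove(batch, 10.0, selection);
        System.out.println(n);                            // prints 2
        System.out.println(batch.idColumn[selection[0]]); // prints 2
    }
}
```

Downstream operators can then read any column of the batch through the same selection vector, which is the inter-column behavior that would otherwise be reimplemented per engine in a column-at-a-time design.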
> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
> Key: PARQUET-131
> URL: https://issues.apache.org/jira/browse/PARQUET-131
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Zhenxiao Luo
> Assignee: Zhenxiao Luo
>
> Vectorized Query Execution can deliver big performance improvements for SQL
> engines like Hive, Drill, and Presto. Instead of processing one row at a
> time, Vectorized Query Execution streamlines operations by processing a
> batch of rows at a time. Within one batch, each column is represented as a
> vector of a primitive data type. SQL engines can apply predicates very
> efficiently on these vectors, avoiding the overhead of pushing a single row
> through all the operators before the next row can be processed.
> Since Parquet is an efficient columnar data representation, it would be nice
> if it supported Vectorized APIs, so that all SQL engines could read vectors
> from Parquet files and do vectorized execution over the Parquet file format.
>
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)