[ https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222902#comment-14222902 ]

Dong Chen commented on PARQUET-131:
-----------------------------------

Hi [~zhenxiao], [~brocknoland], [~jnadeau]
I am working on HIVE-8128 and found this issue. Thank you for creating and 
discussing it. From the Hive perspective, I hope the feedback below helps.

1. The current implementation of vectorization in Hive, for reference (a 
simplified sketch follows this list).
* HIVE-4160 mainly uses the data structures {{VectorizedRowBatch}} and 
{{ColumnVector}} to feed the vectorized SQL engine.
* {{VectorizedRowBatch}} has an array of {{ColumnVector}} to hold the data of 
each column, plus an int size indicating the number of rows in the batch.
* {{ColumnVector}} has boolean flags such as noNulls and isRepeating, which 
help the engine skip some data. Its subclasses representing concrete types 
(e.g. Long) hold an array of primitive data.
* To generate the {{VectorizedRowBatch}}, a new method nextBatch() was added 
to the ORC file reader, which delegates each column to a type-appropriate 
vectorized reader to load the data. This is similar to the VectorReader in 
Zhenxiao's design.
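
To make the structures concrete, here is a simplified sketch of their shape. 
Field names follow HIVE-4160, but this is not the exact Hive source:

{code:java}
// Simplified sketch of the Hive vectorization structures described above;
// field names follow HIVE-4160, but this is not the exact Hive source.
abstract class ColumnVector {
    boolean noNulls;      // true: engine can skip per-row null checks
    boolean isRepeating;  // true: vector[0] applies to every row in the batch
    boolean[] isNull;     // per-row null flags, used when noNulls is false
}

// Concrete subclass for a 64-bit integer column.
class LongColumnVector extends ColumnVector {
    long[] vector;        // primitive array, one value per row
}

class VectorizedRowBatch {
    ColumnVector[] cols;  // one vector per projected column
    int size;             // number of rows in this batch
}
{code}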

2. A few thoughts.
* I agree with Jacques's comment about building a batch of records at a time. 
Maybe a class {{ParquetRowBatch}} could be added to hold the columns.
* {{ColumnVector}} could carry the boolean indicators for null or repeating 
values, since they are computed and set while extracting and building data 
from the storage layer. These flags in the vector provide useful information 
to SQL engines.
* How about giving a length parameter to the VectorReader? A SQL engine may 
want to specify the number of rows fetched in a batch.
* A rough idea: add a readBatch() method to {{InternalParquetRecordReader<T>}}. 
When vector mode is on, the reader will invoke this method to get a 
{{ParquetRowBatch}}. SQL engines like Hive and Drill then convert this batch 
to the type they need. The primitive arrays in the batch's vectors should make 
the conversion efficient: the conversion reads the values in 
{{ParquetRowBatch}} and sets them into the engine's own XxxRowBatch object, 
which is sent on to the SQL engine. (A sketch of this API follows below.)
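
To illustrate the rough idea, a hedged sketch of the proposed API. 
{{ParquetRowBatch}}, readBatch(), and the maxRows parameter are illustrative 
names only; none of this exists in parquet-mr yet:

{code:java}
import java.io.IOException;

// Reuses the ColumnVector shape sketched in the previous block.
abstract class ColumnVector {
    boolean noNulls;
    boolean isRepeating;
}

// Hypothetical batch container; name and fields are illustrative only.
class ParquetRowBatch {
    ColumnVector[] columns;  // one vector per requested column
    int size;                // number of rows actually filled
}

class InternalParquetRecordReader<T> {
    // existing row-at-a-time API omitted

    /**
     * Fill at most maxRows rows into the reusable batch. The caller-supplied
     * limit lets the SQL engine choose its preferred batch size.
     * Returns the number of rows read, or 0 at end of input.
     */
    int readBatch(ParquetRowBatch batch, int maxRows) throws IOException {
        // Sketch only: each column would delegate to a type-specific
        // vectorized reader that fills a primitive array and sets the
        // noNulls / isRepeating flags while decoding.
        throw new UnsupportedOperationException("proposal sketch");
    }
}
{code}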

I will keep working on this to add more detail, and will join the discussion.

> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
>                 Key: PARQUET-131
>                 URL: https://issues.apache.org/jira/browse/PARQUET-131
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> Vectorized Query Execution could have big performance improvement for SQL 
> engines like Hive, Drill, and Presto. Instead of processing one row at a 
> time, Vectorized Query Execution could streamline operations by processing a 
> batch of rows at a time. Within one batch, each column is represented as a 
> vector of a primitive data type. SQL engines could apply predicates very 
> efficiently on these vectors, avoiding a single row going through all the 
> operators before the next row can be processed.
> As an efficient columnar data representation, it would be nice if Parquet 
> could support Vectorized APIs, so that all SQL engines could read vectors 
> from Parquet files, and do vectorized execution for Parquet File Format.
>  
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30


