[ https://issues.apache.org/jira/browse/HIVE-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699820#comment-13699820 ]
Vinod Kumar Vavilapalli commented on HIVE-4160:
-----------------------------------------------
A huge +1 to that. Having a common set of operators will be a huge win. That
said, I already see that the current branch follows Hive's operator base
classes, uses HiveConf, etc. I believe that with a little effort this can be
cleaned up and pulled out into a separate Maven module that everyone can use.
Some points to think about:
- The target location of the module. The dependency graph can become
unwieldy.
- Given the use of the base Operator, OperatorDesc, etc. from Hive, if there
is interest and commitment, we should do this ASAP, while we only have a handful
of operators.
- Have one other project demonstrate how it can be reused across ecosystem
projects. Pig would be great; just a few operators will be a good start.
Thoughts?
> Vectorized Query Execution in Hive
> ----------------------------------
>
> Key: HIVE-4160
> URL: https://issues.apache.org/jira/browse/HIVE-4160
> Project: Hive
> Issue Type: New Feature
> Reporter: Jitendra Nath Pandey
> Assignee: Jitendra Nath Pandey
> Attachments: Hive-Vectorized-Query-Execution-Design.docx,
> Hive-Vectorized-Query-Execution-Design-rev2.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.pdf,
> Hive-Vectorized-Query-Execution-Design-rev4.docx,
> Hive-Vectorized-Query-Execution-Design-rev4.pdf,
> Hive-Vectorized-Query-Execution-Design-rev5.docx,
> Hive-Vectorized-Query-Execution-Design-rev5.pdf,
> Hive-Vectorized-Query-Execution-Design-rev6.docx,
> Hive-Vectorized-Query-Execution-Design-rev6.pdf,
> Hive-Vectorized-Query-Execution-Design-rev7.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.pdf,
> Hive-Vectorized-Query-Execution-Design-rev9.docx,
> Hive-Vectorized-Query-Execution-Design-rev9.pdf
>
>
> The Hive query execution engine currently processes one row at a time. A
> single row of data goes through all the operators before the next row can be
> processed. This mode of processing is very inefficient in terms of CPU usage.
> Research has demonstrated that this yields very low instructions per cycle
> [MonetDB X100]. Hive also currently relies heavily on lazy deserialization,
> and data columns go through a layer of object inspectors that identify the
> column type, deserialize the data, and determine the appropriate expression
> routines in the inner loop. These layers of virtual method calls further slow
> down processing.
> This work will add support for vectorized query execution to Hive, where,
> instead of individual rows, batches of about a thousand rows at a time are
> processed. Each column in the batch is represented as a vector of a primitive
> data type. The inner loop of execution scans these vectors very fast,
> avoiding method calls, deserialization, unnecessary if-then-else, etc. This
> substantially reduces CPU time used, and gives excellent instructions per
> cycle (i.e. improved processor pipeline utilization). See the attached design
> specification for more details.
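
The contrast drawn in the description above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the names
VectorizedSketch, RowExpression, and LongColumn, and the batch size of 1024,
are hypothetical and are not taken from Hive or from the attached design
document. The row-at-a-time path pays for an interface call and unboxing on
every row, while the vectorized path runs a tight loop over primitive arrays.

```java
public class VectorizedSketch {
    // Hypothetical batch size; the description only says "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** Row-at-a-time style: values are boxed Objects evaluated through an interface call. */
    interface RowExpression {
        Object evaluate(Object[] row);
    }

    /** Sums an expression over all rows, one row (and one virtual call) at a time. */
    static long sumRowAtATime(Object[][] rows, RowExpression expr) {
        long total = 0;
        for (Object[] row : rows) {
            total += (Long) expr.evaluate(row); // interface call + unboxing per row
        }
        return total;
    }

    /** One column of a batch, stored as a primitive vector (simplified, hypothetical). */
    static final class LongColumn {
        final long[] vector = new long[BATCH_SIZE];
    }

    /** Adds two columns element-wise over the first n entries in a tight loop. */
    static void addColumns(LongColumn a, LongColumn b, LongColumn out, int n) {
        long[] x = a.vector, y = b.vector, z = out.vector;
        for (int i = 0; i < n; i++) {
            z[i] = x[i] + y[i]; // no method calls, no boxing, no type checks
        }
    }

    /** Sums the first n entries of a column. */
    static long sum(LongColumn c, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += c.vector[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1000;
        Object[][] rows = new Object[n][];
        LongColumn a = new LongColumn();
        LongColumn b = new LongColumn();
        LongColumn out = new LongColumn();
        for (int i = 0; i < n; i++) {
            rows[i] = new Object[] {(long) i, 2L * i};
            a.vector[i] = i;
            b.vector[i] = 2L * i;
        }

        // Same computation (col0 + col1, then a sum) done both ways.
        long rowTotal = sumRowAtATime(rows, row -> (Long) row[0] + (Long) row[1]);
        addColumns(a, b, out, n);
        long vecTotal = sum(out, n);

        System.out.println(rowTotal + " " + vecTotal);
    }
}
```

Both paths compute the same total; the sketch shows only the shape of the
inner loops and makes no claim about measured speedups.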