[ 
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048024#comment-17048024
 ] 

ASF subversion and git services commented on IMPALA-9226:
---------------------------------------------------------

Commit c48efd407e7b857c6df0d167aafe02f93c81e2fb in impala's branch 
refs/heads/master from norbert.luksa
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c48efd4 ]

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one by one. Since ORC 1.6, the blob that
holds the pointed-to values is moved into the StringVectorBatch,
so we can copy the whole blob in a single step instead.
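
A minimal sketch of the bulk-copy idea, for illustration only. It does
not use the real ORC or Impala classes: FakeStringBatch, StringRef,
MakeBatch and CopyBatch are hypothetical stand-ins, and the only
assumption carried over from the text above is that every row points
into one contiguous blob.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for orc::StringVectorBatch (not the real class):
// every row is a pointer/length pair into one contiguous blob.
struct FakeStringBatch {
  std::vector<char> blob;
  std::vector<const char*> data;
  std::vector<int64_t> length;
};

// Hypothetical stand-in for a scanner-side string slot.
struct StringRef {
  const char* ptr;
  int64_t len;
};

// Helper for the sketch: packs the values into the blob, then points
// each row at its slice of the blob.
FakeStringBatch MakeBatch(const std::vector<std::string>& values) {
  FakeStringBatch batch;
  for (const std::string& v : values) {
    batch.blob.insert(batch.blob.end(), v.begin(), v.end());
  }
  std::size_t offset = 0;
  for (const std::string& v : values) {
    batch.data.push_back(batch.blob.data() + offset);
    batch.length.push_back(static_cast<int64_t>(v.size()));
    offset += v.size();
  }
  return batch;
}

// Copies the blob into 'pool' with one bulk copy, then rebases each row
// pointer onto the copy. No per-string allocation takes place.
std::vector<StringRef> CopyBatch(const FakeStringBatch& batch,
                                 std::vector<char>* pool) {
  *pool = batch.blob;  // the single bulk copy
  std::vector<StringRef> rows;
  rows.reserve(batch.data.size());
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    const int64_t offset = batch.data[i] - batch.blob.data();
    assert(offset >= 0 &&
           offset + batch.length[i] <=
               static_cast<int64_t>(batch.blob.size()));
    rows.push_back({pool->data() + offset, batch.length[i]});
  }
  return rows;
}
```

The rows returned by CopyBatch stay valid for as long as 'pool' is
neither modified nor destroyed, which is also why the copied strings
end up next to each other in memory.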

Besides the above improvement, this commit also enables the
LazyEncoding option for the ORC reader. With it, for stripes
with DICTIONARY_ENCODING[_V2], the EncodedStringVectorBatch holds
the data in a dictionaryBlob, from which the values can be read
using the given indices and lengths.
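
The dictionary lookup can be sketched roughly as follows. This is not
the ORC API: FakeDictionary, FakeEncodedBatch, DictStringRef,
MakeDictionary, Lookup and Decode are hypothetical names, and the
sketch assumes that each entry's length is derived from an array of
offsets into the dictionary blob.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string dictionary: all entries live in one
// blob, and dictionaryOffset has one more element than there are entries.
struct FakeDictionary {
  std::vector<char> dictionaryBlob;
  std::vector<int64_t> dictionaryOffset;
};

// Hypothetical stand-in for an encoded batch: each row stores only an
// index into the dictionary.
struct FakeEncodedBatch {
  FakeDictionary dictionary;
  std::vector<int64_t> index;
};

// Pointer/length view of one dictionary entry.
struct DictStringRef {
  const char* ptr;
  int64_t len;
};

// Helper for the sketch: appends the entries to the blob and records the
// end offset of each one.
FakeDictionary MakeDictionary(const std::vector<std::string>& entries) {
  FakeDictionary dict;
  dict.dictionaryOffset.push_back(0);
  for (const std::string& e : entries) {
    dict.dictionaryBlob.insert(dict.dictionaryBlob.end(), e.begin(), e.end());
    dict.dictionaryOffset.push_back(
        static_cast<int64_t>(dict.dictionaryBlob.size()));
  }
  return dict;
}

// Resolves one dictionary index to a view into the blob without copying.
DictStringRef Lookup(const FakeDictionary& dict, int64_t idx) {
  assert(idx >= 0 &&
         static_cast<std::size_t>(idx) + 1 < dict.dictionaryOffset.size());
  const int64_t start = dict.dictionaryOffset[idx];
  const int64_t end = dict.dictionaryOffset[idx + 1];
  return {dict.dictionaryBlob.data() + start, end - start};
}

// Materializes every row of the batch; used here only to check the lookups.
std::vector<std::string> Decode(const FakeEncodedBatch& batch) {
  std::vector<std::string> out;
  out.reserve(batch.index.size());
  for (int64_t idx : batch.index) {
    const DictStringRef ref = Lookup(batch.dictionary, idx);
    out.emplace_back(ref.ptr, static_cast<std::size_t>(ref.len));
  }
  return out;
}
```

With a layout like this, a repeated value such as a ship mode is stored
once in the blob, and each row costs only one index.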

Tests:
 * Ran ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on the tpch.lineitem table with scale=25,
   running queries that select the min of string columns.
   Some results:
   col_name       | encoding   | before | after  | speedup
   ========================================================
   l_comment      | DIRECT     | 16.42s | 14.38s | 14%
   l_shipinstruct | DICTIONARY | 5.26s  | 3.80s  | 32%
   l_commitdate   | DICTIONARY | 5.46s  | 5.19s  | 5%
   all string col | BOTH       | 39.06s | 32.18s | 21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also ran the TPC-H queries on the TPC-H benchmark, where the
   runtime of some queries improved by around 10-15%, while there
   was no regression for the others.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Reviewed-on: http://gerrit.cloudera.org:8080/15051
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Improve string allocations of the ORC scanner
> ---------------------------------------------
>
>                 Key: IMPALA-9226
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9226
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Norbert Luksa
>            Priority: Major
>              Labels: orc
>
> Currently the ORC scanner allocates new memory for each string value (except 
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides the excessive allocations and copying, this is also bad for memory 
> locality.
> Since ORC-501, StringVectorBatch has a member named 'blob' that contains the 
> strings in the batch: 
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer which is movable, so Impala might be able to get 
> ownership of it. Or, at least we could copy the whole blob array instead of 
> copying the strings one-by-one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move 
> the dictionary as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
