[ https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Norbert Luksa resolved IMPALA-9226.
-----------------------------------
    Fix Version/s: Impala 4.0
       Resolution: Fixed

> Improve string allocations of the ORC scanner
> ---------------------------------------------
>
>                 Key: IMPALA-9226
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9226
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Norbert Luksa
>            Priority: Major
>              Labels: orc
>             Fix For: Impala 4.0
>
>
> Currently the ORC scanner allocates new memory for each string value (except
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides the excessive allocations and copying, this is also bad for memory
> locality.
> Since ORC-501, StringVectorBatch has a member named 'blob' that contains the
> strings in the batch:
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer, which is movable, so Impala might be able to take
> ownership of it. Or, at least, we could copy the whole blob array at once
> instead of copying the strings one by one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move
> the dictionary as well.
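
The two options from the description (copy the whole blob once, or take ownership of it) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the change that resolved the issue: FakeStringBatch is a hypothetical stand-in that mimics the data/length/blob layout with std::vector instead of ORC's DataBuffer, and MakeBatch, OwnedStrings, CopyBlobOnce and TakeBlob are made-up names that exist in neither Impala nor ORC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a string batch: every value is a (pointer, length)
// pair pointing into one contiguous 'blob'. Not the real ORC type.
struct FakeStringBatch {
  std::vector<char> blob;
  std::vector<char*> data;
  std::vector<int64_t> length;
};

// Test helper (assumption): packs the values into one blob, then sets pointers.
FakeStringBatch MakeBatch(const std::vector<std::string>& values) {
  FakeStringBatch batch;
  for (const std::string& v : values) {
    batch.blob.insert(batch.blob.end(), v.begin(), v.end());
  }
  std::size_t offset = 0;
  for (const std::string& v : values) {
    batch.data.push_back(batch.blob.data() + offset);
    batch.length.push_back(static_cast<int64_t>(v.size()));
    offset += v.size();
  }
  return batch;
}

struct Slice {
  std::size_t offset;
  std::size_t len;
};

// Strings owned by the consumer: one buffer plus offsets into it.
struct OwnedStrings {
  std::vector<char> buffer;
  std::vector<Slice> slices;
  std::string Get(std::size_t i) const {
    return std::string(buffer.data() + slices[i].offset, slices[i].len);
  }
};

// Converts the batch's pointers into offsets relative to the start of the blob.
static std::vector<Slice> ToSlices(const FakeStringBatch& batch) {
  std::vector<Slice> slices;
  slices.reserve(batch.data.size());
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    const std::ptrdiff_t offset = batch.data[i] - batch.blob.data();
    const std::size_t len = static_cast<std::size_t>(batch.length[i]);
    assert(offset >= 0);
    assert(static_cast<std::size_t>(offset) + len <= batch.blob.size());
    slices.push_back({static_cast<std::size_t>(offset), len});
  }
  return slices;
}

// Option 1: one allocation and one bulk copy instead of one per string.
OwnedStrings CopyBlobOnce(const FakeStringBatch& batch) {
  OwnedStrings out;
  out.slices = ToSlices(batch);
  out.buffer = batch.blob;
  return out;
}

// Option 2: take ownership of the blob; no string bytes are copied.
OwnedStrings TakeBlob(FakeStringBatch&& batch) {
  OwnedStrings out;
  out.slices = ToSlices(batch);  // compute offsets before the blob is moved
  out.buffer = std::move(batch.blob);
  batch.data.clear();
  batch.length.clear();
  return out;
}
```

Storing offsets rather than raw pointers in OwnedStrings is a choice made for this sketch so the result stays valid if it is copied; a real scanner would more likely keep pointer/length pairs into the buffer it owns.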