[jira] [Updated] (IMPALA-9226) Improve string allocations of the ORC scanner

2020-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-9226:
---
Fix Version/s: Impala 3.4.0

> Improve string allocations of the ORC scanner
> -
>
> Key: IMPALA-9226
> URL: https://issues.apache.org/jira/browse/IMPALA-9226
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Assignee: Norbert Luksa
>Priority: Major
>  Labels: orc
> Fix For: Impala 4.0, Impala 3.4.0
>
>
> Currently the ORC scanner allocates new memory for each string value (except 
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides causing too many allocations and copies, this is also bad for memory 
> locality.
> Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
> strings in the batch: 
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer which is movable, so Impala might be able to get 
> ownership of it. Or, at least we could copy the whole blob array instead of 
> copying the strings one-by-one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move 
> the dictionary as well.
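
A minimal sketch of the "copy the whole blob once" idea from the description, under stated assumptions. It does not use the ORC or Impala headers: MockStringBatch, StringRef, CopiedStrings, MakeBatch and CopyWholeBlob are hypothetical stand-ins invented for illustration. The only thing taken from the description is the layout, where every string of a batch lives in one contiguous 'blob' and is addressed by a pointer and a length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string batch whose values all live in one
// contiguous buffer: data[i] points into 'blob', length[i] is the size.
struct MockStringBatch {
  std::vector<char> blob;
  std::vector<const char*> data;
  std::vector<int64_t> length;
};

// Hypothetical stand-in for a non-owning string slot (pointer + length).
struct StringRef {
  const char* ptr;
  int64_t len;
};

// Result of a bulk copy: one owned pool plus one reference per value.
struct CopiedStrings {
  std::vector<char> pool;
  std::vector<StringRef> refs;
};

// Test helper: packs the values back to back into a single blob.
MockStringBatch MakeBatch(const std::vector<std::string>& values) {
  MockStringBatch batch;
  for (const std::string& v : values) {
    batch.blob.insert(batch.blob.end(), v.begin(), v.end());
    batch.length.push_back(static_cast<int64_t>(v.size()));
  }
  // Take pointers only after the blob has stopped growing.
  const char* p = batch.blob.data();
  for (int64_t len : batch.length) {
    batch.data.push_back(p);
    p += len;
  }
  return batch;
}

// One allocation and one bulk copy for the whole batch, instead of one
// allocation per value. Each pointer is then rebased onto the copy.
CopiedStrings CopyWholeBlob(const MockStringBatch& batch) {
  CopiedStrings out;
  out.pool = batch.blob;
  out.refs.reserve(batch.data.size());
  const char* src_base = batch.blob.data();
  const char* dst_base = out.pool.data();
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    const std::ptrdiff_t offset = batch.data[i] - src_base;
    out.refs.push_back({dst_base + offset, batch.length[i]});
  }
  return out;
}
```

Because the copied strings stay adjacent in 'pool', a later pass over 'refs' reads memory sequentially, which is the locality benefit the description refers to. Taking ownership of the buffer instead of copying it would avoid even the single copy, but whether that is possible depends on the real DataBuffer type and is not shown here.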



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-9226) Improve string allocations of the ORC scanner

2019-12-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-9226:
--
Description: 
Currently the ORC scanner allocates new memory for each string value (except 
for fixed-size strings):

[https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]

Besides causing too many allocations and copies, this is also bad for memory locality.

Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
strings in the batch: 
[https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]

'blob' has type DataBuffer which is movable, so Impala might be able to get 
ownership of it. Or, at least we could copy the whole blob array instead of 
copying the strings one-by-one.

ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
1.5.5.

ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

[https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]

It uses dictionary encoding for storing the values. Impala could copy/move the 
dictionary as well.

  was:
Currently the ORC scanner allocates new memory for each string values (except 
for fixed size strings):

https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172

Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
strings in the batch: 
[https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]

'blob' has type DataBuffer which is movable, so Impala might be able to get 
ownership of it. Or, at least we could copy the whole blob array instead of 
copying the strings one-by-one.

ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
1.5.5.

ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

[https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]

It uses dictionary encoding for storing the values. Impala could copy/move the 
dictionary as well.


> Improve string allocations of the ORC scanner
> -
>
> Key: IMPALA-9226
> URL: https://issues.apache.org/jira/browse/IMPALA-9226
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: orc
>
> Currently the ORC scanner allocates new memory for each string value (except 
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides causing too many allocations and copies, this is also bad for memory 
> locality.
> Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
> strings in the batch: 
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer which is movable, so Impala might be able to get 
> ownership of it. Or, at least we could copy the whole blob array instead of 
> copying the strings one-by-one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move 
> the dictionary as well.






[jira] [Updated] (IMPALA-9226) Improve string allocations of the ORC scanner

2019-12-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/IMPALA-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-9226:
--
Description: 
Currently the ORC scanner allocates new memory for each string value (except 
for fixed-size strings):

[https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]

Besides causing too many allocations and copies, this is also bad for memory locality.

Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
strings in the batch: 
[https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]

'blob' has type DataBuffer which is movable, so Impala might be able to get 
ownership of it. Or, at least we could copy the whole blob array instead of 
copying the strings one-by-one.

ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
1.5.5.

ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

[https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]

It uses dictionary encoding for storing the values. Impala could copy/move the 
dictionary as well.

  was:
Currently the ORC scanner allocates new memory for each string values (except 
for fixed size strings):

[https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]

Besides the two many allocations and copying it's also bad for memory locality.

Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
strings in the batch: 
[https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]

'blob' has type DataBuffer which is movable, so Impala might be able to get 
ownership of it. Or, at least we could copy the whole blob array instead of 
copying the strings one-by-one.

ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
1.5.5.

ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

[https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]

It uses dictionary encoding for storing the values. Impala could copy/move the 
dictionary as well.


> Improve string allocations of the ORC scanner
> -
>
> Key: IMPALA-9226
> URL: https://issues.apache.org/jira/browse/IMPALA-9226
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: orc
>
> Currently the ORC scanner allocates new memory for each string value (except 
> for fixed-size strings):
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172]
> Besides causing too many allocations and copies, this is also bad for memory 
> locality.
> Since ORC-501 StringVectorBatch has a member named 'blob' that contains the 
> strings in the batch: 
> [https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126]
> 'blob' has type DataBuffer which is movable, so Impala might be able to get 
> ownership of it. Or, at least we could copy the whole blob array instead of 
> copying the strings one-by-one.
> ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 
> 1.5.5.
> ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:
> [https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153]
> It uses dictionary encoding for storing the values. Impala could copy/move 
> the dictionary as well.


