Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683:  dst_batch->tuple_data_pool()
As we have discussed offline, this is no longer correct. Using the current 
batch's mempool means that the memory can be reclaimed by the consumer of this 
row batch (the caller of this node's GetNext()) after the current GetNext() 
call returned. So it cannot be used in future batches.

Before your optimization, only those string were allocated here, which were 
used in the current row batch, so there was no problem. Now if batch_one starts 
the read rows from the Orc vector, and batch_two finishes it, the blob will 
belong to batch_one, so it can be already freed/reused when batch_two starts to 
use it.

The solution is to give the scanner its own mempool, use that to initialize the 
blobs, and attach the blobs to the last row batch that reads rows from the 
given vector.

For Parquet this is mempool is 
https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/parquet-column-chunk-reader.h#L140

Parquet dictionaries use a different pool: 
https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/hdfs-parquet-scanner.h#L430

Doing this in Orc will be much simpler, as the library hides 
compression/decoding of data, but the way buffers are attached to row batches 
should be the same.



--
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Jan 2020 16:24:07 +0000
Gerrit-HasComments: Yes

Reply via email to