Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683:
> As I see (and as you also suggested offline), we do not need to pass the me
It is still used for CollectionValueBuilder.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@202
PS6, Line 202:     if (orc_batch == nullptr) return Status::OK();
             :     batch_ = static_cast<orc::StringVectorBatch*>(orc_batch);
Is it ok that batch_ is not set to null if orc_batch is null?


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@210
PS6, Line 210: InitBlob
Does Orc support mixed encoding, e.g. a stripe that has both dictionary and 
direct encoded encoded chunks? In that case if there were already dictionary 
encoded vectors, then we "forget" the dictionary blob here. If it is possible 
that there will be more dictionary encoded chunks, then blob_ will still point 
to a direct encoded blob instead of the dictionary.

According to https://orc.apache.org/specification/ORCv2/ the same encoding will 
be used for a column in the whole stripe, so the issue above is not possible, 
but the code suggests that encoding can vary from ColumnVectorBatch to 
ColumnVectorBatch.

It would be clearer to add a state to OrcStringColumnReader that stores the 
encoding, and treat it as an error if it changes within a stripe. In Parquet 
encoding can vary from page to page, it generally starts with dictionary 
encoded pages and falls back to "plain" when the dictionary gets too large, so 
readers (like me) may assume that Orc works similarly.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@211
PS6, Line 211: return Status::OK();
"RETURN_IF_ERROR" + "return Status::OK()" could be simplified to "return".



--
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 15:20:45 +0000
Gerrit-HasComments: Yes

Reply via email to