Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )
Change subject: IMPALA-9226: Improve string allocations of the ORC scanner ...................................................................... Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683 PS3, Line 683: > As I see (and as you also suggested offline), we do not need to pass the me It is still used for CollectionValueBuilder. http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@202 PS6, Line 202: if (orc_batch == nullptr) return Status::OK(); : batch_ = static_cast<orc::StringVectorBatch*>(orc_batch); Is it ok that batch_ is not set to null if orc_batch is null? http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@210 PS6, Line 210: InitBlob Does Orc support mixed encoding, e.g. a stripe that has both dictionary and direct encoded encoded chunks? In that case if there were already dictionary encoded vectors, then we "forget" the dictionary blob here. If it is possible that there will be more dictionary encoded chunks, then blob_ will still point to a direct encoded blob instead of the dictionary. According to https://orc.apache.org/specification/ORCv2/ the same encoding will be used for a column in the whole stripe, so the issue above is not possible, but the code suggests that encoding can vary from ColumnVectorBatch to ColumnVectorBatch. It would be clearer to add a state to OrcStringColumnReader that stores the encoding, and treat it as an error if it changes within a stripe. In Parquet encoding can vary from page to page, it generally starts with dictionary encoded pages and falls back to "plain" when the dictionary gets too large, so readers (like me) may assume that Orc works similarly. http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@211 PS6, Line 211: return Status::OK(); "RETURN_IF_ERROR" + "return Status::OK()" could be simplified to "return". -- To view, visit http://gerrit.cloudera.org:8080/15051 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f Gerrit-Change-Number: 15051 Gerrit-PatchSet: 6 Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 30 Jan 2020 15:20:45 +0000 Gerrit-HasComments: Yes