Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 3: (9 comments) http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc File be/src/exec/hdfs-columnar-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc@25 PS3, Line 25: selected_rows Can you make this a member of RowBatch or a devived classso you don't need to pass the 2 (related) arguments separately? http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc@629 PS3, Line 629: bool selected_rows[scratch_batch_->capacity]; Should check end of stack here or allocate memory if capacity is anything substantial. If not, comment/DCHECK that size is small. Non-issue if this moves into RowBatch. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@430 PS3, Line 430: std::vector<ParquetColumnReader*> column_readers_; Consider packaging these in a class if they are going to be passed together. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@928 PS3, Line 928: vector<SlotId> GetSlotIdsForConjuncts(vector<ScalarExpr*> conjuncts); Pass vector as reference to avoid copying. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250 PS3, Line 250: vector<ParquetColumnReader*>& filter_readers, : vector<ParquetColumnReader*>& non_filter_readers) > In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument Declare function const. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@259 PS3, Line 259: slot_ids.insert(std::begin(conjunct_slot_ids), std::end(conjunct_slot_ids)); use conjunct_slot_ids instead of making a copy. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2342 PS3, Line 2342: // continue the old range as 'last' is within 'skip_length' of last range. How does it end up in this case? http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2357 PS3, Line 2357: batches[range].start = start; Can we only get here if batch_size==0? http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2388 PS3, Line 2388: return Status(Substitute("Couldn't skip rows in file $0.", filename())); Add comments how this can occur -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 3 Gerrit-Owner: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 05 Oct 2021 17:09:17 +0000 Gerrit-HasComments: Yes