Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 1: (13 comments) http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc File be/src/exec/hdfs-columnar-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc@25 PS1, Line 25: selected_rows Can you make this a member of RowBatch or a devived classso you don't need to pass the 2 (related) arguments separately? http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc@629 PS1, Line 629: bool selected_rows[scratch_batch_->capacity]; Should check end of stack here or allocate memory if capacity is anything substantial. If not, comment/DCHECK that size is small. Non-issue if this moves into RowBatch. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@431 PS1, Line 431: /// Column readers among 'column_readers_' used for filtering Consider packaging these in a class if they are going to be passed together. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@911 PS1, Line 911: vector<SlotId> GetSlotIdsForConjuncts(vector<ScalarExpr*> conjuncts); Pass vector as reference to avoid copying. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@246 PS1, Line 246: vector<ParquetColumnReader*>& non_filter_readers) { Declare function const. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@254 PS1, Line 254: slot_ids.insert(std::begin(conjunct_slot_ids), std::end(conjunct_slot_ids)); use conjunct_slot_ids instead of making a copy. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2147 PS1, Line 2147: // Old logic Old vs new not meaningful comment here. This doesn't look like the old logic.. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2148 PS1, Line 2148: ScratchMicroBatch batches[1]; change array[1] to single variable. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2151 PS1, Line 2151: batches[0].length = scratch_batch_->capacity; Probably best to leave the old logic in-tact here both to avoid any potential performance regressions and so that the new logic can be completely disabled via parameter. Plus calling the new interface for a single batch seems clunky. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2166 PS1, Line 2166: ScratchMicroBatch batches[1]; Move this new logic into a function as well. http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2238 PS1, Line 2238: // continue the old range as 'last' is within 'skip_length' of last range. How does it end up in this case? http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253 PS1, Line 2253: batches[range].start = start; Can we only get here if batch_size==0? http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2275 PS1, Line 2275: return Status(Substitute("Couldn't skip rows in file $0.", filename())); Add comments how this can occur -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 1 Gerrit-Owner: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Comment-Date: Wed, 29 Sep 2021 18:07:13 +0000 Gerrit-HasComments: Yes