Kurt Deschler has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered 
out rows in Parquet table.
......................................................................


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc@25
PS3, Line 25: selected_rows
Can you make this a member of RowBatch or a devived classso you don't need to 
pass the 2 (related) arguments separately?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc@629
PS3, Line 629:     bool selected_rows[scratch_batch_->capacity];
Should check end of stack here or allocate memory if capacity is anything 
substantial. If not, comment/DCHECK that size is small. Non-issue if this moves 
into RowBatch.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@430
PS3, Line 430:   std::vector<ParquetColumnReader*> column_readers_;
Consider packaging these in a class if they are going to be passed together.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@928
PS3, Line 928:   vector<SlotId> GetSlotIdsForConjuncts(vector<ScalarExpr*> 
conjuncts);
Pass vector as reference to avoid copying.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>& filter_readers,
             :     vector<ParquetColumnReader*>& non_filter_readers)
> In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument
Declare function const.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@259
PS3, Line 259:   slot_ids.insert(std::begin(conjunct_slot_ids), 
std::end(conjunct_slot_ids));
use conjunct_slot_ids instead of making a copy.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2342
PS3, Line 2342:         // continue the old range as 'last' is within 
'skip_length' of last range.
How does it end up in this case?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2357
PS3, Line 2357:     batches[range].start = start;
Can we only get here if batch_size==0?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2388
PS3, Line 2388:           return Status(Substitute("Couldn't skip rows in file 
$0.", filename()));
Add comments how this can occur



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Oct 2021 17:09:17 +0000
Gerrit-HasComments: Yes

Reply via email to