Amogh Margoor has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered 
out rows in Parquet table.
......................................................................


Patch Set 5:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60
PS3, Line 60: bool* selected_rows
> Need to document the use of this new argument in the comment above.
I am thinking to push it to ScratchTupleBatch. If not will document it in next 
revision. I will keep this open for now.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.insert(std::end(conjuncts), 
std::begin(scan_node_->conjuncts()),
             :       std::end(scan_node_->conjuncts()));
             :   conjuncts.insert(std::end(conjuncts), 
std::begin(scan_node_->filter_exprs()),
> Can you please check if the code is sufficient to deal with min/max filters
Oh thanks for this info - didn't know. Will take a look at this and revert back.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330:
> By reading the code, my guess is that each batch covers a number of, up to
>> By reading the code, my guess is that each batch covers a >> number of, up 
>> to skip length, selected rows.
Actually skip_length is not the max length of batch. It is just the minimum gap 
between two ranges. If two clusters of True values have a gap less than 
skip_length we merge them into 1 cluster.

>> Is it possible to work with selected_rows directly?

Do you mean to say directly use selected rows instead of micro batches in 
FillScratchBatch ?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2342
PS3, Line 2342:         batches[range].start = start;
> How does it end up in this case?
I just realised you had to re-comment as versions moved ahead. Sorry for the 
trouble, was not intentional - I generally go back to all the comments in older 
versions too. I will paste response from comment in older version for 
completeness.
"It ends up here even for simple case like consecutive true values. But for 
more complicated cases: assume skip_length as 5 and batch of 10: TTTFFFTTTT - 
it will enter here for all true values except the first one."


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2357
PS3, Line 2357:   /// TTTTTTTTTT or even FFFFFTTTTT. In both cases we would 
need below.
> Can we only get here if batch_size==0?
commenting from earlier:
We will get here for pretty much everything except when batch_size is 0 or when 
all values in batch are false. I am assuming you meant to ask when will we hit 
the false branch here. However, I am removing condition (start = -1) and adding 
DCHECK to ensure batch sizes are not 0 and values are not false.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377
PS3, Line 2377: continue_execution
> should init this boolean, as the code below conditionally set it.
right, missed this one. will fix it in next revision. will keep this open for 
now.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if 
(UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in 
file $0.", filename()));
              :           }
              :         }
> Do we need an else branch to deal with micro_batches[0].start==0? If it is
It is possible to have micro_batches[0].start==0, but in that case we don't 
need any Skip call. Skip call is needed only when micro_batches[0].start > 0 
i.e., everything from 0 to micro_batches[0].start needs to be skipped. Hence 
else is not required, neither is DCHECK reqd.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2388
PS3, Line 2388:           return Status(Substitute("Couldn't skip rows in file 
$0.", filename()));
> Add comments how this can occur
have added comment to the SkipRow declaration for 'false' return value.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
> I wonder if a non Status::OK() object should return here.
This behaviour is retained from earlier code. You will find the code in 
AssembleRows in old code.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@427
PS3, Line 427: ReadNextDataPageHe
> Name it as ReadNextDataPageHeader()?
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@513
PS3, Line 513: /// is more than the rows left in current
> Missing comment?
Added it now.



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 15:37:57 +0000
Gerrit-HasComments: Yes

Reply via email to