Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18704 )

Change subject: IMPALA-11350: Add virtual column FILE__POSITION for Parquet 
tables
......................................................................


Patch Set 4:

(7 comments)

I quickly went through the code except the tests. Will make another round next 
week.

http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/base-sequence-scanner.cc@101
PS4, Line 101:   for (SlotDescriptor* sd : scan_node_->virtual_column_slots()) {
I see very similar code in multiple scanner types. Can you move this to a 
common base class?


http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/hdfs-text-scanner.cc@820
PS4, Line 820:   for (SlotDescriptor* sd : scan_node_->virtual_column_slots()) {
Just curious: can't we decide in analysis time that we're going to have some 
files in scanning that don't support FILE__POSITION?


http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.h@415
PS4, Line 415: When updated, and its value is N, it means that we already 
processed the Nth row
             :   /// completely.
Do I get it right that this also means that current_row_ starts from zero where 
-1 is for invalid value and zero is for the first row? Could you please also 
add this to the comment?


http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.h@667
PS4, Line 667: int64_t* pos, int64_t* file_pos
This approach with 2 mutually exclusive params seems weird for me. Shouldn't 
this be 2 separate functions, one for pos_slot and one for file_pos_slot?


http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.cc@576
PS4, Line 576: ReadPositionBatched<true>(rep_level, nullptr,
             :           
tuple->GetBigIntSlot(file_pos_slot_desc_->tuple_offset()));
Similary, if we always set one of the pos params null at the callsite and the 
logic is kind of different for each pos column inside the functions, shouldn't 
this be 2 separate functions?


http://gerrit.cloudera.org:8080/#/c/18704/4/be/src/exec/parquet/parquet-column-readers.cc@931
PS4, Line 931: true
At the callsite it's not that easy to figure out what this new bool parameter 
is for. If it's feasible I'd vote for split up the function for the 2 pos cols.


http://gerrit.cloudera.org:8080/#/c/18704/4/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java:

http://gerrit.cloudera.org:8080/#/c/18704/4/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java@186
PS4, Line 186:       addVirtualColumn(virtCol);
nit: this should fit into the line above.



--
To view, visit http://gerrit.cloudera.org:8080/18704
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Gerrit-Change-Number: 18704
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 Jul 2022 14:13:11 +0000
Gerrit-HasComments: Yes

Reply via email to