Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18704 )

Change subject: IMPALA-11350: Add virtual column FILE__POSITION for Parquet 
tables
......................................................................


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc@1042
PS8, Line 1042:     } else if (sd->virtual_column_type() == 
TVirtualColumnType::FILE_POSITION) {
nit: you could do a 'continue here' instead of no-op.


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h@820
PS8, Line 820:   Status InitScalarColumns(int64_t row_group_first_row) 
WARN_UNUSED_RESULT;
Could you comment the purpose of 'row_group_first_row' parameter?


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@856
PS8, Line 856: group_idx_
Bit confused but shouldn't this index should be 'i' instead of 'group_idx_'?
This would only work if all the row group below the index 'group_idx_' has the 
same number of rows. If this is the case then I think the loop isn't needed.


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h@174
PS8, Line 174: Only valid to call when doing non-batched reading
This part might be misleading as this is also called in 
CollectionColumnReader::ReadValueBatch()


http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift@77
PS8, Line 77: enum TVirtualColumnType {
Since TVirtualColumnType is optional in TPartialTableInfo I wonder if there is 
any use of the "NONE" value of this enum.


http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift
File common/thrift/CatalogService.thrift:

http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift@471
PS8, Line 471:   4: optional list<CatalogObjects.TColumn> virtual_columns
Was there any reason this new member came to the 4th position and not at the 
end?


http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@417
PS8, Line 417:   private void addVirtualColumns() {
This function seems the same as the one in HdfsTable. Can this be moved to a 
common parent class, e.g. Table? (Or to a new common parent class, something 
like TableWithVirtualColumns?)


http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test
File 
testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test:

http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test@101
PS8, Line 101: ====
Could you please add a test where file__position is present multiple times in 
the select list?


http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test
File 
testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test:

http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test@116
PS8, Line 116: ====
Would it be a valid use case when two tables are joined and we get the 
file__position from both them (table1.file__position, table2.file__position)?


http://gerrit.cloudera.org:8080/#/c/18704/8/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/18704/8/tests/query_test/test_scanners.py@160
PS8, Line 160: TODO
nit: addig a Jira ID here would be nice.



--
To view, visit http://gerrit.cloudera.org:8080/18704
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Gerrit-Change-Number: 18704
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Aug 2022 11:53:28 +0000
Gerrit-HasComments: Yes

Reply via email to