Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18704 )

Change subject: IMPALA-11350: Add virtual column FILE__POSITION for Parquet 
tables
......................................................................


Patch Set 10:

(9 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc@1042
PS8, Line 1042:     } else if (sd->virtual_column_type() == 
TVirtualColumnType::FILE_POSITION) {
> nit: you could do a 'continue here' instead of no-op.
Done


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h@820
PS8, Line 820:   /// row in this row group, it is used to track the file 
position of the rows.
> Could you comment the purpose of 'row_group_first_row' parameter?
Done


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@856
PS8, Line 856: i];
> Bit confused but shouldn't this index should be 'i' instead of 'group_idx_'
Ah, good catch, thanks, you saved a couple of debugging hours for us.. :)

Unfortunately the row groups of functional_parquet.lineitem_multiblocks have 
the same number of rows (100). I've added a new Parquet file 
lineitem_multiblocks_variable_num_rows and some tests.


http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h@174
PS8, Line 174: Only valid to call when doing non-batched reading
> This part might be misleading as this is also called in CollectionColumnRea
Yeah it's a bit confusing, but CollectionColumnReader::ReadValueBatch() invokes 
the non-batched interfaces of the column-readers (e.g. NextLevels(), 
ReadValue(), ReadSlots()). Plus already had the same comment for 
'ReadPositionNonBatched' in Base.


http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift@77
PS8, Line 77: enum TVirtualColumnType {
> Since TVirtualColumnType is optional in TPartialTableInfo I wonder if there
In the FE we use the NONE value at some places. It would be possible to use 
NULL values instead I think, but I'm not sure if it worth doing this in the 
context of this change.


http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift
File common/thrift/CatalogService.thrift:

http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift@471
PS8, Line 471:   4: optional list<CatalogObjects.TColumn> virtual_columns
> Was there any reason this new member came to the 4th position and not at th
It just seemed more logical to list it earlier than some other fields.


http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@417
PS8, Line 417:   private void addVirtualColumns() {
> This function seems the same as the one in HdfsTable. Can this be moved to
Non-FS tables also extend the Table class.

I don't really want to further complicate the class hierarchy with a new base 
class just because of this small method. Especially if we add different virtual 
columns to the different FS-tables. E.g. if we add SEQUENCE_NUMBER or some 
other Iceberg-specific column.


http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test
File 
testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test:

http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test@101
PS8, Line 101: ====
> Could you please add a test where file__position is present multiple times
Done


http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test
File 
testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test:

http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test@116
PS8, Line 116: ---- RESULTS
> Would it be a valid use case when two tables are joined and we get the file
Sure, added few queries.



--
To view, visit http://gerrit.cloudera.org:8080/18704
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Gerrit-Change-Number: 18704
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Aug 2022 17:39:17 +0000
Gerrit-HasComments: Yes

Reply via email to