Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/18704 )
Change subject: IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables ...................................................................... Patch Set 10: (9 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc@1042 PS8, Line 1042: } else if (sd->virtual_column_type() == TVirtualColumnType::FILE_POSITION) { > nit: you could do a 'continue here' instead of no-op. Done http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h@820 PS8, Line 820: /// row in this row group, it is used to track the file position of the rows. > Could you comment the purpose of 'row_group_first_row' parameter? Done http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@856 PS8, Line 856: i]; > Bit confused but shouldn't this index should be 'i' instead of 'group_idx_' Ah, good catch, thanks, you saved a couple of debugging hours for us.. :) Unfortunately the row groups of functional_parquet.lineitem_multiblocks have the same number of rows (100). I've added a new Parquet file lineitem_multiblocks_variable_num_rows and some tests. http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h@174 PS8, Line 174: Only valid to call when doing non-batched reading > This part might be misleading as this is also called in CollectionColumnRea Yeah it's a bit confusing, but CollectionColumnReader::ReadValueBatch() invokes the non-batched interfaces of the column-readers (e.g. NextLevels(), ReadValue(), ReadSlots()). Plus already had the same comment for 'ReadPositionNonBatched' in Base. http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift@77 PS8, Line 77: enum TVirtualColumnType { > Since TVirtualColumnType is optional in TPartialTableInfo I wonder if there In the FE we use the NONE value at some places. It would be possible to use NULL values instead I think, but I'm not sure if it worth doing this in the context of this change. http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift File common/thrift/CatalogService.thrift: http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift@471 PS8, Line 471: 4: optional list<CatalogObjects.TColumn> virtual_columns > Was there any reason this new member came to the 4th position and not at th It just seemed more logical to list it earlier than some other fields. http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@417 PS8, Line 417: private void addVirtualColumns() { > This function seems the same as the one in HdfsTable. Can this be moved to Non-FS tables also extend the Table class. I don't really want to further complicate the class hierarchy with a new base class just because of this small method. Especially if we add different virtual columns to the different FS-tables. E.g. if we add SEQUENCE_NUMBER or some other Iceberg-specific column. http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test File testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test: http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test@101 PS8, Line 101: ==== > Could you please add a test where file__position is present multiple times Done http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test File testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test: http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test@116 PS8, Line 116: ---- RESULTS > Would it be a valid use case when two tables are joined and we get the file Sure, added few queries. -- To view, visit http://gerrit.cloudera.org:8080/18704 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7 Gerrit-Change-Number: 18704 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Aug 2022 17:39:17 +0000 Gerrit-HasComments: Yes