Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/18704 )
Change subject: IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables ...................................................................... Patch Set 8: (10 comments) http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/hdfs-scan-node-base.cc@1042 PS8, Line 1042: } else if (sd->virtual_column_type() == TVirtualColumnType::FILE_POSITION) { nit: you could do a 'continue here' instead of no-op. http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.h@820 PS8, Line 820: Status InitScalarColumns(int64_t row_group_first_row) WARN_UNUSED_RESULT; Could you comment the purpose of 'row_group_first_row' parameter? http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@856 PS8, Line 856: group_idx_ Bit confused but shouldn't this index should be 'i' instead of 'group_idx_'? This would only work if all the row group below the index 'group_idx_' has the same number of rows. If this is the case then I think the loop isn't needed. http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/18704/8/be/src/exec/parquet/parquet-column-readers.h@174 PS8, Line 174: Only valid to call when doing non-batched reading This part might be misleading as this is also called in CollectionColumnReader::ReadValueBatch() http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogObjects.thrift@77 PS8, Line 77: enum TVirtualColumnType { Since TVirtualColumnType is optional in TPartialTableInfo I wonder if there is any use of the "NONE" value of this enum. http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift File common/thrift/CatalogService.thrift: http://gerrit.cloudera.org:8080/#/c/18704/8/common/thrift/CatalogService.thrift@471 PS8, Line 471: 4: optional list<CatalogObjects.TColumn> virtual_columns Was there any reason this new member came to the 4th position and not at the end? http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/18704/8/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@417 PS8, Line 417: private void addVirtualColumns() { This function seems the same as the one in HdfsTable. Can this be moved to a common parent class, e.g. Table? (Or to a new common parent class, something like TableWithVirtualColumns?) http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test File testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test: http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/mixing-virtual-columns.test@101 PS8, Line 101: ==== Could you please add a test where file__position is present multiple times in the select list? http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test File testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test: http://gerrit.cloudera.org:8080/#/c/18704/8/testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test@116 PS8, Line 116: ==== Would it be a valid use case when two tables are joined and we get the file__position from both them (table1.file__position, table2.file__position)? http://gerrit.cloudera.org:8080/#/c/18704/8/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/18704/8/tests/query_test/test_scanners.py@160 PS8, Line 160: TODO nit: addig a Jira ID here would be nice. -- To view, visit http://gerrit.cloudera.org:8080/18704 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7 Gerrit-Change-Number: 18704 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Aug 2022 11:53:28 +0000 Gerrit-HasComments: Yes