Tamas Mate has posted comments on this change. ( http://gerrit.cloudera.org:8080/20010 )
Change subject: IMPALA-11996: Scanner change for Iceberg metadata querying ...................................................................... Patch Set 11: (59 comments) Hi Peter and Gabor, thank you for the detailed review, updated the change. Let me know your thoughts. http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@21 PS10, Line 21: nested column types > You still only mention struct type, however, other nested types are affecte Done http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@24 PS10, Line 24: Testing: > I do recall something about a perf test to see how long it takes to query s It was quite slow to build a table like that so I just skipped it for now. http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@25 PS10, Line 25: - Added e2e tests for querying metadata tables > I'd really like to see some auth tests as well to see that changing privile Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/exec-node.cc File be/src/exec/exec-node.cc: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/exec-node.cc@55 PS9, Line 55: #include "exec/unnest-node.h" > nit: could go to L41 Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h: http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@33 PS10, Line 33: /// Iceberg API provides predefined metadata tables, these tables can be scanned through > I really like this description of the ScanNode. Thanks! Thanks! http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@35 PS10, Line 35: ner to scan Iceberg data, due to the virtua > nit: "just like any other regular Iceberg tables"? Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@39 PS10, Line 39: s > nit: 'an' Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@69 PS10, Line 69: SED_RESULT; > Would it cause any harm to call this multiple times? If yes, can we return Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@89 PS10, Line 89: scan_metadata_table_ = nullptr > nit: this is the constructor, right? Can we name this 'iceberg_metadata_sca Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@23 PS9, Line 23: #include <jni.h> > Unused include Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@35 PS9, Line 35: /// its Parquet scanner to scan Iceberg data, due to the virtual nature of the metadata > nit: Parquet Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@41 PS9, Line 41: > nit: ; not needed Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@71 PS10, Line 71: if (tuple_desc_ == nullptr) { > Could you include more info here for easier debugging, e.g. tuple Id? Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@90 PS10, Line 90: d_global_ref; > I see aboce that after calling this function you also called "RETURN_ERROR_ RETURN_ERROR_IF_EXC checks the environment, LocalToGlobalRef is a wrapper around the JNI NewGlobalRef and env is checked in it. At this point, in case there is an exception it is already an Impala Status. http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@131 PS10, Line 131: row_batch->tuple_data_pool( > You've already checked for nullness few lines above. Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@165 PS10, Line 165: > You don't use 'env' in this function. Anyway, Frontend::GetCatalogTable() c Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@168 PS10, Line 168: RETURN_IF_ERROR(fe->GetCatalogTable(table_name_, jtable)); > Should we check if 'jtable' is nullptr after this call or would we get an e Good point, added a DCHECK to the GetCatalogTable, because LocalToGlobalRef wraps the null pointer. http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h File be/src/exec/iceberg-metadata/iceberg-row-reader.h: http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h@20 PS10, Line 20: // #include "exec/scan-node.h" > Is this needed for the descriptor classes? As I see all of them are pointer Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h@67 PS10, Line 67: /// TupleDescriptor received from the ScanNode. > This is a slot Id to accessor map, right? Could you please mention this in Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.h File be/src/exec/iceberg-metadata/iceberg-row-reader.h: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.h@37 PS9, Line 37: /// Initialize the tuple descriptor and accessors > jaccessors could be passed as a reference Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc File be/src/exec/iceberg-metadata/iceberg-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@128 PS9, Line 128: jint result = env->CallIntMethod(accessed_value, int_value_); > slot could be casted to int32_t* Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@181 PS9, Line 181: return Status::OK(); > nit: static_cast may be a better approach here Done http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@186 PS9, Line 186: RETURN_ERROR_IF_EXC(env); > Could be size_t Although, this should be technically the same, it did not work. http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@190 PS9, Line 190: int str_len = strlen(str_guard.get()); > Could be nullptr Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc File be/src/exec/iceberg-metadata/iceberg-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@30 PS10, Line 30: jaccessors_(jaccessor) > This copies the map, right? I'm wondering if we can pass here a pointer to Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@67 PS10, Line 67: > As this is a public method, an entry point to the class, I think we should Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@91 PS10, Line 91: break; > For a 1000 lines long metadata table that has multiple nested types this wo Done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@99 PS10, Line 99: return Status::OK(); : } : : Status IcebergRowReader::ReadBooleanValue(JNIEnv* env, : SlotDescriptor* slot_desc, jobject struct_like_row, Tuple* tuple) { : jobject accessed_value = env->CallObjectMethod(jaccessors_->at(slot_desc->col_pos()), : > this part is the same for all the Read* function. Could you move it to a fu The null check would still be necessary in every Read* with the current design and the method has to return Status. I think we could only save an error check effectively. I believe it is more readable like this, but I am open for refactor ideas. http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@108 PS10, Line 108: lot_des > is a boolean value stored as an uint8 in the rowbatch? nice catch, changed to bool* based on text-converter http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@126 PS10, Line 126: > isn't this meant to be int32_t? yes, done http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@178 PS10, Line 178: N_ERROR_IF_EXC(env > is there a long conversion for string type as well? This is actually a toString, renamed it and updated the comment. http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/service/frontend.cc File be/src/service/frontend.cc: http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/service/frontend.cc@258 PS9, Line 258: Status Frontend::GetCatalogTable(const TTableName& table_name, jobject* result) { > GetCatalogTable could use the JniUtil::CallJniMethod function to fetch the Good, point, refarctored. http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift@683 PS10, Line 683: MetadataScanNod > It's not clear that this is additional to what? I think this part of the co Done http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift@723 PS10, Line 723: 12: optional TKuduScanNode kudu_scan_node > You skipped index 11 here Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java: http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@51 PS10, Line 51: scanRangeSpecs_ = new TScanRangeSpec(); > drop line if not needed Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java File fe/src/main/java/org/apache/impala/service/JniFrontend.java: http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@621 PS10, Line 621: tableNamePar > tableNameParam or something similar? and then we don't have to guess what i Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@623 PS10, Line 623: tableN > nit: tableName? Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@627 PS10, Line 627: > Can' we simply throw the DatabaseNotFoundException without converting it to Removed the try-catch, the deserializeThrift throws an ImpalaException and DatabaseNotFoundException is an ImpalaException as well. http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java File fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java: http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@37 PS9, Line 37: * > nit: javadoc code references should be formatted as {@code text} Done http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@39 PS9, Line 39: * caller of {@code IcebergMetadataScanner}. > nit: throws Done http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@43 PS9, Line 43: private FeIcebergTable iceTbl_ = null; > nit: private static final removed, unused http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java File fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java: http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@36 PS10, Line 36: : * > nit: by {IcebergMetadataScanNode} at the backend Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@39 PS10, Line 39: ceber > throws Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@43 PS10, Line 43: private FeIcebergTable iceTbl_ = null; > not used Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@46 PS10, Line 46: private Table metadataTable_ = null; > Shouldn't these members be private? Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@60 PS10, Line 60: this.metadataTableName_ = metadataTableName; > public? Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@61 PS10, Line 61: } > Precondition check that FeTable is actually an FeIcebergTable? Or should we Added a Precondition check and changed to FeIcebergTable. Passing an IcebergApi table would mean that the frontend has to extract the IcebergApi table from the FeTable. I think that method should be more general. http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@71 PS10, Line 71: metadataTable_ = MetadataTableUtils.createMetadataTableInstance( > These methods should be public, right? Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@82 PS10, Line 82: > nit: merge with line above Done http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@78 PS10, Line 78: dataRowsIterator_ = dataTask.rows().iterator(); : if (dataRowsIterator_.hasNext()) break; : } : } : : /** : * > I see a very similar code in GetNext(). Would it make sense to move this to ScanMetadataTable just inits the dataRowsIterator_ it does not gets the element. It is only two occurrence, which feels okay for me. http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@103 PS10, Line 103: DataTask dataTask = (DataTask)fileScanTaskIterator_.next(); > nit: I think this fits into the line above. Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/datasets/functional/functional_schema_template.sql@a3772 PS10, Line 3772: > Any reason you dropped 'iceberg_lineitem_sixblocks' in this patch? It was an accident. http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test: http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test@44 PS10, Line 44: select > I don't think you have to explicitly write explain in planner tests. Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test: http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@1 PS10, Line 1: l > nit: trailing space Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@1 PS10, Line 1: test > typo: tests Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@2 PS10, Line 2: additiona > typo: additional Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@2 PS10, Line 2: schemat > typo Done http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@313 PS10, Line 313: ---- QUERY > Frankly, I think we could find more interesting Join scenarios than joining Added an extra test with this scenario. http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@384 PS10, Line 384: describe functional_parquet.iceberg_query_metadata.snapshots; > I'm wondering if a desribe would make sense even for metadata tables. We co I thought I have already opened a Jira for this but I did not, I agree: IMPALA-12495 -- To view, visit http://gerrit.cloudera.org:8080/20010 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0e943cecd77f5ef7af7cd07e2b596f2c5b4331e7 Gerrit-Change-Number: 20010 Gerrit-PatchSet: 11 Gerrit-Owner: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Anonymous Coward <lipeng...@apache.org> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Gergely Fürnstáhl <g.furnst...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Sun, 15 Oct 2023 12:38:05 +0000 Gerrit-HasComments: Yes