Tamas Mate has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20010 )

Change subject: IMPALA-11996: Scanner change for Iceberg metadata querying
......................................................................


Patch Set 11:

(59 comments)

Hi Peter and Gabor, thank you for the detailed review, updated the change. Let 
me know your thoughts.

http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@21
PS10, Line 21: nested column types
> You still only mention struct type, however, other nested types are affecte
Done


http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@24
PS10, Line 24: Testing:
> I do recall something about a perf test to see how long it takes to query s
It was quite slow to build a table like that so I just skipped it for now.


http://gerrit.cloudera.org:8080/#/c/20010/10//COMMIT_MSG@25
PS10, Line 25:  - Added e2e tests for querying metadata tables
> I'd really like to see some auth tests as well to see that changing privile
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/exec-node.cc
File be/src/exec/exec-node.cc:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/exec-node.cc@55
PS9, Line 55: #include "exec/unnest-node.h"
> nit: could go to L41
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@33
PS10, Line 33: /// Iceberg API provides predefined metadata tables, these 
tables can be scanned through
> I really like this description of the ScanNode. Thanks!
Thanks!


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@35
PS10, Line 35: ner to scan Iceberg data, due to the virtua
> nit: "just like any other regular Iceberg tables"?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@39
PS10, Line 39: s
> nit: 'an'
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@69
PS10, Line 69: SED_RESULT;
> Would it cause any harm to call this multiple times? If yes, can we return
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@89
PS10, Line 89: scan_metadata_table_ = nullptr
> nit: this is the constructor, right? Can we name this 'iceberg_metadata_sca
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@23
PS9, Line 23: #include <jni.h>
> Unused include
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@35
PS9, Line 35: /// its Parquet scanner to scan Iceberg data, due to the virtual 
nature of the metadata
> nit: Parquet
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@41
PS9, Line 41:
> nit: ; not needed
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@71
PS10, Line 71:   if (tuple_desc_ == nullptr) {
> Could you include more info here for easier debugging, e.g. tuple Id?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@90
PS10, Line 90: d_global_ref;
> I see aboce that after calling this function you also called "RETURN_ERROR_
RETURN_ERROR_IF_EXC checks the environment, LocalToGlobalRef is a wrapper 
around the JNI NewGlobalRef and env is checked in it. At this point, in case 
there is an exception it is already an Impala Status.


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@131
PS10, Line 131:     row_batch->tuple_data_pool(
> You've already checked for nullness few lines above.
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@165
PS10, Line 165:
> You don't use 'env' in this function. Anyway, Frontend::GetCatalogTable() c
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@168
PS10, Line 168:   RETURN_IF_ERROR(fe->GetCatalogTable(table_name_, jtable));
> Should we check if 'jtable' is nullptr after this call or would we get an e
Good point, added a DCHECK to the GetCatalogTable, because LocalToGlobalRef 
wraps the null pointer.


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h
File be/src/exec/iceberg-metadata/iceberg-row-reader.h:

http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h@20
PS10, Line 20: // #include "exec/scan-node.h"
> Is this needed for the descriptor classes? As I see all of them are pointer
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.h@67
PS10, Line 67:   /// TupleDescriptor received from the ScanNode.
> This is a slot Id to accessor map, right? Could you please mention this in
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.h
File be/src/exec/iceberg-metadata/iceberg-row-reader.h:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.h@37
PS9, Line 37:   /// Initialize the tuple descriptor and accessors
> jaccessors could be passed as a reference
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc
File be/src/exec/iceberg-metadata/iceberg-row-reader.cc:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@128
PS9, Line 128:   jint result = env->CallIntMethod(accessed_value, int_value_);
> slot could be casted to int32_t*
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@181
PS9, Line 181:     return Status::OK();
> nit: static_cast may be a better approach here
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@186
PS9, Line 186:   RETURN_ERROR_IF_EXC(env);
> Could be size_t
Although, this should be technically the same, it did not work.


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@190
PS9, Line 190:   int str_len = strlen(str_guard.get());
> Could be nullptr
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc
File be/src/exec/iceberg-metadata/iceberg-row-reader.cc:

http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@30
PS10, Line 30: jaccessors_(jaccessor)
> This copies the map, right? I'm wondering if we can pass here a pointer to
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@67
PS10, Line 67:
> As this is a public method, an entry point to the class, I think we should
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@91
PS10, Line 91:         break;
> For a 1000 lines long metadata table that has multiple nested types this wo
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@99
PS10, Line 99: return Status::OK();
             : }
             :
             : Status IcebergRowReader::ReadBooleanValue(JNIEnv* env,
             :     SlotDescriptor* slot_desc, jobject struct_like_row, Tuple* 
tuple) {
             :   jobject accessed_value = 
env->CallObjectMethod(jaccessors_->at(slot_desc->col_pos()),
             :
> this part is the same for all the Read* function. Could you move it to a fu
The null check would still be necessary in every Read* with the current design 
and the method has to return Status. I think we could only save an error check 
effectively. I believe it is more readable like this, but I am open for 
refactor ideas.


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@108
PS10, Line 108: lot_des
> is a boolean value stored as an uint8 in the rowbatch?
nice catch, changed to bool* based on text-converter


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@126
PS10, Line 126:
> isn't this meant to be int32_t?
yes, done


http://gerrit.cloudera.org:8080/#/c/20010/10/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@178
PS10, Line 178: N_ERROR_IF_EXC(env
> is there a long conversion for string type as well?
This is actually a toString, renamed it and updated the comment.


http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/service/frontend.cc
File be/src/service/frontend.cc:

http://gerrit.cloudera.org:8080/#/c/20010/9/be/src/service/frontend.cc@258
PS9, Line 258: Status Frontend::GetCatalogTable(const TTableName& table_name, 
jobject* result) {
> GetCatalogTable could use the JniUtil::CallJniMethod function to fetch the
Good, point, refarctored.


http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift@683
PS10, Line 683: MetadataScanNod
> It's not clear that this is additional to what? I think this part of the co
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/common/thrift/PlanNodes.thrift@723
PS10, Line 723:   12: optional TKuduScanNode kudu_scan_node
> You skipped index 11 here
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@51
PS10, Line 51:     scanRangeSpecs_ = new TScanRangeSpec();
> drop line if not needed
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java
File fe/src/main/java/org/apache/impala/service/JniFrontend.java:

http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@621
PS10, Line 621: tableNamePar
> tableNameParam or something similar? and then we don't have to guess what i
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@623
PS10, Line 623: tableN
> nit: tableName?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/service/JniFrontend.java@627
PS10, Line 627:
> Can' we simply throw the DatabaseNotFoundException without converting it to
Removed the try-catch, the deserializeThrift throws an ImpalaException and 
DatabaseNotFoundException is an ImpalaException as well.


http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java
File fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java:

http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@37
PS9, Line 37:  *
> nit: javadoc code references should be formatted as {@code text}
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@39
PS9, Line 39:  * caller of {@code IcebergMetadataScanner}.
> nit: throws
Done


http://gerrit.cloudera.org:8080/#/c/20010/9/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@43
PS9, Line 43:   private FeIcebergTable iceTbl_ = null;
> nit: private static final
removed, unused


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java
File fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java:

http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@36
PS10, Line 36:
             :  *
> nit: by {IcebergMetadataScanNode} at the backend
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@39
PS10, Line 39: ceber
> throws
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@43
PS10, Line 43:   private FeIcebergTable iceTbl_ = null;
> not used
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@46
PS10, Line 46:   private Table metadataTable_ = null;
> Shouldn't these members be private?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@60
PS10, Line 60:     this.metadataTableName_ = metadataTableName;
> public?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@61
PS10, Line 61:   }
> Precondition check that FeTable is actually an FeIcebergTable? Or should we
Added a Precondition check and changed to FeIcebergTable.

Passing an IcebergApi table would mean that the frontend has to extract the 
IcebergApi table from the FeTable. I think that method should be more general.


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@71
PS10, Line 71:     metadataTable_ = 
MetadataTableUtils.createMetadataTableInstance(
> These methods should be public, right?
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@82
PS10, Line 82:
> nit: merge with line above
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@78
PS10, Line 78:   dataRowsIterator_ = dataTask.rows().iterator();
             :       if (dataRowsIterator_.hasNext()) break;
             :     }
             :   }
             :
             :   /**
             :    *
> I see a very similar code in GetNext(). Would it make sense to move this to
ScanMetadataTable just inits the dataRowsIterator_ it does not gets the 
element. It is only two occurrence, which feels okay for me.


http://gerrit.cloudera.org:8080/#/c/20010/10/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java@103
PS10, Line 103:       DataTask dataTask = 
(DataTask)fileScanTaskIterator_.next();
> nit: I think this fits into the line above.
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/datasets/functional/functional_schema_template.sql@a3772
PS10, Line 3772:
> Any reason you dropped 'iceberg_lineitem_sixblocks' in this patch?
It was an accident.


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test:

http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-metadata-table-scan.test@44
PS10, Line 44: select
> I don't think you have to explicitly write explain in planner tests.
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test:

http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@1
PS10, Line 1: l
> nit: trailing space
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@1
PS10, Line 1: test
> typo: tests
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@2
PS10, Line 2: additiona
> typo: additional
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@2
PS10, Line 2:  schemat
> typo
Done


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@313
PS10, Line 313: ---- QUERY
> Frankly, I think we could find more interesting Join scenarios than joining
Added an extra test with this scenario.


http://gerrit.cloudera.org:8080/#/c/20010/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@384
PS10, Line 384: describe functional_parquet.iceberg_query_metadata.snapshots;
> I'm wondering if a desribe would make sense even for metadata tables. We co
I thought I have already opened a Jira for this but I did not, I agree: 
IMPALA-12495



--
To view, visit http://gerrit.cloudera.org:8080/20010
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0e943cecd77f5ef7af7cd07e2b596f2c5b4331e7
Gerrit-Change-Number: 20010
Gerrit-PatchSet: 11
Gerrit-Owner: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Anonymous Coward <lipeng...@apache.org>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <g.furnst...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Sun, 15 Oct 2023 12:38:05 +0000
Gerrit-HasComments: Yes

Reply via email to