Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/21269 )
Change subject: IMPALA-12973,IMPALA-11491,IMPALA-12651: Support BINARY nested in complex types in select list ...................................................................... Patch Set 1: (5 comments) looks good, one long comment about the original design of the JNI interface for Iceberg metadata reading http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc File be/src/exec/iceberg-metadata/iceberg-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@135 PS1, Line 135: // Skip the unsupported type and set it to NULL Are there still unsupported types that are intentionally nulled? How is this handled during planning, do we return a warning or error? http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@198 PS1, Line 198: if constexpr (IS_BINARY) { tl/dr: The binary handling seems good to me, but I have concerns about the string handling (and handling of all types in general):) My whole comment is pretty much out of scope, should have wrote this during the original implementation of IcebergRowReader. One issue is that char_sequence_to_string_ (which seems java.lang.String's toString method()) may not always return standard utf8 https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni I tried to test it with different Unicode strings and the only case when the output seemed weird was for \0 characters The other issue is generally about the way values are read from Iceberg API - if I understand correctly they are plumbed directly from Iceberg's Java api as Object without checking their type in Java. https://github.com/apache/impala/blob/5c003cdcda604c43dd836571db02afcfbdc05dbc/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java#L122 We make assumptions about about the exact types returned by Iceberg, and in a few minutes of googling and Iceberg doc reading I couldn't find what exactly these types are. For example for timestamp in line 108 the comment says org.apache.iceberg.types.TimestampType, while in line 175 we assume Long. I think that it would be much clearer to handle different types in Java first and throw an error there if the type does not match the expected one. This would also allow making conversions there, for example converting String to ByteBuffer, making the templating of this class unnecessary. http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@209 PS1, Line 209: uint32_t jbuffer_size = jbuffer_guard.get_size();; nit: extra ; http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/rpc/jni-thrift-util.h File be/src/rpc/jni-thrift-util.h: http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/rpc/jni-thrift-util.h@62 PS1, Line 62: return status; With JniByteArrayGuard we could just return DeserializeThriftMsg directly http://gerrit.cloudera.org:8080/#/c/21269/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test: http://gerrit.cloudera.org:8080/#/c/21269/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@780 PS1, Line 780: nit: extra lines? (line 791) -- To view, visit http://gerrit.cloudera.org:8080/21269 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b1d7fa332a901f05a46e0199e13fb841d2687c2 Gerrit-Change-Number: 21269 Gerrit-PatchSet: 1 Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Apr 2024 16:09:04 +0000 Gerrit-HasComments: Yes