[Impala-ASF-CR] IMPALA-12973,IMPALA-11491,IMPALA-12651: Support BINARY nested in complex types in select list

Csaba Ringhofer (Code Review) Tue, 09 Apr 2024 09:13:14 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21269 )


Change subject: IMPALA-12973,IMPALA-11491,IMPALA-12651: Support BINARY nested 
in complex types in select list
......................................................................


Patch Set 1:

(5 comments)

looks good, one long comment about the original design of the JNI interface for 
Iceberg metadata reading

http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc
File be/src/exec/iceberg-metadata/iceberg-row-reader.cc:

http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@135
PS1, Line 135:       // Skip the unsupported type and set it to NULL
Are there still unsupported types that are intentionally nulled? How is this 
handled during planning, do we return a warning or error?


http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@198
PS1, Line 198:   if constexpr (IS_BINARY) {
tl/dr: The binary handling seems good to me, but I have concerns about the 
string handling (and handling of all types in general):)

My whole comment is pretty much out of scope, should have wrote this during the 
original implementation of IcebergRowReader.

One issue is that char_sequence_to_string_ (which seems java.lang.String's 
toString method()) may not always return standard utf8
https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
I tried to test it with different Unicode strings and the only case when the 
output seemed weird was for \0 characters

The other issue is generally about the way values are read from Iceberg API - 
if I understand correctly they are plumbed directly from Iceberg's Java api as 
Object without checking their type in Java. 
https://github.com/apache/impala/blob/5c003cdcda604c43dd836571db02afcfbdc05dbc/fe/src/main/java/org/apache/impala/util/IcebergMetadataScanner.java#L122

We make assumptions about about the exact types returned by Iceberg, and in a 
few minutes of googling and Iceberg doc reading I couldn't find what exactly 
these types are. For example for timestamp in line 108 the comment says 
org.apache.iceberg.types.TimestampType, while in line 175 we assume Long. I 
think that it would be much clearer to handle different types in Java first and 
throw an error there if the type does not match the expected one. This would 
also allow making conversions there, for example converting String to 
ByteBuffer, making the templating of this class unnecessary.


http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/exec/iceberg-metadata/iceberg-row-reader.cc@209
PS1, Line 209:   uint32_t jbuffer_size = jbuffer_guard.get_size();;
nit: extra ;


http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/rpc/jni-thrift-util.h
File be/src/rpc/jni-thrift-util.h:

http://gerrit.cloudera.org:8080/#/c/21269/1/be/src/rpc/jni-thrift-util.h@62
PS1, Line 62:   return status;
With JniByteArrayGuard we could just return DeserializeThriftMsg directly


http://gerrit.cloudera.org:8080/#/c/21269/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test:

http://gerrit.cloudera.org:8080/#/c/21269/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@780
PS1, Line 780:
nit: extra lines? (line 791)



--
To view, visit http://gerrit.cloudera.org:8080/21269
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b1d7fa332a901f05a46e0199e13fb841d2687c2
Gerrit-Change-Number: 21269
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2024 16:09:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12973,IMPALA-11491,IMPALA-12651: Support BINARY nested in complex types in select list

Reply via email to