Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/21761 )
Change subject: IMPALA-13364: Schema resolution doesn't work for migrated partitioned Iceberg tables that have complex types ...................................................................... Patch Set 4: (7 comments) Thanks for fixing this, Zoltan! http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc File be/src/exec/file-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@95 PS4, Line 95: file_metadata should we DCHECK if 'file_metadata' is not null http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@239 PS4, Line 239: int fileFieldID = *fieldID; 'fieldId' could be null. Add a DCHECK? http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@241 PS4, Line 241: using namespace org::apache::impala::fb; I think that the code between L241-L269 is not related to the actual fieldId you gave to this function and the same code runs multiple times but would give the same answer. I think we could extract it somewhere else and we could pass this function only the result of that code block. As I see the actual sourceId vector isn't relevant either after L271, just the first, last elements and the size. In fact once we verified that they are consecutive numbers, only the first element and the size matters. http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@248 PS4, Line 248: for (int i = 0; i < transforms->size(); ++i) { We don't need 'i' within the loop. Can this be a foreach? http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@251 PS4, Line 251: FbIcebergTransformType::FbIcebergTransformType_IDENTITY) { nit: too much indentation http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@280 PS4, Line 280: return Status(errorMsg); I think we can be more specific here with the error msg. We could let the user know that only the subset of the partitions are written into the data file. http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc@1022 PS4, Line 1022: // Partition columns are not stored in file metadata, but they get field IDs I'm just thinking about the feasibility of the comment I left in FileMetadatUtils. So we come to this part of the code for the children of complex types, and only if this is a migrated, partitioned Iceberg table, right? What if we collect such fields first, do the validations for the Partition fields once, and then adjust the fildIds? -- To view, visit http://gerrit.cloudera.org:8080/21761 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie32952021b63d6b55b8820489e434bfc2a91580b Gerrit-Change-Number: 21761 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Comment-Date: Fri, 13 Sep 2024 12:07:07 +0000 Gerrit-HasComments: Yes
