Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19043 )

Change subject: IMPALA-11591: Avoid calling planFiles() on Iceberg tables
......................................................................


Patch Set 3:

(7 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@567
PS2, Line 567:     table, new ArrayList<>(), /*timeTravelSpecl=*/null);
             :       for (ContentFile contentFile : 
Iterables.concat(allFiles.first, allFiles.second)) {
             :         addContentFileToFileStore(contentFile, fileStore, table, 
hdfsFileDescMap);
             :       }
             :       return fileStore;
             :     }
             :
             :     private static void addContentFileToFileStore(ContentFile 
contentFile,
             :         IcebergContentFileStore fileStore, IcebergTable table,
             :         Map<String, HdfsPartition.FileDescriptor> 
hdfsFileDescMap) throws IOException {
             :       String pathHash = IcebergUtil.getFilePathHash(contentFile);
             :       H
> nit: The logic of these two for loops is similar, so would it be better if
Refactored the code a bit.


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@44
PS2, Line 44:  // List of Iceberg data files (doesn't include delete files)
            :   private final List<FileDescriptor> dataFiles_ = new ArrayList<
> nit: We can add a comment so that others can understand them quickly.
Added comments.


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@48
PS2, Line 48: private final List<Fi
> We already initialized the member variable at declaration moment and will n
Added final declarations.


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106
PS2, Line 106: etworkAddresses,
> The Field Requiredness of 'TIcebergContentFileStore.path_hash_to_data_file
Done


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@108
PS2, Line 108: et = new IcebergContentFileStore();
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergPositionDeleteTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergPositionDeleteTable.java:

http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/catalog/IcebergPositionDeleteTable.java@119
PS2, Line 119:
             :   @Override
> This is OK, but I think 'throw new NotImplementedException("msg");' is much
Done


http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19043/2/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@471
PS2, Line 471: getDefaultPartitionSpec
> Whether time-travel should be considered? If this gets the most recent 'Par
Seems like we can't retrieve the default partition spec for a snapshot:
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Snapshot.java

We can only get the partition spec from the table object:
https://github.com/apache/iceberg/blob/7c2fcccf4a033ce620a254fae4ddc92cdc10bf85/api/src/main/java/org/apache/iceberg/Table.java#L103

So at this point, to determine if a column is involved in partitioning, we can 
either:
* use the current partition spec
* use all partition specs of the table 
(https://github.com/apache/iceberg/blob/7c2fcccf4a033ce620a254fae4ddc92cdc10bf85/api/src/main/java/org/apache/iceberg/Table.java#L110)

I think using the current partition spec makes sense, as typically users tend 
to include more columns in partitioning over time. Even if this method returns 
false negatives/positives we can only lose on performance, but the end results 
should be correct. Let me know if you think otherwise and we should use all 
partition specs.



--
To view, visit http://gerrit.cloudera.org:8080/19043
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iadb883a28602bb68cf4f61e57cdd691605045ac5
Gerrit-Change-Number: 19043
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 17 Oct 2022 16:36:02 +0000
Gerrit-HasComments: Yes

Reply via email to