Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/23042 )
Change subject: IMPALA-12337: Implement delete orphan files for Iceberg table ...................................................................... Patch Set 6: (5 comments) Thanks for working on this! http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/analysis/AlterTableExecuteRemoveOrphanFilesStmt.java File fe/src/main/java/org/apache/impala/analysis/AlterTableExecuteRemoveOrphanFilesStmt.java: http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/analysis/AlterTableExecuteRemoveOrphanFilesStmt.java@38 PS6, Line 38: as NOW() When we document this feature we should make it very clear not to use NOW() as parameter as it will remove files of in-progress operations. Or, it would be nice to have a limit on the timestamp, e.g. only allow timestamps earlier than "NOW() - interval 1 hours". But that would require the introduction of another tunable, at least for testing, e.g. "min_orphan_file_age_minutes". http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/analysis/AlterTableExecuteRemoveOrphanFilesStmt.java@79 PS6, Line 79: if (timestampExpr != null) { nit: to limit nesting, we could just: if (timestampExpr == null) { throw ... } try { olderThanMillis_ = ... http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/catalog/iceberg/ImpalaIcebergDeleteOrphanFiles.java File fe/src/main/java/org/apache/impala/catalog/iceberg/ImpalaIcebergDeleteOrphanFiles.java: http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/catalog/iceberg/ImpalaIcebergDeleteOrphanFiles.java@61 PS6, Line 61: This is copied from HiveIcebergDeleteOrphanFiles.java with minimal changes : * (HIVE-27906). Please also include the git hash, so later we can quickly check for changes. http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/23042/6/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1687 PS6, Line 1687: iceTxn.commitTransaction(); I just wonder if we should return early after L1631, this way we might save an unnecessary table reload. Though we also won't update TBL_PROP_LAST_DDL_TIME, but it might be OK for this case. http://gerrit.cloudera.org:8080/#/c/23042/6/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/23042/6/tests/query_test/test_iceberg.py@464 PS6, Line 464: @SkipIf.not_dfs Can we make this test work on S3 as well? Since we are removing files I think it's critical to test it against as many storage systems as we can. -- To view, visit http://gerrit.cloudera.org:8080/23042 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5979cdf15048d5a2c4784918533f65f32e888de0 Gerrit-Change-Number: 23042 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 24 Jun 2025 16:12:26 +0000 Gerrit-HasComments: Yes
