[PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz opened a new pull request, #9945: URL: https://github.com/apache/iceberg/pull/9945 The SparkBinPackPositionDeletesRewriter module disables Adaptive Query Execution (AQE) to correct issues with partitioning. However, if the parent Spark session is configured with a `spark.sql.shuffle.partitions` value not equal to 1, this can lead to the creation of multiple delete files, rather than a single file. To address this issue, enforce setting sql.shuffle.partitions to 1. Fixes https://github.com/apache/iceberg/issues/9833 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
nastra commented on PR #9945: URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2034532286 @bk-mz I don't think the test actually reproduces the underlying issue. See my comment in https://github.com/apache/iceberg/pull/9945#discussion_r1526415663 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz commented on PR #9945: URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2036840825 @nastra hey, yes, you are correct. sorry for that. unfortunately I can't setup proper infrastructure to ensure this test requirements. I've found this bug in prod during production under high load which also involves setting custom spark.sql.partitions and this PR fixes that. can you help with setting up the tests? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz closed pull request #9945: Fix `rewrite_position_delete_files` result file set URL: https://github.com/apache/iceberg/pull/9945 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz commented on PR #9945: URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2165151682 This looks plain wrong. If I we set shuffle.partititions to 1 this will affect the rewrite procedure, it will collapse all delete files into one, meaning we'd be having a situation where 1 delete file affects all files of the partitions, which in turn would lead to querying all files from it for any query, also damage pruning/statistics update. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
nastra commented on code in PR #9945: URL: https://github.com/apache/iceberg/pull/9945#discussion_r1523241967 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java: ## @@ -64,6 +64,8 @@ class SparkBinPackPositionDeletesRewriter extends SizeBasedPositionDeletesRewrit // Disable Adaptive Query Execution as this may change the output partitioning of our write this.spark = spark.cloneSession(); this.spark.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false); +// Because we're turning off AQE we need to turn off shuffle partitions b/c deletes rewriter makes a join later +this.spark.conf().set(SQLConf.SHUFFLE_PARTITIONS().key(), "1"); Review Comment: can you please add some tests to `TestRewritePositionDeleteFilesProcedure` that reproduce the behavior you were seeing in #9833? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz commented on PR #9945: URL: https://github.com/apache/iceberg/pull/9945#issuecomment-1996688216 Done: ``` ./gradlew :iceberg-spark:iceberg-spark-extensions-3.5_2.12:test --tests TestRewritePositionDeleteFilesProcedure BUILD SUCCESSFUL in 30s 41 actionable tasks: 2 executed, 39 up-to-date ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
nastra commented on code in PR #9945: URL: https://github.com/apache/iceberg/pull/9945#discussion_r1524645127 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewritePositionDeleteFilesProcedure.java: ## @@ -119,11 +121,22 @@ public void testExpireDeleteFilesAll() throws Exception { public void testExpireDeleteFilesNoOption() throws Exception { createTable(); -sql("DELETE FROM %s WHERE id=1", tableName); -sql("DELETE FROM %s WHERE id=2", tableName); -sql("DELETE FROM %s WHERE id=3", tableName); -sql("DELETE FROM %s WHERE id=4", tableName); -sql("DELETE FROM %s WHERE id=5", tableName); +// those should be unset on procedure call +// if not then procedure will create not 1 but many files +Map sqlConf = Review Comment: I think it's better to just add a new test rather than adjusting existing tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
bk-mz commented on PR #9945: URL: https://github.com/apache/iceberg/pull/9945#issuecomment-1998464433 @nastra done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]
nastra commented on code in PR #9945: URL: https://github.com/apache/iceberg/pull/9945#discussion_r1526415663 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java: ## @@ -64,6 +64,8 @@ class SparkBinPackPositionDeletesRewriter extends SizeBasedPositionDeletesRewrit // Disable Adaptive Query Execution as this may change the output partitioning of our write this.spark = spark.cloneSession(); this.spark.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false); +// Because we're turning off AQE we need to turn off shuffle partitions b/c deletes rewriter makes a join later +this.spark.conf().set(SQLConf.SHUFFLE_PARTITIONS().key(), "1"); Review Comment: I checked and the test you added passes for me even when this isn't being forced to 1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org