[PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-13 Thread via GitHub


bk-mz opened a new pull request, #9945:
URL: https://github.com/apache/iceberg/pull/9945

   The SparkBinPackPositionDeletesRewriter module disables Adaptive Query 
Execution (AQE) to correct issues with partitioning.
   
   However, if the parent Spark session is configured with a 
`spark.sql.shuffle.partitions` value not equal to 1, this can lead to the 
creation of multiple delete files, rather than a single file.
   
   To address this issue, enforce setting sql.shuffle.partitions to 1.
   
   Fixes https://github.com/apache/iceberg/issues/9833


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-04-03 Thread via GitHub


nastra commented on PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2034532286

   @bk-mz I don't think the test actually reproduces the underlying issue. See 
my comment in https://github.com/apache/iceberg/pull/9945#discussion_r1526415663


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-04-04 Thread via GitHub


bk-mz commented on PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2036840825

   @nastra hey, yes, you are correct. sorry for that.
   
   unfortunately I can't setup proper infrastructure to ensure this test 
requirements. 
   
   I've found this bug in prod during production under high load which also 
involves setting custom spark.sql.partitions and this PR fixes that.
   
   can you help with setting up the tests? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-06-13 Thread via GitHub


bk-mz closed pull request #9945: Fix `rewrite_position_delete_files` result 
file set
URL: https://github.com/apache/iceberg/pull/9945


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-06-13 Thread via GitHub


bk-mz commented on PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#issuecomment-2165151682

   This looks plain wrong. 
   
   If I we set shuffle.partititions to 1 this will affect the rewrite 
procedure, it will collapse all delete files into one, meaning we'd be having a 
situation where 1 delete file affects all files of the partitions, which in 
turn would lead to querying all files from it for any query, also damage 
pruning/statistics update.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-13 Thread via GitHub


nastra commented on code in PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#discussion_r1523241967


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java:
##
@@ -64,6 +64,8 @@ class SparkBinPackPositionDeletesRewriter extends 
SizeBasedPositionDeletesRewrit
 // Disable Adaptive Query Execution as this may change the output 
partitioning of our write
 this.spark = spark.cloneSession();
 this.spark.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false);
+// Because we're turning off AQE we need to turn off shuffle partitions 
b/c deletes rewriter makes a join later
+this.spark.conf().set(SQLConf.SHUFFLE_PARTITIONS().key(), "1");

Review Comment:
   can you please add some tests to `TestRewritePositionDeleteFilesProcedure` 
that reproduce the behavior you were seeing in #9833?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-14 Thread via GitHub


bk-mz commented on PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#issuecomment-1996688216

   Done:
   
   ```
   ./gradlew :iceberg-spark:iceberg-spark-extensions-3.5_2.12:test --tests 
TestRewritePositionDeleteFilesProcedure
   BUILD SUCCESSFUL in 30s
   41 actionable tasks: 2 executed, 39 up-to-date
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-14 Thread via GitHub


nastra commented on code in PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#discussion_r1524645127


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewritePositionDeleteFilesProcedure.java:
##
@@ -119,11 +121,22 @@ public void testExpireDeleteFilesAll() throws Exception {
   public void testExpireDeleteFilesNoOption() throws Exception {
 createTable();
 
-sql("DELETE FROM %s WHERE id=1", tableName);
-sql("DELETE FROM %s WHERE id=2", tableName);
-sql("DELETE FROM %s WHERE id=3", tableName);
-sql("DELETE FROM %s WHERE id=4", tableName);
-sql("DELETE FROM %s WHERE id=5", tableName);
+// those should be unset on procedure call
+// if not then procedure will create not 1 but many files
+Map sqlConf =

Review Comment:
   I think it's better to just add a new test rather than adjusting existing 
tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-14 Thread via GitHub


bk-mz commented on PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#issuecomment-1998464433

   @nastra done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix `rewrite_position_delete_files` result file set [iceberg]

2024-03-15 Thread via GitHub


nastra commented on code in PR #9945:
URL: https://github.com/apache/iceberg/pull/9945#discussion_r1526415663


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java:
##
@@ -64,6 +64,8 @@ class SparkBinPackPositionDeletesRewriter extends 
SizeBasedPositionDeletesRewrit
 // Disable Adaptive Query Execution as this may change the output 
partitioning of our write
 this.spark = spark.cloneSession();
 this.spark.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false);
+// Because we're turning off AQE we need to turn off shuffle partitions 
b/c deletes rewriter makes a join later
+this.spark.conf().set(SQLConf.SHUFFLE_PARTITIONS().key(), "1");

Review Comment:
   I checked and the test you added passes for me even when this isn't being 
forced to 1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org