RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261827
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -47,4 +47,8 @@ private SparkSQLProperties() {}
public static final String PRESERVE_DATA_GROUPING =
"spark.sql.iceberg.planning.preserve-data-grouping";
public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;
+
+  // Controls how many physical file deletes to execute in parallel when not otherwise specified
+  public static final String DELETE_PARALLELISM = "driver-delete-default-parallelism";
+  public static final String DELETE_PARALLELISM_DEFAULT = "25";
Review Comment:
With S3's request throttling at around 4k requests per second, this default leaves us a lot of headroom.
Assuming a 50ms response time:
4000 max requests per second / 20 requests per thread per second =~ 200 max
concurrent requests.
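The back-of-the-envelope math above can be sketched as follows; all the numbers (the 4k/s throttle limit and the 50ms latency) are assumptions from this comment, not measured values:

```java
public class ThrottleHeadroom {
  // Assumed S3 request throttling limit (requests per second)
  static final int MAX_REQUESTS_PER_SECOND = 4000;
  // Assumed average DELETE response time in milliseconds
  static final int RESPONSE_TIME_MS = 50;

  // Requests a single thread can issue per second at the assumed latency
  static int requestsPerThreadPerSecond() {
    return 1000 / RESPONSE_TIME_MS; // 1000ms / 50ms = 20
  }

  // Concurrent threads needed to saturate the assumed throttle limit
  static int maxConcurrentRequests() {
    return MAX_REQUESTS_PER_SECOND / requestsPerThreadPerSecond(); // 4000 / 20 = 200
  }

  public static void main(String[] args) {
    System.out.println(maxConcurrentRequests()); // prints 200
  }
}
```

At the default of 25 threads, that works out to roughly 500 requests per second, well under the assumed 4000/s ceiling.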
Another option would be to also incorporate the "bulk delete" APIs, but
that would only help with S3-based filesystems.
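As a rough illustration of how a driver-side parallelism setting like this bounds concurrent deletes, here is a generic fixed-size-thread-pool sketch; it is not Iceberg's actual implementation, and `deleteFile` is a hypothetical stand-in for a real FileIO delete call:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDeleteSketch {
  // Stand-in for the configured delete parallelism (the proposed default is 25)
  static final int DELETE_PARALLELISM_DEFAULT = 25;

  // Submit every delete to a pool whose size caps concurrent requests
  static void deleteInParallel(List<String> paths) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(DELETE_PARALLELISM_DEFAULT);
    for (String path : paths) {
      pool.submit(() -> deleteFile(path));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  // Hypothetical placeholder; a real implementation would call the storage layer
  static void deleteFile(String path) {
    System.out.println("deleting " + path);
  }
}
```

Because the pool size is fixed, at most `DELETE_PARALLELISM_DEFAULT` deletes are in flight at once, which keeps the request rate comfortably under the throttling estimate above.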
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]