Re: [PR] Spark 3.5: Parallelize reading files in add_files procedure [iceberg]

via GitHub Wed, 27 Dec 2023 23:02:45 -0800


amogh-jahagirdar commented on code in PR #9274:
URL: https://github.com/apache/iceberg/pull/9274#discussion_r1437431367



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -374,14 +376,16 @@ private static Iterator<ManifestFile> buildManifest(
   * @param partitionFilter only import partitions whose values match those in the map, can be
   *     partially defined
   * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   * @param parallelism number of threads to use for file reading
    */
   public static void importSparkTable(
       SparkSession spark,
       TableIdentifier sourceTableIdent,
       Table targetTable,
       String stagingDir,
       Map<String, String> partitionFilter,
-      boolean checkDuplicateFiles) {
+      boolean checkDuplicateFiles,
+      int parallelism) {

Review Comment:
   Ah, apologies, I missed this. I think we're still breaking API compatibility
here by adding a new parameter to the existing public method. Could we address
this? Some level of duplication is fine, but we should avoid breaking APIs.
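
One common way to do this is to keep the existing signature and add the new parameter in an overload that the old method delegates to. A minimal sketch, assuming simplified stand-in types: `ImportUtilSketch`, `importTable`, and the default of one thread are hypothetical names and choices for illustration, not the actual `SparkTableUtil` code or the default used in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a public utility class; not the real SparkTableUtil.
public class ImportUtilSketch {
  // Records each import so the delegation can be observed in this sketch.
  static final List<String> CALLS = new ArrayList<>();

  // Existing public signature, kept unchanged so current callers still compile and link.
  public static void importTable(
      String sourceTable, Map<String, String> partitionFilter, boolean checkDuplicateFiles) {
    // Assumption: a single thread matches the previous, non-parallel behavior.
    importTable(sourceTable, partitionFilter, checkDuplicateFiles, 1);
  }

  // New overload carrying the extra parameter.
  public static void importTable(
      String sourceTable,
      Map<String, String> partitionFilter,
      boolean checkDuplicateFiles,
      int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be >= 1: " + parallelism);
    }
    // The real import work would go here; the sketch only records the call.
    CALLS.add(sourceTable + ":" + parallelism);
  }

  public static void main(String[] args) {
    importTable("db.src", Map.of(), false); // old call site, unchanged
    importTable("db.src", Map.of(), false, 4); // new call site opting in to parallelism
    System.out.println(CALLS);
  }
}
```

The duplication is limited to the extra signature and its Javadoc; the old method contains no logic of its own, so the two cannot drift apart.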



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

