Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

via GitHub Fri, 09 May 2025 17:57:35 -0700


coderfender commented on code in PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#discussion_r2082637481



##########
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java:
##########
@@ -199,30 +214,48 @@ protected long defaultTargetFileSize() {
   public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, 
RewriteFileGroup> plan() {
     StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups();
     RewriteExecutionContext ctx = new RewriteExecutionContext();
-    Stream<RewriteFileGroup> groups =
-        plan.entrySet().stream()
-            .filter(e -> !e.getValue().isEmpty())
-            .flatMap(
-                e -> {
-                  StructLike partition = e.getKey();
-                  List<List<FileScanTask>> scanGroups = e.getValue();
-                  return scanGroups.stream()
-                      .map(
-                          tasks -> {
-                            long inputSize = inputSize(tasks);
-                            return newRewriteGroup(
-                                ctx,
-                                partition,
-                                tasks,
-                                inputSplitSize(inputSize),
-                                expectedOutputFiles(inputSize));
-                          });
-                })
-            .sorted(RewriteFileGroup.comparator(rewriteJobOrder));
+    List<RewriteFileGroup> selectedFileGroups = new ArrayList<>();
+    AtomicInteger fileCountRunner = new AtomicInteger();
+    plan.entrySet().stream()

Review Comment:
   @pvary  , I pushed a commit which moved the pruning logic rught after we get 
fileScanTasks from the scan API . The one good thing with this is that the 
implementation is easier than the above approach and the other thing to note is 
the file scan tasks getting pruned are always guaranteed to be random (since we 
are pruning before grouping the partitions)  . Let me know if you think this is 
a clearer approach than the previous one or other wise 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

Reply via email to