coderfender commented on code in PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#discussion_r2085140996
##########
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java:
##########
@@ -199,30 +214,48 @@ protected long defaultTargetFileSize() {
public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile,
RewriteFileGroup> plan() {
StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups();
RewriteExecutionContext ctx = new RewriteExecutionContext();
- Stream<RewriteFileGroup> groups =
- plan.entrySet().stream()
- .filter(e -> !e.getValue().isEmpty())
- .flatMap(
- e -> {
- StructLike partition = e.getKey();
- List<List<FileScanTask>> scanGroups = e.getValue();
- return scanGroups.stream()
- .map(
- tasks -> {
- long inputSize = inputSize(tasks);
- return newRewriteGroup(
- ctx,
- partition,
- tasks,
- inputSplitSize(inputSize),
- expectedOutputFiles(inputSize));
- });
- })
- .sorted(RewriteFileGroup.comparator(rewriteJobOrder));
+ List<RewriteFileGroup> selectedFileGroups = new ArrayList<>();
+ AtomicInteger fileCountRunner = new AtomicInteger();
+ plan.entrySet().stream()
Review Comment:
Thank you for the feedback @pvary . The main goal here is to optimize the
`rewrite` action itself. So in essence we want users to compact data at a
volume which is more practical (which is more or less a combination of number
of files and the size of the load). The option might not solve all the tricks
at once but it is a potential step in the right direction .
You are right about the fact that we need to pick our poison so to speak in
this usecase . Either we truncate the file list right after we get the total
number of files returned after the scan and clearly mention to the users that
the `max files to rewrite` are indeed the upper bound and IMHO users should be
okay with that . Now, we could revert to filter the number of files per group
right after the other filters / ops are applied but this might create the plan
too big as per your valid concern.
I would still be inclined to go ahead and provide this option to the users
and build on it in future releases
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]