prashantwason commented on PR #18092:
URL: https://github.com/apache/hudi/pull/18092#issuecomment-3947823227
@danny0405 Thanks for the review!
Regarding your questions:
1. **Why sort in `filterPartitionPaths()` vs `generateClusteringPlan()`**:
The sorting is intentionally placed in `filterPartitionPaths()` rather
than near the limit in `generateClusteringPlan()` because:
- `filterPartitionPaths()` is the source of partition paths for
clustering - sorting here ensures consistency regardless of how partitions are
used downstream
- The `clusteringMaxNumGroups` limit in `generateClusteringPlan()`
operates on clustering groups (not partitions directly), so sorting partitions
there would be too late
- Sorting at the filter level is more defensive and ensures deterministic
behavior in all code paths that consume the filtered partitions
2. **Similar issues in compaction**:
Looking at the compaction code, the partition-aware strategies already
handle sorting:
- `BoundedPartitionAwareCompactionStrategy.filterPartitionPaths()` sorts
with `Comparator.reverseOrder()` (line 73-74)
- `UnBoundedPartitionAwareCompactionStrategy.filterPartitionPaths()` also
sorts in reverse order
So compaction doesn't have this issue - the bounded strategies already
include sorting as part of their logic.
Let me know if you have any other concerns!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]