Hi everyone,

I'd like your expert review of two PRs that leverage Iceberg's sort order metadata to improve both read performance and compaction efficiency. They are complementary, and together they make the sort order a more actionable property of the table.

1. Sort-aware reads: Spark sort elimination via SupportsReportOrdering
PR: https://github.com/apache/iceberg/pull/14948

This PR implements the Spark DSv2 SupportsReportOrdering API so that Spark can eliminate redundant sorts when reading from sorted Iceberg tables. When files carry a valid sort order ID matching the table's current sort order, the scan reports its ordering to the optimizer, which removes unnecessary sort stages in joins, MERGE INTO, and ORDER BY queries. The read path uses a k-way merge to produce globally sorted output from multiple sorted files within a partition.

2. Sort-preserving compaction: K-way merge rewrite strategy
PR: https://github.com/apache/iceberg/pull/16305

This PR adds a new k-way-merge strategy to RewriteDataFiles that compacts pre-sorted files without a shuffle. For tables that are already sorted but accumulate overlapping files from daily ingestion, k-way merge re-compacts them in O(n log k) time (n rows across k files) with zero shuffle and zero spill. This is significantly cheaper than re-running the sort strategy, which shuffles data that is already sorted.

Relationship between PRs

The sort-aware read optimization (PR #14948) benefits directly from having well-maintained sorted files. The k-way merge strategy (PR #16305) provides an efficient way to maintain that sorted state over time without paying the full cost of a sort compaction on each cycle. Together, they establish a pattern of sorting once and then maintaining that sort cheaply, which benefits every read.

I tested both PRs on large-scale tables at my employer and observed a significant reduction in resource usage.

I'd appreciate reviews and feedback on both PRs, specifically:

- Whether the API surface (kWayMerge() method, k-way-merge procedure strategy name) is appropriate.
- Whether the planner/runner separation in the new architecture is the right place for these abstractions.
- Any concerns about the generic reader/writer approach vs. Spark's vectorized path for the compaction runner.

Thanks,
Anurag
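
P.S. For anyone less familiar with the technique, below is a minimal Python sketch of a heap-based k-way merge over in-memory lists of integers. It is illustrative only and is not the code in either PR: the name k_way_merge, the _EXHAUSTED sentinel, and the plain integer keys are assumptions made for the example. Each emitted element costs one O(log k) heap operation, which is where the O(n log k) bound comes from, and no input run is re-sorted or fully buffered.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Sentinel marking an exhausted run (hypothetical helper for this sketch).
_EXHAUSTED = object()


def k_way_merge(sorted_runs: List[Iterable[int]]) -> Iterator[int]:
    """Lazily merge k individually sorted runs into one sorted stream."""
    iterators = [iter(run) for run in sorted_runs]

    # Heap entries are (current value, index of the run it came from).
    heap: List[Tuple[int, int]] = []

    # Seed the heap with the first element of every non-empty run.
    for run_index, it in enumerate(iterators):
        first = next(it, _EXHAUSTED)
        if first is not _EXHAUSTED:
            heap.append((first, run_index))
    heapq.heapify(heap)

    while heap:
        # The smallest remaining element is always at the top of the heap.
        value, run_index = heap[0]
        yield value

        # Refill from the same run, or shrink the heap if that run is done.
        following = next(iterators[run_index], _EXHAUSTED)
        if following is _EXHAUSTED:
            heapq.heappop(heap)
        else:
            heapq.heapreplace(heap, (following, run_index))


# Four "files", each already sorted, with overlapping key ranges.
files = [[1, 4, 9], [2, 3, 10], [], [5, 6]]
print(list(k_way_merge(files)))  # [1, 2, 3, 4, 5, 6, 9, 10]
```

The heap never holds more than k entries, one per input run, so memory stays proportional to the number of files being merged rather than to the number of rows.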
