Hi everyone,

I’d like your expert reviews on two PRs that leverage Iceberg's sort order
metadata to improve both read performance and compaction efficiency. These
are complementary and together make the sort order a more actionable
property of the table.

1. Sort-aware reads: Spark sort elimination via SupportsReportOrdering
PR: https://github.com/apache/iceberg/pull/14948

This PR implements the Spark DSv2 SupportsReportOrdering API so that Spark
can eliminate redundant sorts when reading from sorted Iceberg tables. When
files carry a valid sort order ID matching the table's current sort order,
the scan reports ordering to the optimizer, removing unnecessary sort
stages in joins, merge-into, and order-by queries. It uses a k-way merge at
the read path to produce globally sorted output from multiple sorted files
within a partition.

2. Sort-preserving compaction: K-way merge rewrite strategy
PR: https://github.com/apache/iceberg/pull/16305

This PR adds a new k-way-merge strategy to RewriteDataFiles that compacts
pre-sorted files without shuffle. For tables that are already sorted but
accumulate overlapping files from daily ingestion, k-way merge re-compacts
them in O(n log k) with zero shuffle and zero spill. This is significantly
cheaper than re-running the sort strategy, which shuffles data that is
already sorted.
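
Both PRs lean on the same k-way merge primitive: consume k already-sorted runs through a min-heap of size k, so each of the n records pays O(log k) heap work and the output is globally sorted with no shuffle or spill. A minimal self-contained sketch of that primitive (class and method names here are illustrative, not taken from either PR):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

  // Merge k sorted runs into one sorted list in O(n log k).
  static List<Integer> merge(List<List<Integer>> sortedRuns) {
    // Heap entry: {current value, index of the run it came from}.
    PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    List<Iterator<Integer>> iters = new ArrayList<>();
    for (List<Integer> run : sortedRuns) {
      iters.add(run.iterator());
    }
    // Seed the heap with the head of each non-empty run.
    for (int i = 0; i < iters.size(); i++) {
      if (iters.get(i).hasNext()) {
        heap.add(new int[] {iters.get(i).next(), i});
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll(); // smallest head across all runs
      out.add(top[0]);
      Iterator<Integer> it = iters.get(top[1]);
      if (it.hasNext()) {
        heap.add(new int[] {it.next(), top[1]}); // refill from the same run
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<Integer>> runs =
        Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            Arrays.asList(5, 6, 7, 8));
    System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  }
}
```

The same shape applies whether the sorted runs are per-file record iterators at read time (PR #14948) or overlapping input files during compaction (PR #16305); only the record type and comparator differ.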

Relationship between PRs

The sort-aware read optimization (PR #14948) benefits directly from having
well-maintained sorted files. The k-way merge strategy (PR #16305) provides
an efficient way to maintain that sorted state over time without paying the
full cost of a sort compaction on each cycle. Together, they establish a
pattern of sorting once and maintaining that sort cheaply, which benefits
every read.

I tested both of these on large-scale tables at my employer and observed a
significant reduction in resource usage. I'd appreciate reviews and feedback on
both PRs, specifically:

   - Whether the API surface (the kWayMerge() method and the k-way-merge
     procedure strategy name) is appropriate.
   - Whether the planner/runner separation in the new architecture is the
     right place for these abstractions.
   - Any concerns about the generic reader/writer approach vs. Spark's
     vectorized path for the compaction runner.

Thanks,
Anurag
