Hi everyone,

I’d like your expert reviews on two PRs that leverage Iceberg's sort order
metadata to improve both read performance and compaction efficiency. These
are complementary and together make the sort order a more actionable
property of the table.

1. Sort-aware reads: Spark sort elimination via SupportsReportOrdering
PR: https://github.com/apache/iceberg/pull/14948

This PR implements the Spark DSv2 SupportsReportOrdering API so that Spark
can eliminate redundant sorts when reading from sorted Iceberg tables. When
files carry a valid sort order ID matching the table's current sort order,
the scan reports ordering to the optimizer, removing unnecessary sort
stages in joins, merge-into, and order-by queries. It uses a k-way merge at
the read path to produce globally sorted output from multiple sorted files
within a partition.

2. Sort-preserving compaction: K-way merge rewrite strategy
PR: https://github.com/apache/iceberg/pull/16305

This PR adds a new k-way-merge strategy to RewriteDataFiles that compacts
pre-sorted files without shuffle. For tables that are already sorted but
accumulate overlapping files from daily ingestion, k-way merge re-compacts
them in O(n log k) with zero shuffle and zero spill. This is significantly
cheaper than re-running the sort strategy, which shuffles data that is
already sorted.
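
Both PRs lean on the same k-way merge primitive: consume k already-sorted runs through a min-heap of size k, so each of the n records pays O(log k) heap work and the output is globally sorted with no shuffle or spill. A minimal self-contained sketch of that primitive (class and method names here are illustrative, not taken from either PR):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

  // Merge k sorted runs into one sorted list in O(n log k).
  static List<Integer> merge(List<List<Integer>> sortedRuns) {
    // Heap entry: {current value, index of the run it came from}.
    PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    List<Iterator<Integer>> iters = new ArrayList<>();
    for (List<Integer> run : sortedRuns) {
      iters.add(run.iterator());
    }
    // Seed the heap with the head of each non-empty run.
    for (int i = 0; i < iters.size(); i++) {
      if (iters.get(i).hasNext()) {
        heap.add(new int[] {iters.get(i).next(), i});
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll(); // smallest head across all runs
      out.add(top[0]);
      Iterator<Integer> it = iters.get(top[1]);
      if (it.hasNext()) {
        heap.add(new int[] {it.next(), top[1]}); // refill from the same run
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<Integer>> runs =
        Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            Arrays.asList(5, 6, 7, 8));
    System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  }
}
```

The same shape applies whether the sorted runs are per-file record iterators at read time (PR #14948) or overlapping input files during compaction (PR #16305); only the record type and comparator differ.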

Relationship between PRs

The sort-aware read optimization (PR #14948) benefits directly from having
well-maintained sorted files. The k-way merge strategy (PR #16305) provides
an efficient way to maintain that sorted state over time without paying the
full cost of a sort compaction on each cycle. Together, they establish a
pattern of sorting once and maintaining that sort cheaply, which benefits
every read.

I tested both of these on large-scale tables at my employer and observed a
significant reduction in resource usage. I'd appreciate reviews and feedback on
both PRs, specifically:

   - Whether the API surface (the kWayMerge() method and the k-way-merge
     procedure strategy name) is appropriate.
   - Whether the planner/runner separation in the new architecture is the
     right place for these abstractions.
   - Any concerns about the generic reader/writer approach vs. Spark's
     vectorized path for the compaction runner.

Thanks,
Anurag
