Re: [PR] Spark 4.1: Set data file sort_order_id in manifest for writes from Spark [iceberg]

via GitHub Mon, 02 Feb 2026 19:49:32 -0800


jbewing commented on code in PR #15150:
URL: https://github.com/apache/iceberg/pull/15150#discussion_r2757050707



##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkWriteRequirements.java:
##########
@@ -26,18 +26,32 @@
 /** A set of requirements such as distribution and ordering reported to Spark during writes. */
 public class SparkWriteRequirements {
 
+  public static final long NO_ADVISORY_PARTITION_SIZE = 0;
   public static final SparkWriteRequirements EMPTY =
-      new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0], 0);
+      new SparkWriteRequirements(
+          Distributions.unspecified(),
+          new SortOrder[0],
+          org.apache.iceberg.SortOrder.unsorted(),
+          NO_ADVISORY_PARTITION_SIZE);
 
   private final Distribution distribution;
   private final SortOrder[] ordering;
+  private final org.apache.iceberg.SortOrder icebergOrdering;
   private final long advisoryPartitionSize;
 
   SparkWriteRequirements(

Review Comment:
   You probably could get away with just passing the id all the way down, and
that is _effectively_ what is happening here.
   
   We unwrap the id into an Iceberg Sort Order because it makes the code more
expressive and readable in some places, IMO. SparkWriteRequirements is a nice
example: having the Iceberg sort order available makes it _really_ easy to
express how the Spark execution sort order should behave when the Spark
ordering doesn't necessarily match the Iceberg ordering (for example, when an
additional prefix is added because we're using a range write distribution).
   
   I'm happy to unwind this if you disagree and think the other way is more
expressive. In my many iterations of solving this problem "cleanly", I found
that keeping the two Sort Orders together, despite the fully qualified class
name terribleness, shows the relationship between them nicely and keeps things
concise.
   
   Not passing the Iceberg ordering through is a bit more concise but
substantially more brittle and prone to breakage, and correctness felt more
important than being concise. Passing an Iceberg Sort Order id down instead
just leads to it being unwrapped from an id in quite a few places, plus a ton
of `if (id == 0 / UNSORTED) {}` checks.
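
   To make the trade-off concrete, here is a minimal, self-contained sketch of the two approaches. It is not the code in this PR: `TableSortOrder`, `WriteRequirements`, `fromOrder`, and `executionOrderingFromId` are hypothetical stand-ins for the real Iceberg and Spark types, column names replace real sort fields, and id 0 is assumed to mean "unsorted".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only. These are NOT the real
// org.apache.iceberg.SortOrder or Spark SortOrder classes.
final class SortOrderSketch {

  /** Simplified table sort order: an id plus the sorted column names. */
  record TableSortOrder(int orderId, List<String> columns) {
    static final TableSortOrder UNSORTED = new TableSortOrder(0, List.of());
  }

  /** Simplified write requirements that keep both orderings side by side. */
  record WriteRequirements(List<String> executionOrdering, TableSortOrder tableOrdering) {
    int sortOrderIdForManifest() {
      return tableOrdering.orderId();
    }
  }

  // Object-passing variant: the execution ordering is the distribution prefix
  // followed by the table ordering; no id lookups or special cases are needed.
  static WriteRequirements fromOrder(List<String> prefix, TableSortOrder tableOrdering) {
    List<String> execution = new ArrayList<>(prefix);
    execution.addAll(tableOrdering.columns());
    return new WriteRequirements(List.copyOf(execution), tableOrdering);
  }

  // Id-passing variant: every caller has to resolve the id again and
  // special-case the unsorted id.
  static List<String> executionOrderingFromId(
      List<String> prefix, int orderId, Map<Integer, TableSortOrder> ordersById) {
    List<String> execution = new ArrayList<>(prefix);
    if (orderId != 0) { // 0 is assumed to mean "unsorted" in this sketch
      TableSortOrder resolved = ordersById.get(orderId);
      if (resolved == null) {
        throw new IllegalArgumentException("Unknown sort order id: " + orderId);
      }
      execution.addAll(resolved.columns());
    }
    return List.copyOf(execution);
  }

  public static void main(String[] args) {
    TableSortOrder byTime = new TableSortOrder(1, List.of("event_ts"));
    WriteRequirements reqs = fromOrder(List.of("event_date"), byTime);
    System.out.println(reqs.executionOrdering() + " -> id " + reqs.sortOrderIdForManifest());
  }
}
```

   In the object-passing variant the unsorted case falls out naturally (an empty column list adds nothing after the prefix), whereas the id-passing variant needs the lookup and the `orderId != 0` branch wherever the ordering is consumed.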



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

