Re: [PR] Spark 4.1: Set data file sort_order_id in manifest for writes from Spark [iceberg]

via GitHub Tue, 24 Feb 2026 21:48:28 -0800


jbewing commented on PR #15150:
URL: https://github.com/apache/iceberg/pull/15150#issuecomment-3957003794

No worries @RussellSpitzer. I missed some of your comments inadvertently so
is absolutely goes both ways :sorry:!

> "We look at what Spark actually ended up sorting by, match it against the
table's known sort orders, and pass the matched ID to the file builder."

So this is exactly what I did when I was first designing this PR. I really
like the approach. On paper it's way simpler. Your cursor _almost_ one shotted
the fundamentals, it missed a few details I outlined in
https://github.com/apache/iceberg/pull/15150#discussion_r2850809699. I agree,
it's cleaner and it ought to work that way.

That being said and not that it _should_ matter in practice because [per the
spec position deletes don't have a defined sort
order](https://iceberg.apache.org/spec/#data-file-fields:~:text=Position%20deletes%20are%20required%20to%20be%20sorted%20by%20file%20and%20position%2C%20not%20a%20table%20order%2C%20and%20should%20set%20sort%20order%20id%20to%20null.),
is that in the scheme you outlined above, you cannot go backwards from the
Spark order to the Iceberg order. Your implementation will return the table
sort order for MoR position deletes for a sorted table (the spark ordering is
just [POS_DEL]).

A lot of this manifests from the fact that while a Spark ordering may exist
when a table sort order isn't set in partitioned write scenarios. Typically,
the iceberg sort order fields are a suffix of the Spark ordering, but in many
cases the spark ordering isn't == iceberg ordering.

I chose to make the relationship between Spark ordering & Iceberg ordering
explicit in my PR to make it impossible for an implicit relationship between
the two to inadvertently break things. Yes, it's _slightly_ more verbose, but
does reflect the reality that the physical ordering and logical ordering aren't
always equal. The Spark ordering != Iceberg ordering was a big learning for me
in the codebase—as it was one of my first modifications to iceberg when making
this change—and I think the version I've proposed here makes things clearer for
readers while being slightly less terse.

All to say: I think both can work. It's a matter of personal preference. Is
this blocking for you?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spark 4.1: Set data file sort_order_id in manifest for writes from Spark [iceberg]

Reply via email to