RussellSpitzer commented on PR #15150: URL: https://github.com/apache/iceberg/pull/15150#issuecomment-3954311799
> I think @jbewing did it this way because Iceberg may request Spark to order by fields outside of just the table sort-order-id (for example, sort by row positions, etc.). We need the transform to convert them to Iceberg sort order id metadata.

This is why I suggest the other approach, since it's more generic. Instead of assuming that what gets passed through is what was set, we can always just look at what Spark has chosen to use as its distribution and ordering and figure out whether that maps to a given sort order ID.

So the approach I'm describing is essentially: "We look at what Spark actually ended up sorting by, match it against the table's known sort orders, and pass the matched ID to the file builder." The current approach is: "We decide the sort order ID up front when planning the write, then carry it alongside Spark's ordering all the way to the file builder."

I did a full example with some more tests here if you want to check it out: https://github.com/RussellSpitzer/iceberg/commit/c48c6a600379a773b15fc68b933a0ce51dfe994b

I think we may have a larger disagreement here, and sorry for the amount of time it's taken me in between checks. Let me see if we can get @szehon-ho or @aokolnychyi to also take a look here. They are both very focused on the Spark side and probably have some opinions on the right way to pass this through.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
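The "match after the fact" approach described above can be sketched roughly as follows. This is a hypothetical simplification, not the linked implementation: sort fields are modeled as plain strings rather than Iceberg `SortField` objects, the class and method names are invented, and the prefix-matching rule (a known order matches if it is a prefix of what Spark actually sorted by, to allow for extra fields Spark appends such as row position) is an assumption drawn from the discussion.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: recover a table sort order ID from the ordering
// Spark actually ended up using, instead of carrying the ID through the plan.
public class SortOrderMatcher {

    // Iceberg uses 0 as the "unsorted" sort order id; used as the fallback here.
    static final int UNSORTED_ORDER_ID = 0;

    // sparkOrdering: the ordering Spark chose for the write (strings stand in
    // for real SortField objects in this sketch).
    // knownSortOrders: the table's sort orders, keyed by sort order id.
    static int matchSortOrderId(List<String> sparkOrdering,
                                Map<Integer, List<String>> knownSortOrders) {
        for (Map.Entry<Integer, List<String>> entry : knownSortOrders.entrySet()) {
            List<String> orderFields = entry.getValue();
            // Assumption: Spark may append extra ordering fields (e.g. row
            // position), so a known order matches when it is a prefix of the
            // ordering Spark actually used.
            if (sparkOrdering.size() >= orderFields.size()
                    && sparkOrdering.subList(0, orderFields.size()).equals(orderFields)) {
                return entry.getKey();
            }
        }
        return UNSORTED_ORDER_ID; // no known order matched; record as unsorted
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> known = Map.of(1, List.of("id ASC", "ts DESC"));
        // Spark sorted by the table order plus an extra field of its own.
        System.out.println(matchSortOrderId(
                List.of("id ASC", "ts DESC", "_pos ASC"), known)); // prints 1
        // An ordering that matches no known sort order falls back to unsorted.
        System.out.println(matchSortOrderId(List.of("ts DESC"), known)); // prints 0
    }
}
```

The matched ID would then be handed to the file builder as metadata, which is what makes this approach robust to Spark choosing an ordering beyond the table's declared sort order.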
