jbewing commented on PR #15150: URL: https://github.com/apache/iceberg/pull/15150#issuecomment-3957003794
No worries @RussellSpitzer. I missed some of your comments inadvertently so is absolutely goes both ways :sorry:! > "We look at what Spark actually ended up sorting by, match it against the table's known sort orders, and pass the matched ID to the file builder." So this is exactly what I did when I was first designing this PR. I really like the approach. On paper it's way simpler. Your cursor _almost_ one shotted the fundamentals, it missed a few details I outlined in https://github.com/apache/iceberg/pull/15150#discussion_r2850809699. I agree, it's cleaner and it ought to work that way. That being said and not that it _should_ matter in practice because [per the spec position deletes don't have a defined sort order](https://iceberg.apache.org/spec/#data-file-fields:~:text=Position%20deletes%20are%20required%20to%20be%20sorted%20by%20file%20and%20position%2C%20not%20a%20table%20order%2C%20and%20should%20set%20sort%20order%20id%20to%20null.), is that in the scheme you outlined above, you cannot go backwards from the Spark order to the Iceberg order. Your implementation will return the table sort order for MoR position deletes for a sorted table (the spark ordering is just [POS_DEL]). A lot of this manifests from the fact that while a Spark ordering may exist when a table sort order isn't set in partitioned write scenarios. Typically, the iceberg sort order fields are a suffix of the Spark ordering, but in many cases the spark ordering isn't == iceberg ordering. I chose to make the relationship between Spark ordering & Iceberg ordering explicit in my PR to make it impossible for an implicit relationship between the two to inadvertently break things. Yes, it's _slightly_ more verbose, but does reflect the reality that the physical ordering and logical ordering aren't always equal. The Spark ordering != Iceberg ordering was a big learning for me in the codebase—as it was one of my first modifications to iceberg when making this change—and I think the version I've proposed here makes things clearer for readers while being slightly less terse. All to say: I think both can work. It's a matter of personal preference. Is this blocking for you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
