RussellSpitzer commented on PR #15150: URL: https://github.com/apache/iceberg/pull/15150#issuecomment-3954311799
> I think @jbewing did it this way because Iceberg may request Spark to order by fields outside of just the table sort-order-id (for example, sort by row positions, etc.). We need the transform to convert them to Iceberg sort order id metadata.

This is why I suggest the other approach, since it's more generic. Instead of assuming that what gets passed through is what was set, we can always just look at what Spark has chosen to use as its distribution and ordering and figure out whether that maps to a given sort order ID.

So the approach I'm describing is essentially: "We look at what Spark actually ended up sorting by, match it against the table's known sort orders, and pass the matched ID to the file builder." The current approach is: "We decide the sort order ID up front when planning the write, then carry it alongside Spark's ordering all the way to the file builder."

I did a full example with some more tests here if you want to check it out: https://github.com/RussellSpitzer/iceberg/commit/c48c6a600379a773b15fc68b933a0ce51dfe994b

I think we may have a larger disagreement here, and sorry for the amount of time it's taken me in between checks. Let me see if we can get @szehon-ho or @aokolnychyi to also take a look here. They are both very focused on the Spark side and probably have some opinions on the right way to pass this through.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
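The "match after the fact" approach described above can be sketched roughly as follows. This is a hypothetical simplification, not the linked implementation: sort fields are modeled as plain strings rather than Iceberg `SortField` objects, the class and method names are invented, and the prefix-matching rule (a known order matches if it is a prefix of what Spark actually sorted by, to allow for extra fields Spark appends such as row position) is an assumption drawn from the discussion.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: recover a table sort order ID from the ordering
// Spark actually ended up using, instead of carrying the ID through the plan.
public class SortOrderMatcher {

    // Iceberg uses 0 as the "unsorted" sort order id; used as the fallback here.
    static final int UNSORTED_ORDER_ID = 0;

    // sparkOrdering: the ordering Spark chose for the write (strings stand in
    // for real SortField objects in this sketch).
    // knownSortOrders: the table's sort orders, keyed by sort order id.
    static int matchSortOrderId(List<String> sparkOrdering,
                                Map<Integer, List<String>> knownSortOrders) {
        for (Map.Entry<Integer, List<String>> entry : knownSortOrders.entrySet()) {
            List<String> orderFields = entry.getValue();
            // Assumption: Spark may append extra ordering fields (e.g. row
            // position), so a known order matches when it is a prefix of the
            // ordering Spark actually used.
            if (sparkOrdering.size() >= orderFields.size()
                    && sparkOrdering.subList(0, orderFields.size()).equals(orderFields)) {
                return entry.getKey();
            }
        }
        return UNSORTED_ORDER_ID; // no known order matched; record as unsorted
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> known = Map.of(1, List.of("id ASC", "ts DESC"));
        // Spark sorted by the table order plus an extra field of its own.
        System.out.println(matchSortOrderId(
                List.of("id ASC", "ts DESC", "_pos ASC"), known)); // prints 1
        // An ordering that matches no known sort order falls back to unsorted.
        System.out.println(matchSortOrderId(List.of("ts DESC"), known)); // prints 0
    }
}
```

The matched ID would then be handed to the file builder as metadata, which is what makes this approach robust to Spark choosing an ordering beyond the table's declared sort order.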
