Re: [PR] [WIP] Spark 4.1: Implement SupportsReportOrdering DSv2 API [iceberg]

via GitHub Thu, 19 Feb 2026 08:15:13 -0800


peter-toth commented on PR #14948:
URL: https://github.com/apache/iceberg/pull/14948#issuecomment-3928283054


   From Spark's point of view, my concern is that unnecessary partition 
grouping can cause performance degradation. 
[SPARK-55092](https://issues.apache.org/jira/browse/SPARK-55092) tracks the 
problem, and https://github.com/apache/spark/pull/53859 and 
https://github.com/apache/spark/pull/54330 try to fix it.
   
   If this PR disables bin packing, then the above PRs won't be able to fix 
the issue:
   > 1. Bin-packing of file scan tasks is disabled when ordering is required 
since [Spark will discard](https://github.com/apache/spark/blob/2fc65e1c98ed53641f5204215b840e33463df987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala#L163) 
ordering if multiple input partitions exist with the same grouping key.
   
   So I would suggest keeping bin packing and reporting the sort order for 
those packed partitions. When partition grouping is needed, Spark should then 
also merge the sorted partitions that share a key using a k-way merge.
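
To illustrate the k-way merge idea, here is a minimal Python sketch. It is not Spark or Iceberg code: `merge_grouped_partitions`, the `(grouping_key, rows)` input shape, and the sample data are all hypothetical. It assumes every input partition is already sorted by the reported ordering, and shows that partitions sharing a grouping key can be combined into one sorted output without a full re-sort.

```python
import heapq
from collections import defaultdict
from operator import itemgetter


def merge_grouped_partitions(partitions, sort_key):
    """Merge already-sorted partitions that share a grouping key.

    partitions: list of (grouping_key, rows); each rows list is assumed
    to be sorted by sort_key (hypothetical stand-in for packed scan tasks).
    Returns a dict mapping each grouping key to one sorted list of rows.
    """
    runs_by_key = defaultdict(list)
    for grouping_key, rows in partitions:
        runs_by_key[grouping_key].append(rows)

    # heapq.merge performs a lazy k-way merge over the sorted runs,
    # so the per-partition ordering is preserved instead of discarded.
    return {
        grouping_key: list(heapq.merge(*runs, key=sort_key))
        for grouping_key, runs in runs_by_key.items()
    }


# Two packed partitions share the key "2026-02-18"; one has "2026-02-19".
partitions = [
    ("2026-02-18", [(1, "a"), (4, "d")]),
    ("2026-02-18", [(2, "b"), (3, "c")]),
    ("2026-02-19", [(5, "e")]),
]
merged = merge_grouped_partitions(partitions, sort_key=itemgetter(0))
```

With k sorted runs holding n rows in total, a heap-based merge costs O(n log k) comparisons, which is why it can be cheaper than dropping the ordering and sorting the grouped partition again.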
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

