peter-toth commented on PR #14948: URL: https://github.com/apache/iceberg/pull/14948#issuecomment-3928283054
From Spark's point of view, my concern is that unnecessary partition grouping can cause performance degradation. [SPARK-55092](https://issues.apache.org/jira/browse/SPARK-55092) tracks the problem, and the PRs https://github.com/apache/spark/pull/53859 and https://github.com/apache/spark/pull/54330 try to fix it. If this PR disables bin packing, those PRs won't be able to fix the issue.

> 1. Bin-packing of file scan tasks is disabled when ordering is required since [Spark will discard](https://github.com/apache/spark/blob/2fc65e1c98ed53641f5204215b840e33463df987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala#L163) ordering if multiple input partitions exist with the same grouping key.

So I would suggest keeping bin packing and reporting the sort order for those packed partitions. When partition grouping is needed, Spark should then also merge the sorted partitions that share the same key using a k-way merge.
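To make the k-way merge suggestion concrete, here is a minimal Python sketch of the idea. It is not Spark or Iceberg code: the row shape `(grouping key, sort value)` and the function name `merge_sorted_partitions` are assumptions made purely for illustration. It only shows that several partitions sharing a grouping key, each already sorted, can be combined into one sorted stream without a full re-sort.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape for illustration: (grouping key, sort value).
Row = Tuple[str, int]


def merge_sorted_partitions(partitions: List[Iterable[Row]]) -> Iterator[Row]:
    """K-way merge of partitions that are each already sorted by sort value.

    heapq.merge consumes its inputs lazily and keeps a heap holding one
    candidate row per input, so the work is O(n log k) for n rows across
    k partitions, with no full re-sort of the combined data.
    """
    return heapq.merge(*partitions, key=lambda row: row[1])


# Three bin-packed partitions that share grouping key "p1"; each one is
# individually sorted by its second field.
part_a = [("p1", 1), ("p1", 4), ("p1", 9)]
part_b = [("p1", 2), ("p1", 3)]
part_c = [("p1", 5), ("p1", 8)]

merged = list(merge_sorted_partitions([part_a, part_b, part_c]))
print([value for _, value in merged])  # prints [1, 2, 3, 4, 5, 8, 9]
```

The merge preserves the per-partition ordering in the grouped output, which is the property that would otherwise be lost when partitions with the same key are simply concatenated.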
