marin-ma opened a new issue, #11050: URL: https://github.com/apache/incubator-gluten/issues/11050
### Description

The current method for coalescing input splits into partitions is based on sorted file sizes: after the input splits are sorted, only adjacent splits are coalesced into a single partition. If the input includes small files, the smallest ones are therefore likely to land in the same partition, and the task that reads that partition may have to open many small files and become a straggler. The coalescing method should be optimized to distribute small files evenly across partitions.

### Gluten version

None
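
One possible direction for the even distribution requested in the description, as a minimal sketch rather than Gluten's actual implementation: sort the splits from largest to smallest and always place the next split into the partition with the lowest current load. The function name `coalesce_balanced` and the parameters `max_partition_bytes` and `open_cost_bytes` are hypothetical and chosen only for illustration. The sketch also assumes the partition count can be derived from the total weight, and it does not enforce a hard cap on partition size.

```python
import heapq
import math


def coalesce_balanced(split_sizes, max_partition_bytes, open_cost_bytes=0):
    """Greedy sketch: largest split first, always into the lightest partition.

    All names here are hypothetical; this is not Gluten's API.
    """
    if not split_sizes:
        return []
    # Weight of a split = its bytes plus an assumed fixed per-file open cost,
    # so that many tiny files still count as real work.
    weights = [size + open_cost_bytes for size in split_sizes]
    # Assumption: the target partition count comes from the total weight.
    num_partitions = max(1, math.ceil(sum(weights) / max_partition_bytes))
    # Min-heap of (current load, partition index); the lightest pops first.
    heap = [(0, i) for i in range(num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for size, weight in sorted(zip(split_sizes, weights), reverse=True):
        load, idx = heapq.heappop(heap)
        partitions[idx].append(size)
        heapq.heappush(heap, (load + weight, idx))
    return partitions


if __name__ == "__main__":
    # Three large splits and six tiny ones.
    print(coalesce_balanced([100, 100, 100, 1, 1, 1, 1, 1, 1], 102))
```

With adjacent-only coalescing, the six tiny splits in this input would tend to end up together in one partition. The greedy assignment instead places two of them next to each large split. Because no hard size cap is enforced, a partition can exceed `max_partition_bytes` when the sizes are skewed.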
