mbutrovich commented on PR #3349: URL: https://github.com/apache/datafusion-comet/pull/3349#issuecomment-3863033259
> however some particular executor would use only few of transferred tasks? Yep, exactly. Iceberg calculates all of the FileScanTasks for partitions, but Comet currently distributes all of them to every partition (like a broadcast). > This PR makes some FileScanTask distribution between executors, only needed are sent? Yep, exactly. An executor now only receives its partition's tasks. > if so what is the distribution algorithm and do you envision any shuffle increase if executors reads not collocated data? That's implementation defined. In this case Iceberg does its own bin-packing algorithm and has its own configs for target split size (amount of data per task) to read. We respect and that and accept the distribution that Iceberg generates both at planning time (initial tasks) and after subqueries and DPP (if applicable). Iceberg handles it all, and Comet just accepts them. If the Iceberg bin-packing/task allocation algorithm changes, Comet will be oblivious. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
