BlakeOrth commented on issue #19971: URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3800841802
I'm sure there are performance gains to be had during the file listing phase of a cold query. However, I'm skeptical that parallelizing the calls that actually list the objects backing a table would gain much, and I think actual evidence of a performance improvement should be required here.

The issue with parallelizing the listing calls themselves is that the underlying `object_store` machinery is inherently sequential. Even if you invoke it in parallel, the underlying implementations still have to make sequential calls. How do you parallelize the discovery of a set that is of unknown size and doesn't guarantee a deterministic ordering of results?

@Dandandan I'm not familiar with the output of the benchmarking tool you're using here. Is the single thread that's running during the listing operations actually CPU bound, or is DataFusion spending that time waiting on IO?
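
To make the "inherently sequential" point concrete, here is a minimal sketch of a paginated listing loop. It assumes a token-based pagination model, where each response carries the token needed to request the next page. `MockStore`, `Page`, `list_page`, and `list_all` are hypothetical names for illustration only; this is not the `object_store` API and not how DataFusion implements listing.

```rust
/// Hypothetical in-memory stand-in for a paginated LIST endpoint.
/// This is NOT the real `object_store` interface.
struct MockStore {
    keys: Vec<String>,
    page_size: usize,
}

/// One page of results plus the token required to fetch the next page.
struct Page {
    keys: Vec<String>,
    next_token: Option<usize>,
}

impl MockStore {
    /// Returns one page of keys under `prefix`, starting at `token`.
    fn list_page(&self, prefix: &str, token: Option<usize>) -> Page {
        let matching: Vec<String> = self
            .keys
            .iter()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect();
        let start = token.unwrap_or(0).min(matching.len());
        let end = (start + self.page_size).min(matching.len());
        // A token is only handed out when more results remain.
        let next_token = if end < matching.len() { Some(end) } else { None };
        Page {
            keys: matching[start..end].to_vec(),
            next_token,
        }
    }
}

/// Lists every key under `prefix` and counts the round trips taken.
/// Request N+1 cannot be issued until response N has arrived, because
/// the token it needs is only known from that response.
fn list_all(store: &MockStore, prefix: &str) -> (Vec<String>, usize) {
    let mut out = Vec::new();
    let mut token = None;
    let mut round_trips = 0;
    loop {
        let page = store.list_page(prefix, token);
        round_trips += 1;
        out.extend(page.keys);
        match page.next_token {
            Some(t) => token = Some(t),
            None => break,
        }
    }
    (out, round_trips)
}
```

In this model, listing five keys with a page size of two takes three dependent round trips, and calling `list_all` from more threads does not shorten that chain for a single prefix.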
