alamb commented on issue #5882: URL: https://github.com/apache/arrow-rs/issues/5882#issuecomment-2337775365
> If we identify the CPU intensive sections of Datafusion and use `spawn_blocking` combined with a channel we could move all the blocking tasks to that separate threadpool and use the default tokio threadpool for IO.

If we did this I think it is important to do performance tests -- by default tokio potentially uses many (100s I think?) threads for this blocking thread pool, and if we are not careful, launching CPU bound work on them will mean the CPUs are oversubscribed (more threads than CPUs), which will reduce effectiveness.

> This is what we do at InfluxData and it works reasonably well. You have to be slightly careful so that you don't miss some IO calls or that you don't hand IO handles (e.g. sockets, or HTTP connections wrapping them) from the IO runtime to the CPU runtime.

It would be really helpful to document / write a blog about how this works -- I think it would be widely read and appreciated. @ion-elgreco any interest / chance that you or someone else in the delta lake team would be able to? I would be happy to collaborate.
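
To make the oversubscription concern concrete, here is a minimal sketch of the general pattern: CPU-bound jobs are handed over a channel to a fixed pool whose size is capped at the number of CPUs, and results come back over a second channel. This is not how DataFusion or InfluxData implement it. It uses plain `std` threads instead of tokio so that it is self-contained, and `run_on_cpu_pool` and `default_cpu_workers` are hypothetical names invented for the illustration. In tokio itself, the analogous knob would be the runtime builder's `max_blocking_threads` setting.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: pick a pool size that matches the machine's CPUs,
/// so CPU-bound work does not run on more threads than there are cores.
pub fn default_cpu_workers() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Hypothetical helper: run `work` on every input using exactly `workers`
/// threads (at least one). Jobs go in over one channel, results come back
/// over another, and the output preserves the input order.
pub fn run_on_cpu_pool<T, R, F>(inputs: Vec<T>, workers: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let workers = workers.max(1);
    let (job_tx, job_rx) = mpsc::channel::<(usize, T)>();
    let (result_tx, result_rx) = mpsc::channel::<(usize, R)>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));
    let work = Arc::new(work);

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let result_tx = result_tx.clone();
            let work = Arc::clone(&work);
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the lock is not held while the job is being computed.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok((idx, input)) => {
                        let _ = result_tx.send((idx, (*work)(input)));
                    }
                    // The job channel is closed and drained: no more work.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the result channel closes once all workers exit.
    drop(result_tx);

    let n = inputs.len();
    for (idx, input) in inputs.into_iter().enumerate() {
        job_tx.send((idx, input)).expect("worker threads are alive");
    }
    // Closing the job channel tells the workers to stop after draining it.
    drop(job_tx);

    let mut results: Vec<Option<R>> = (0..n).map(|_| None).collect();
    for (idx, result) in result_rx {
        results[idx] = Some(result);
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    results
        .into_iter()
        .map(|r| r.expect("every job produced a result"))
        .collect()
}
```

A call such as `run_on_cpu_pool(batches, default_cpu_workers(), process_batch)` (with `batches` and `process_batch` supplied by the caller) would keep the number of threads doing CPU work bounded by the core count, no matter how many jobs are queued. Whether that bound is the right one for a given workload is something the performance tests mentioned above would need to confirm.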
