jorgecarleitao opened a new pull request #9086: URL: https://github.com/apache/arrow/pull/9086
The motivation behind this PR is that Tokio does not handle blocking (e.g. CPU-intensive) operations well: while such an operation runs, the scheduler cannot switch to other tasks on that thread. For this reason, tokio recommends (throughout its documentation, but most notably [here](https://docs.rs/tokio/1.0.1/tokio/index.html#cpu-bound-tasks-and-blocking-code)) using `spawn_blocking` or `rayon` to handle blocking tasks, such as IO and CPU-bound tasks.

This PR is just an experiment / proposal / idea of how we could handle this within tokio. Specifically, it uses `spawn_blocking` to run a blocking operation on tokio's "blocking-dedicated" thread pool, thereby avoiding starving the "async-dedicated" thread pool.

I do not expect this change to make much difference in performance, as `filter` is not as blocking an operation as e.g. a group by. However, I think that this could address performance issues when we have multiple stages (as one stage currently blocks the whole thread due to how we perform blocking ops inside `async` code).

@andygrove @alamb @Dandandan , I have been looking at DataFusion's code and tokio's documentation, and I hypothesize that this would be one way to follow tokio's recommendations for our use-case, but I would really like to get your opinions.