[GitHub] [arrow] Dandandan commented on pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter

GitBox Mon, 04 Jan 2021 11:14:27 -0800


Dandandan commented on pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#issuecomment-754162220

@jorgecarleitao

This is really cool, thanks for creating this experiment!
I am not very deep into the Rust way of doing parallelism yet, but the
documentation of tokio makes sense to me.

Some ideas:

* In general, I think it is best if the parallelism is at as high a level as
possible, to reduce the overhead related to scheduling, context switching,
etc.
* But in order to utilize parallelism best it should be fine-grained enough.
* I think there is some balance to strike in how much control we have over
parallelism. I think Spark concurrency via partitions is an example where you
have a larger amount of control over it, but it is not always fine-grained
enough, e.g. if you have only one or a couple of files as input.
* I think filtering batches is relatively fine-grained, so I am wondering if
this is a good level for parallelism.

* Tokio's default `max_blocking_threads` is 512, which I think is very large
for CPU-intensive work (and would have a negative effect on performance):
https://docs.rs/tokio/1.0.1/tokio/runtime/struct.Builder.html#method.max_blocking_threads
Maybe, if we use different "scopes", it makes sense to use a separate runtime
for CPU-intensive work, with a different `max_blocking_threads` config?
* Tokio's documentation seems to hint that Rayon would be a better choice
for CPU intensive work?
* In the `ParquetExec`, `thread::spawn` is being used. `task::spawn_blocking`
seems a better choice there, as it handles errors in a better way and can
limit the number of threads compared to `thread::spawn`, I guess?
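
To make the "limit the number of threads" point concrete, here is a minimal
sketch using only the standard library. It is not tokio's `spawn_blocking`
and not what `ParquetExec` does; `filter_batches_bounded` and its parameters
are hypothetical names for illustration. The idea is that a fixed number of
workers pull batches from a shared counter, so the thread count stays capped
no matter how many batches arrive:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: filter every batch with `keep`, using at most
/// `max_threads` OS threads regardless of the number of batches.
fn filter_batches_bounded<F>(
    batches: &[Vec<i64>],
    max_threads: usize,
    keep: F,
) -> Vec<Vec<i64>>
where
    F: Fn(&i64) -> bool + Sync,
{
    // Index of the next batch that has not been claimed by a worker yet.
    let next = AtomicUsize::new(0);
    // One output slot per input batch, so the input order is preserved.
    let results: Mutex<Vec<Option<Vec<i64>>>> = Mutex::new(vec![None; batches.len()]);
    // Never start more workers than there are batches, and always at least one.
    let workers = max_threads.max(1).min(batches.len().max(1));

    // Scoped threads may borrow `batches`, `keep`, `next` and `results`;
    // all workers are joined before `scope` returns.
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= batches.len() {
                    break;
                }
                let filtered: Vec<i64> =
                    batches[i].iter().copied().filter(|v| keep(v)).collect();
                results.lock().unwrap()[i] = Some(filtered);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.unwrap_or_default())
        .collect()
}
```

Calling `filter_batches_bounded(&batches, 2, |v| v % 2 == 0)` would filter all
batches with at most two threads. A plain `thread::spawn` per batch has no
such cap.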

* Just like the statistics @andygrove started to add for `Exec`s, I think it
would be good to have something similar here, to debug issues and make sure we
are not doing things in an inefficient way.
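
As a rough idea of what such statistics could look like, here is a sketch of
per-operator counters using only the standard library. `OpMetrics` and `timed`
are hypothetical names; this is not the statistics API being added to the
`Exec`s, just an illustration of counting batches, output rows and busy time
from several threads:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical per-operator counters; not DataFusion's actual metrics API.
#[derive(Default)]
struct OpMetrics {
    batches: AtomicU64,
    rows_out: AtomicU64,
    busy_nanos: AtomicU64,
}

impl OpMetrics {
    /// Run `work` for one batch and record its wall-clock time and output size.
    /// Atomics allow several worker threads to share one `OpMetrics`.
    fn timed<T>(&self, work: impl FnOnce() -> Vec<T>) -> Vec<T> {
        let start = Instant::now();
        let out = work();
        self.busy_nanos
            .fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
        self.batches.fetch_add(1, Ordering::Relaxed);
        self.rows_out.fetch_add(out.len() as u64, Ordering::Relaxed);
        out
    }
}
```

Comparing `busy_nanos` against the total query time would give a first hint
of whether the extra threads are doing useful work or mostly adding overhead.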

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
