Dandandan commented on pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#issuecomment-754162220


   @jorgecarleitao 
   
   This is really cool, thanks for creating this experiment!
   I am not very deep into the Rust way of doing parallelism yet, but the 
documentation of tokio makes sense to me.
   
   Some ideas:
   
   * In general, I think it is best if the parallelism is at as high a level as 
possible, to reduce the overhead related to scheduling, context 
switching, etc.
   * But in order to utilize parallelism best it should be fine-grained enough.
   * I think there is a balance to strike in how much control over parallelism 
to have. I think Spark's concurrency via partitions is an example where you 
have a larger amount of control over it. It is not always fine-grained enough, 
e.g. if you have only one or a couple of files as input.
   * I think filtering batches is relatively fine-grained, so I am wondering if 
this is a good level for parallelism.
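
To make the granularity trade-off above concrete, here is a minimal sketch using only the standard library. It assumes a hypothetical `Batch` type (a plain `Vec<i64>`, not Arrow's `RecordBatch`) and a hypothetical `filter_partitioned` helper; neither is from this PR. It spawns one thread per partition (a contiguous group of batches) instead of one task per batch, so the scheduling overhead is paid once per partition.

```rust
use std::thread;

/// Hypothetical stand-in for a record batch: just a vector of values.
type Batch = Vec<i64>;

/// Example predicate used for filtering.
fn is_even(v: &i64) -> bool {
    v % 2 == 0
}

/// Filters every batch, using one thread per partition (a contiguous
/// group of batches) rather than one task per batch.
fn filter_partitioned(
    batches: &[Batch],
    partitions: usize,
    keep: fn(&i64) -> bool,
) -> Vec<Batch> {
    if batches.is_empty() {
        return Vec::new();
    }
    // Ceiling division so every batch lands in exactly one partition.
    let partitions = partitions.max(1);
    let per_partition = (batches.len() + partitions - 1) / partitions;

    thread::scope(|scope| {
        // Spawn one scoped thread per partition; each filters its batches in order.
        let handles: Vec<_> = batches
            .chunks(per_partition)
            .map(|partition| {
                scope.spawn(move || {
                    partition
                        .iter()
                        .map(|batch| {
                            batch.iter().copied().filter(|v| keep(v)).collect::<Batch>()
                        })
                        .collect::<Vec<Batch>>()
                })
            })
            .collect();

        // Join in spawn order, so the output keeps the input batch order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("partition thread panicked"))
            .collect()
    })
}
```

With `partitions` equal to the number of batches this degenerates into one thread per batch, which is the fine-grained end of the spectrum; with `partitions = 1` it is fully sequential.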
   
   * Tokio's default `max_blocking_threads` is 512, which I think is very 
large for CPU-intensive work (and would have a negative effect on performance): 
https://docs.rs/tokio/1.0.1/tokio/runtime/struct.Builder.html#method.max_blocking_threads
 Maybe, if using different "scopes", it makes sense to use a separate runtime 
for CPU-intensive work, configured with a different `max_blocking_threads`?
   * Tokio's documentation seems to hint that Rayon would be a better choice 
for CPU-intensive work?
   * In the `ParquetExec`, `thread::spawn` is being used. `task::spawn_blocking` 
seems a better choice there, as it handles errors in a better way and can limit 
the number of threads compared to `thread::spawn`, I guess?
   
   * I think, just like the statistics @andygrove started to add for `Exec`s, it 
would be good to have something here as well, to debug issues and make sure we 
are not doing things in an inefficient way.

