alamb commented on issue #92: URL: https://github.com/apache/arrow-datafusion/issues/92#issuecomment-826828467
Comment from Andy Grove (andygrove) @ 2020-08-12T15:19:33.316+0000: <pre>Based on recent experience testing query execution with async, I no longer feel that async makes sense for DataFusion. Async is good for network I/O but not for file I/O. It is better to have a single thread per partition when executing queries. Also, we can't currently use async with Parquet without launching a dedicated thread per partition, which pretty much defeats the point of using async in the first place. I believe that we do need the concept of executors and a scheduler in DataFusion, where each executor would run on a dedicated thread. Other projects would then be able to extend this, for example for distributed execution.</pre>

Comment from Adam Lippai (alippai) @ 2020-08-12T15:33:01.481+0000: <pre>I think using sync file I/O is a good compromise; Arrow and DataFusion don't perform low-latency or highly concurrent file I/O, at least not yet. Does "It is better to have a single thread per partition when executing queries." contradict "we do need the concept of executors and a scheduler in DataFusion"? What do you think about my initial concern regarding the maximum number of threads? Does limiting the concurrency or using a thread pool make sense? If I have a partitioned dataset (let's say 1000 or 10k files), each with 1000 columns, I should be able to read and process it without spawning that many threads _at once_.</pre>

Comment from Andy Grove (andygrove) @ 2020-08-12T15:58:30.506+0000: <pre>I did a slightly better job of explaining this in https://issues.apache.org/jira/browse/ARROW-9707: "The current threading model is very simple and does not scale. We currently use 1-2 dedicated threads per partition and they all run simultaneously, which is a huge problem if you have more partitions than logical or physical cores. This task is to re-implement the threading model so that query execution uses a fixed (configurable) number of threads.
Work will be broken down into stages and tasks and each in-process executor (running on a dedicated thread) will process its queue of tasks. This process will be driven by a scheduler."</pre>

Comment from Andy Grove (andygrove) @ 2020-08-23T15:06:44.701+0000: <pre>Issue resolved by pull request 8029 [https://github.com/apache/arrow/pull/8029]</pre>

Comment from Andy Grove (andygrove) @ 2020-08-26T02:14:08.199+0000: <pre>Issue resolved by pull request 8034 [https://github.com/apache/arrow/pull/8034]</pre>

Comment from Andy Grove (andygrove) @ 2020-10-03T18:25:46.419+0000: <pre>There are subtasks that are not complete yet. Reopening this for 3.0.0</pre>

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
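
To make the threading model quoted from ARROW-9707 more concrete, here is a minimal sketch of a fixed number of executor threads, each draining its own task queue, fed by a simple round-robin scheduler. It uses only the Rust standard library and is not DataFusion's actual implementation; the names `Task` and `run_on_executors` are hypothetical, and real tasks would process a partition rather than return a number.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical unit of work for illustration: "process one partition"
/// and return some numeric result.
pub type Task = Box<dyn FnOnce() -> usize + Send + 'static>;

/// Runs `tasks` on a fixed number of dedicated executor threads and
/// returns one result per task, in the order the tasks were given.
pub fn run_on_executors(num_executors: usize, tasks: Vec<Task>) -> Vec<usize> {
    assert!(num_executors > 0, "need at least one executor");
    let num_tasks = tasks.len();

    // Executors report (task index, result) pairs on a shared channel.
    let (result_tx, result_rx) = mpsc::channel::<(usize, usize)>();

    // One queue and one dedicated thread per executor.
    let mut queues = Vec::with_capacity(num_executors);
    let mut handles = Vec::with_capacity(num_executors);
    for _ in 0..num_executors {
        let (task_tx, task_rx) = mpsc::channel::<(usize, Task)>();
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || {
            // Drain this executor's queue until the scheduler closes it.
            for (index, task) in task_rx {
                result_tx
                    .send((index, task()))
                    .expect("result receiver dropped");
            }
        }));
        queues.push(task_tx);
    }
    // Drop the original sender so the result channel closes once
    // every executor has finished.
    drop(result_tx);

    // The "scheduler" in this sketch: assign tasks round-robin.
    for (index, task) in tasks.into_iter().enumerate() {
        queues[index % num_executors]
            .send((index, task))
            .expect("executor exited early");
    }
    // Closing the queues lets each executor's loop terminate.
    drop(queues);

    let mut results = vec![0; num_tasks];
    for (index, value) in result_rx {
        results[index] = value;
    }
    for handle in handles {
        handle.join().expect("executor panicked");
    }
    results
}
```

With this shape, 1000 partitions submitted to `run_on_executors(4, tasks)` are processed by exactly four threads rather than 1000, which is the property the quoted description asks for. A real scheduler would also need stages, dependencies between tasks, and a better placement policy than round-robin.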