[GitHub] [arrow] andygrove commented on pull request #8283: ARROW-9707: [Rust] [DataFusion] DataFusion Scheduler Prototype [WIP]

GitBox Sun, 27 Sep 2020 09:19:43 -0700


andygrove commented on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-699655553



   @jorgecarleitao Async/await helps a lot, but we also need our own scheduler 
to orchestrate how a query is executed. I am going to write up my reasoning in 
more detail soon, but here is one example. The TPC-H query I am testing runs 
against a data set of 240 Parquet files. If we just try to run everything at 
once with async/await and let tokio do the scheduling, we end up with all 240 
files open at once and reads happening against all of them, which is 
inefficient. It is better to work through the files in batches, processing a 
smaller number concurrently (better use of page caches, fewer open file 
handles, etc.). 
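
To make the idea of capping concurrency concrete, here is a minimal sketch. It is not the scheduler from this PR: it uses plain `std` threads instead of async/await and tokio so that it stays self-contained, and `scan_file` and `scan_with_limit` are hypothetical names invented for illustration. A fixed number of workers pull file indices from a shared counter, so no more than `max_concurrent` files are ever in flight.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for scanning one Parquet file.
/// Returns a made-up row count derived from the file index.
fn scan_file(file_id: usize) -> usize {
    file_id % 7 + 1
}

/// Scan `num_files` files with at most `max_concurrent` in flight.
/// Returns (total rows, peak number of files in flight at once).
fn scan_with_limit(num_files: usize, max_concurrent: usize) -> (usize, usize) {
    assert!(max_concurrent > 0, "need at least one worker");

    let next = Arc::new(AtomicUsize::new(0)); // next file index to claim
    let in_flight = Arc::new(AtomicUsize::new(0)); // files being scanned now
    let peak = Arc::new(AtomicUsize::new(0)); // highest in_flight observed
    let total = Arc::new(AtomicUsize::new(0)); // accumulated row count

    // The worker count is the concurrency limit: each worker handles
    // one file at a time and then claims the next unprocessed index.
    let workers: Vec<_> = (0..max_concurrent.min(num_files))
        .map(|_| {
            let next = Arc::clone(&next);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            let total = Arc::clone(&total);
            thread::spawn(move || loop {
                let id = next.fetch_add(1, Ordering::SeqCst);
                if id >= num_files {
                    break; // every file has been claimed
                }
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                total.fetch_add(scan_file(id), Ordering::SeqCst);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("worker thread panicked");
    }

    (total.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}
```

Calling `scan_with_limit(240, 8)` processes all 240 files while the reported peak never exceeds 8. An async version could presumably get the same bound with a semaphore or a bounded stream, but deciding which files to run and in what order is the part that would still need a scheduler.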


