andygrove commented on pull request #8283: URL: https://github.com/apache/arrow/pull/8283#issuecomment-700199462
That makes sense, and we already have some funky channel and thread interaction in the DataFusion parquet reader that we could probably adapt fairly easily. We could introduce a config setting for max concurrent parquet readers.

On Mon, Sep 28, 2020 at 12:12 PM Andrew Lamb <[email protected]> wrote:

> When I run the TPC-H query I am testing against a data set that has 240
> Parquet files. If we just try and run everything at once with async/await
> and have tokio do the scheduling, we will end up with 240 files open at
> once with reads happening against all of them, which is inefficient.
>
> One way to avoid this type of resource usage explosion is if the Parquet
> reader itself limits the number of outstanding Tasks that it submits. For
> example, with a tokio channel or something.
>
> It seems to me the challenge is not really "scheduling" per se, but more
> "resource allocation"
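To make the "max concurrent parquet readers" idea concrete, here is a rough sketch of one way it could look. It is not the DataFusion reader code and uses only the standard library (threads and channels) rather than tokio. The names `read_file`, `scan_with_limit`, and `max_concurrent_readers` are hypothetical, and `read_file` is a stand-in that does no real Parquet I/O.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for reading one Parquet file.
/// It returns a fake "row count" (the path length) and does no I/O.
fn read_file(path: &str) -> usize {
    path.len()
}

/// Reads every path while keeping at most `max_concurrent_readers`
/// files in flight. Returns (total rows, peak simultaneous readers).
fn scan_with_limit(paths: Vec<String>, max_concurrent_readers: usize) -> (usize, usize) {
    assert!(max_concurrent_readers > 0);

    // Work queue shared by a fixed number of reader threads.
    let (work_tx, work_rx) = mpsc::channel::<String>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (result_tx, result_rx) = mpsc::channel::<usize>();

    // Counters used only to observe how many readers run at once.
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let mut workers = Vec::new();
    for _ in 0..max_concurrent_readers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        let active = Arc::clone(&active);
        let peak = Arc::clone(&peak);
        workers.push(thread::spawn(move || loop {
            // The lock is held only while taking the next path.
            let next = work_rx.lock().unwrap().recv();
            let path = match next {
                Ok(p) => p,
                Err(_) => break, // queue closed and drained
            };
            let now = active.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            let rows = read_file(&path);
            active.fetch_sub(1, Ordering::SeqCst);
            result_tx.send(rows).unwrap();
        }));
    }
    // Drop the original sender so the result channel closes
    // once every worker has exited.
    drop(result_tx);

    for p in paths {
        work_tx.send(p).unwrap();
    }
    drop(work_tx); // signal that no more work is coming

    let total: usize = result_rx.iter().sum();
    for w in workers {
        w.join().unwrap();
    }
    (total, peak.load(Ordering::SeqCst))
}
```

With 240 paths and a limit of 8, no more than 8 files are being read at any moment, however many files the query touches. The limit is a resource-allocation knob rather than a scheduling policy, which matches the framing in the quoted message. An async version could presumably get the same effect with a bounded channel or a semaphore.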
