On 29/04/2020 at 23:30, David Li wrote:
> Sure -
>
> The use case is to read a large partitioned dataset, consisting of
> tens or hundreds of Parquet files. A reader expects to scan through
> the data in order of the partition key. However, to improve
> performance, we'd like to begin loading files N+1, N+2, ..., N+k
> while the consumer is still reading file N, so that it doesn't have to
> wait every time it opens a new file, and to help hide any latency or
> slowness that might be happening on the backend. We also don't want to
> be in a situation where file N+2 is ready but file N+1 isn't, because
> that doesn't help us (we still have to wait for N+1 to load).
But depending on network conditions, you may very well get file N+2
before N+1, even if you start loading it after...

> This is why I mention the project is quite similar to the Datasets
> project - Datasets likely covers all the functionality we would
> eventually need.

Except that datasets are essentially unordered.

Regards

Antoine.
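[Editor's note: a minimal sketch of the ordered readahead David describes, assuming a Python consumer and pyarrow. The helper name, readahead depth, and use of a thread pool are illustrative assumptions, not the implementation discussed in the thread.]

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq


def read_in_order(paths, readahead=4):
    """Yield tables for `paths` in order, prefetching up to `readahead` files."""
    with ThreadPoolExecutor(max_workers=readahead) as pool:
        pending = deque()
        it = iter(paths)
        # Prime the window: start loading the first `readahead` files.
        for path in it:
            pending.append(pool.submit(pq.read_table, path))
            if len(pending) >= readahead:
                break
        while pending:
            # Always hand back the oldest pending file first, so the consumer
            # sees file N before N+1 even if N+1 (or N+2) finished loading earlier.
            yield pending.popleft().result()
            next_path = next(it, None)
            if next_path is not None:
                pending.append(pool.submit(pq.read_table, next_path))
```

Because results are taken from the front of the queue, an out-of-order completion (N+2 ready before N+1) simply sits in the window until its turn, which matches the ordering requirement in the quoted use case.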
