On 29/04/2020 at 23:30, David Li wrote:
> Sure -
> 
> The use case is to read a large partitioned dataset, consisting of
> tens or hundreds of Parquet files. A reader expects to scan through
> the data in order of the partition key. However, to improve
> performance, we'd like to begin loading files N+1, N+2, ... N + k
> while the consumer is still reading file N, so that it doesn't have to
> wait every time it opens a new file, and to help hide any latency or
> slowness that might be happening on the backend. We also don't want to
> be in a situation where file N+2 is ready but file N+1 isn't, because
> that doesn't help us (we still have to wait for N+1 to load).

But depending on network conditions, you may very well receive file N+2
before file N+1, even if you started loading it later...
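
For concreteness, here is a minimal sketch of the kind of ordered
readahead described above (the read_ordered helper and the readahead
window size are made up for illustration; it just drives
pyarrow.parquet.read_table from a thread pool). Because the futures are
consumed in submission order, a file N+2 that finishes early simply
waits in the window until file N+1 has been delivered:

    import concurrent.futures as cf
    from collections import deque

    import pyarrow.parquet as pq

    def read_ordered(paths, readahead=4):
        # Yield one table per file, in path order, while up to
        # `readahead` files are being loaded in the background.
        with cf.ThreadPoolExecutor(max_workers=readahead) as pool:
            pending = deque()
            it = iter(paths)
            # Prime the readahead window.
            for path in it:
                pending.append(pool.submit(pq.read_table, path))
                if len(pending) >= readahead:
                    break
            while pending:
                # Always block on the oldest future: if N+2 finishes
                # before N+1, its result just sits in the window until
                # N+1 has been consumed.
                table = pending.popleft().result()
                try:
                    pending.append(pool.submit(pq.read_table, next(it)))
                except StopIteration:
                    pass
                yield table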

> This is why I mention the project is quite similar to the Datasets
> project - Datasets likely covers all the functionality we would
> eventually need.

Except that datasets are essentially unordered.

Regards

Antoine.
