On Thu, 30 Apr 2020 at 04:06, Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Apr 29, 2020 at 6:54 PM David Li <li.david...@gmail.com> wrote:
> >
> > Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
> > guaranteed to download all the files in order, but with more control,
> > you can make this more likely. You can also prevent the case where due
> > to scheduling, file N+1 doesn't even start downloading until after
> > file N+2, which can happen if you just submit all reads to a thread
> > pool, as demonstrated in the linked trace.
> >
> > And again, with this level of control, you can also decide to reduce
> > or increase parallelism based on network conditions, memory usage,
> > other readers, etc. So it is both about improving/smoothing out
> > performance, and limiting resource consumption.
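[The bounded, in-order prefetching described above can be sketched as follows. This is a minimal illustration, not Arrow code: `read_file`, the path list, and the window size are all assumptions for the example.]

```python
# Sketch: read files in order while prefetching up to `window` ahead.
# Because at most `window` reads are in flight, file N+1 is always
# scheduled before file N+2, and the window bounds memory use.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def read_in_order(paths, read_file, window=4):
    """Yield each file's contents in order, prefetching up to `window` ahead.

    `read_file` is a hypothetical callable that loads one file.
    """
    with ThreadPoolExecutor(max_workers=window) as pool:
        pending = deque()
        it = iter(paths)
        # Prime the window with the first `window` reads.
        for path in it:
            pending.append(pool.submit(read_file, path))
            if len(pending) >= window:
                break
        while pending:
            # Block on the oldest read first: even if file N+2 finishes
            # downloading before N+1, it is consumed in order.
            yield pending.popleft().result()
            # Refill the window with the next path, if any.
            for path in it:
                pending.append(pool.submit(read_file, path))
                break
```

[A finished later file simply waits in the deque until its predecessors are consumed, which is the "N+2 ready but N+1 isn't" case the thread is discussing; shrinking or growing `window` is where the parallelism control would hook in.]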
> >
> > Finally, I do not mean to propose that we necessarily build all of
> > this into Arrow, just that we would like to make it possible to
> > build this with Arrow, and that Datasets may find this interesting for
> > its optimization purposes, if concurrent reads are a goal.
> >
> > >  Except that datasets are essentially unordered.
> >
> > I did not realize this, but that means it's not really suitable for
> > our use case, unfortunately.
>
> It would be helpful to understand things a bit better so that we do
> not miss out on an opportunity to collaborate. I don't know that the
> current mode of some of the public Datasets APIs is a dogmatic
> view about how everything should always work, and it's possible that
> some relatively minor changes could allow you to use it. So let's try
> not to close any doors right now.
>

Note that a Dataset itself is actually ordered, AFAIK. Meaning: the list of
Fragments it is composed of is an ordered vector. It's only when, e.g.,
consuming scan tasks that the result might not be ordered (this is
currently the case in ToTable, but see
https://issues.apache.org/jira/browse/ARROW-8447 for an issue about
potentially changing this).


> > Thanks,
> > David
> >
> > On 4/29/20, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > On 29/04/2020 at 23:30, David Li wrote:
> > >> Sure -
> > >>
> > >> The use case is to read a large partitioned dataset, consisting of
> > >> tens or hundreds of Parquet files. A reader expects to scan through
> > >> the data in order of the partition key. However, to improve
> > >> performance, we'd like to begin loading files N+1, N+2, ... N + k
> > >> while the consumer is still reading file N, so that it doesn't have to
> > >> wait every time it opens a new file, and to help hide any latency or
> > >> slowness that might be happening on the backend. We also don't want to
> > >> be in a situation where file N+2 is ready but file N+1 isn't, because
> > >> that doesn't help us (we still have to wait for N+1 to load).
> > >
> > > But depending on network conditions, you may very well get file N+2
> > > before N+1, even if you start loading it after...
> > >
> > >> This is why I mention the project is quite similar to the Datasets
> > >> project - Datasets likely covers all the functionality we would
> > >> eventually need.
> > >
> > > Except that datasets are essentially unordered.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>
