Re: [Discuss] Storing metadata about the "sortedness" of data

Jorge Cardoso Leitão Tue, 11 May 2021 11:25:02 -0700

So, I think that both cases can be accomplished within DataFusion itself:

* When the data is sorted at rest, we can add a method to the TableProvider
to share this information with the query engine, like we do with
partitioning.
* When the data is sorted via some physical node / operation during the
execution, we can expose this information on the execution plan, the same
way we do with partitioning, as Andy suggested.
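
To make the first bullet concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: `SortColumn`, `SortedSource`, `output_ordering` and `sort_is_redundant` are names made up for illustration and are not part of DataFusion's API. The idea is only that a source reports its ordering (if known) and the planner drops a sort whose keys are a prefix of that ordering.

```rust
/// One column of a sort key: the column name plus its direction.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct SortColumn {
    pub name: String,
    pub descending: bool,
}

/// Hypothetical stand-in for a table provider; NOT DataFusion's actual trait.
pub trait SortedSource {
    /// Sort order of the data at rest, or `None` if unknown or unsorted.
    fn output_ordering(&self) -> Option<Vec<SortColumn>>;
}

/// Returns true when the requested sort keys are a prefix of the ordering
/// the source already guarantees, so an explicit sort could be omitted.
pub fn sort_is_redundant(source: &dyn SortedSource, required: &[SortColumn]) -> bool {
    match source.output_ordering() {
        Some(existing) => existing.starts_with(required),
        None => false,
    }
}

/// Example source: assumed to hold a parquet file already sorted by `time` ASC.
pub struct ParquetByTime;

impl SortedSource for ParquetByTime {
    fn output_ordering(&self) -> Option<Vec<SortColumn>> {
        Some(vec![SortColumn {
            name: "time".to_string(),
            descending: false,
        }])
    }
}
```

With this, a query that asks for `ORDER BY time` over `ParquetByTime` would not need a sort node, while `ORDER BY time DESC` still would.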


I think that a use-case for declaring an order on the spec would be to
share this piece of metadata across implementations. For example, if we
share a RecordBatch between C++ and Rust via IPC, we would want a contract
stating that the batch is sorted by certain columns (e.g. X ASC and Y DESC).
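
Without anything in the spec, each side would have to agree on a private convention, for instance a string stored in the schema's custom key-value metadata. The sketch below shows one such convention using only the standard library. The key name `sort_order` and the `"X ASC,Y DESC"` encoding are assumptions of mine, not anything defined by Arrow, and the format breaks down if a column name contains a comma.

```rust
use std::collections::HashMap;

/// One sort key: column name plus direction. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
}

/// Hypothetical metadata key; NOT part of the Arrow specification.
pub const SORT_ORDER_KEY: &str = "sort_order";

/// Encodes sort keys as e.g. "X ASC,Y DESC".
pub fn encode_sort_order(keys: &[SortKey]) -> String {
    keys.iter()
        .map(|k| format!("{} {}", k.column, if k.descending { "DESC" } else { "ASC" }))
        .collect::<Vec<_>>()
        .join(",")
}

/// Reads the sort order back from a string-to-string metadata map.
/// Returns `None` if the key is missing or any entry is malformed.
pub fn decode_sort_order(metadata: &HashMap<String, String>) -> Option<Vec<SortKey>> {
    let raw = metadata.get(SORT_ORDER_KEY)?;
    raw.split(',')
        .map(|part| {
            // Split on the last space so column names may contain spaces.
            let (column, direction) = part.trim().rsplit_once(' ')?;
            let descending = match direction {
                "ASC" => false,
                "DESC" => true,
                _ => return None,
            };
            Some(SortKey {
                column: column.to_string(),
                descending,
            })
        })
        .collect()
}
```

The writer would put `encode_sort_order(..)` under `SORT_ORDER_KEY` before sending, and the reader would call `decode_sort_order` on the received metadata. The weakness is that both sides must know the convention out of band, which is the argument for putting it in the spec.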

Best,
Jorge



On Tue, May 11, 2021 at 8:14 PM Andrew Lamb <al...@influxdata.com> wrote:

> I was imagining something known at query planning time (e.g., if the data you
> are reading in from a parquet file is already sorted by `time` and the
> query calls for sorting by time, the sort can be omitted). In this case, I
> was thinking "how would we communicate this information to DataFusion from
> a TableProvider?"
>
> Another use case for sortedness is merging two parquet files into a single
> sorted output: if you know the inputs are already sorted, you can simply
> merge the two streams together and save quite a lot of processing time and
> intermediate buffers.
>
>
>
> On Tue, May 11, 2021 at 2:01 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > I had been planning on adding a method to DataFusion's execution plan to
> > indicate the sort-order of the results (if known), similar to how we
> > currently have information about output partitioning.
> >
> > Would this cover your requirement or are you looking for something outside
> > the context of execution plans?
> >
> > On Tue, May 11, 2021 at 11:52 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > We are building a system that will likely make heavy use of sorted data,
> > > and we are trying to figure out how to encode the metadata of "how is this
> > > data sorted". We can certainly use our own custom metadata fields, but
> > > wanted to check for prior art and gauge community interest in adding
> > > something to Arrow. More details are on [1].
> > >
> > > Recording sort-order in Schema would likely be useful for DataFusion as
> > > well (to optimize away redundant computation if the data is already sorted
> > > or pick more efficient algorithms, e.g. a MERGING grouping operator).
> > >
> > > I didn't see any obvious prior art on the mailing list [2] or in JIRA
> > > [3][4] so I figured I would ask if others had any backstory or other
> > > reactions.
> > >
> > > Thank you
> > > Andrew
> > >
> > >
> > >
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/284
> > > [2] https://lists.apache.org/list.html?dev@arrow.apache.org:lte=1y:sort
> > > [3] https://issues.apache.org/jira/browse/ARROW-12087?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20summary%20~%20sort%20ORDER%20BY%20created%20DESC
> > > [4] https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20description%20~%20sort%20and%20component%20in%20(format)
> > >
> >
>
