>
> That wouldn't remove the feature from DuckDB, would it? It would just mean
> that we recognize that PyArrow expressions don't have well-defined
> semantics that we are committing to at this time.
>

That's a fair point, David. I would be fine excluding it from the protocol
initially and keeping the existing integrations in DuckDB, Polars, and
DataFusion "secret" or "not officially supported" for the time being. At
the very least, documenting the pattern to get an Arrow C stream will be a
step forward.
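
For concreteness, here is a rough sketch of that pattern as it works with
PyArrow today (the path, columns, and filter below are just placeholders):

    import pyarrow.dataset as ds

    # Open a dataset and push column selection and row filtering into the scan.
    dataset = ds.dataset("data/", format="parquet")
    scanner = dataset.scanner(columns=["a", "b"], filter=ds.field("a") > 0)

    # The scanner can be exposed as a RecordBatchReader, which consumers can
    # then read over the Arrow C stream interface.
    reader = scanner.to_reader()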

Best,

Will Jones

On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane <jke...@gmail.com> wrote:

> > I would understand this objection more if DuckDB hadn't been relying on
> > being able to pass PyArrow expressions for 18 months now [1]. Unless, do
> > we just think this isn't widely used enough that we don't care?
>
> This isn't a pro or a con of specifically adopting the PyArrow expression
> semantics as is / with a warning about changing / not at all, but having
> some kind of standardization in this interface would be very nice. This
> even came up while collaborating with the DuckDB folks: using some of
> the expression bits here (and in the R equivalents) was a little bit odd,
> and having something like a proper API for that would have made it
> more natural (and likely that would have been used had it existed 18 months
> ago :))
>
> -Jon
>
>
> On Wed, Jun 28, 2023 at 1:17 PM David Li <lidav...@apache.org> wrote:
>
> > That wouldn't remove the feature from DuckDB, would it? It would just mean
> > that we recognize that PyArrow expressions don't have well-defined
> > semantics that we are committing to at this time. As long as we have
> > `**kwargs` everywhere, we can in the future introduce a
> > `substrait_filter_expression` or similar argument, while allowing current
> > implementors to handle `filter` if possible. (As a compromise, we could
> > reserve `filter` and existing arguments and note that PyArrow Expression
> > semantics are subject to change without notice?)
> >
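For concreteness, a minimal sketch of the forward-compatible signature David
is describing; the class, method, and argument names here are hypothetical,
not part of any agreed protocol:

    class DatasetLike:
        def scanner(self, *, columns=None, filter=None, **kwargs):
            # Accepting **kwargs means a later revision could pass, e.g.,
            # substrait_filter_expression=... without breaking implementations
            # that only understand `filter` today; unrecognized keyword
            # arguments can be ignored or rejected explicitly.
            ...
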
> > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > Hi Ian,
> > >
> > >
> > >> I favor option 2 out of concern that option 1 could create a
> > >> temptation for users of this protocol to depend on a feature that we
> > >> intend to deprecate.
> > >>
> > >
> > > I would understand this objection more if DuckDB hadn't been relying on
> > > being able to pass PyArrow expressions for 18 months now [1]. Unless, do
> > > we just think this isn't widely used enough that we don't care?
> > >
> > > Best,
> > > Will
> > >
> > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > >
> > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
> > >
> > >> > I think there are three routes we can go here:
> > >> >
> > >> > 1. We keep PyArrow expressions in the API initially, but once we have
> > >> > Substrait-based alternatives we deprecate the PyArrow expression support.
> > >> > This is what I intended with the current design, and I think it provides
> > >> > the most obvious migration paths for existing producers and consumers.
> > >> > 2. We keep the overall dataset API, but don't introduce the filter and
> > >> > projection arguments until we have Substrait support. I'm not sure what
> > >> > the migration path looks like for producers and consumers, but I think
> > >> > this just implicitly becomes the same as (1), but with worse documentation.
> > >> > 3. We write a protocol completely from scratch, that doesn't try to
> > >> > describe the existing dataset API. Producers and consumers would then
> > >> > migrate to use the new protocol and deprecate their existing dataset
> > >> > integrations. We could introduce a dunder method in that API (sort of like
> > >> > __arrow_array__) that would make the migration seamless from the end-user
> > >> > perspective.
> > >> >
> > >> > *Which do you all think is the best path forward?*
> > >>
> > >> I favor option 2 out of concern that option 1 could create a
> > >> temptation for users of this protocol to depend on a feature that we
> > >> intend to deprecate. I think option 2 also creates a stronger
> > >> motivation to complete the Substrait expression integration work,
> > >> which is underway in https://github.com/apache/arrow/pull/34834.
> > >>
> > >> Ian
> > >>
> > >>
> > >> On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > The trouble is that Dataset was not designed to serve as a
> > >> > > general-purpose unmaterialized dataframe. For example, the PyArrow
> > >> > > Dataset constructor [5] exposes options for specifying a list of
> > >> > > source files and a partitioning scheme, which are irrelevant for many
> > >> > > of the applications that Will anticipates. And some work is needed to
> > >> > > reconcile the methods of the PyArrow Dataset object [6] with the
> > >> > > methods of the Table object. Some methods like filter() are exposed by
> > >> > > both and behave lazily on Datasets and eagerly on Tables, as a user
> > >> > > might expect. But many other Table methods are not implemented for
> > >> > > Dataset though they potentially could be, and it is unclear where we
> > >> > > should draw the line between adding methods to Dataset vs. encouraging
> > >> > > new scanner implementations to expose options controlling what lazy
> > >> > > operations should be performed as they see fit.
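
As a rough illustration of that lazy/eager difference with a recent PyArrow
(a tiny in-memory example, for illustration only):

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"a": [1, 2, 3]})
    dataset = ds.dataset(table)  # in-memory dataset, just for illustration

    # Eager: Table.filter() materializes a new, filtered Table immediately.
    small = table.filter(ds.field("a") > 1)

    # Lazy: Dataset.filter() only records the predicate; rows are pruned when
    # the dataset is actually scanned, e.g. via to_table() or a scanner.
    lazy = dataset.filter(ds.field("a") > 1)
    result = lazy.to_table()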
> > >> >
> > >> > In my mind there is a distinction between the "compute domain" (e.g. a
> > >> > pandas dataframe or something like ibis or SQL) and the "data domain"
> > >> > (e.g. pyarrow datasets).  I think, in a perfect world, you could push any
> > >> > and all compute up and down the chain as far as possible.  However, in
> > >> > practice, I think there is a healthy set of tools and libraries that say
> > >> > "simple column projection and filtering is good enough".  I would argue
> > >> > that there is room for both APIs and while the temptation is always
> > >> > present to "shove as much compute as you can" I think pyarrow datasets
> > >> > seem to have found a balance between the two that users like.
> > >> >
> > >> > So I would argue that this protocol may never become a general-purpose
> > >> > unmaterialized dataframe and that isn't necessarily a bad thing.
> > >> >
> > >> > > they are splittable and serializable, so that fragments can be
> > >> > > distributed amongst processes / workers.
> > >> >
> > >> > Just to clarify, the proposal currently only requires the fragments to be
> > >> > serializable, correct?
> > >> >
> > >> > On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Thanks Ian for your extensive feedback.
> > >> > >
> > >> > > I strongly agree with the comments made by David,
> > >> > > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > >> > > > expressions in this API. Expressions are an implementation detail of
> > >> > > > PyArrow, not a part of the Arrow standard. It would be much safer for
> > >> > > > the initial version of this protocol to not define *any*
> > >> > > > methods/arguments that take expressions.
> > >> > > >
> > >> > >
> > >> > > I would agree with this point, if we were starting from scratch. But
> > >> > > one of my goals is for this protocol to be descriptive of the existing
> > >> > > dataset integrations in the ecosystem, which all currently rely on
> > >> > > PyArrow expressions. For example, you'll notice in the PR that there are
> > >> > > unit tests to verify the current PyArrow Dataset classes conform to this
> > >> > > protocol, without changes.
> > >> > >
> > >> > > I think there are three routes we can go here:
> > >> > >
> > >> > > 1. We keep PyArrow expressions in the API initially, but once we have
> > >> > > Substrait-based alternatives we deprecate the PyArrow expression support.
> > >> > > This is what I intended with the current design, and I think it provides
> > >> > > the most obvious migration paths for existing producers and consumers.
> > >> > > 2. We keep the overall dataset API, but don't introduce the filter and
> > >> > > projection arguments until we have Substrait support. I'm not sure what
> > >> > > the migration path looks like for producers and consumers, but I think
> > >> > > this just implicitly becomes the same as (1), but with worse
> > >> > > documentation.
> > >> > > 3. We write a protocol completely from scratch, that doesn't try to
> > >> > > describe the existing dataset API. Producers and consumers would then
> > >> > > migrate to use the new protocol and deprecate their existing dataset
> > >> > > integrations. We could introduce a dunder method in that API (sort of
> > >> > > like __arrow_array__) that would make the migration seamless from the
> > >> > > end-user perspective.
> > >> > >
> > >> > > *Which do you all think is the best path forward?*
> > >> > >
> > >> > > Another concern I have is that we have not fully explained why we want
> > >> > > > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > >> > > > protocol. I would like to see an explanation of why RecordBatchReader
> > >> > > > is not sufficient for this. RecordBatchReader seems like another
> > >> > > > possible way to represent "unmaterialized dataframes" and there are
> > >> > > > some parallels between RecordBatch/RecordBatchReader and
> > >> > > > Fragment/Dataset.
> > >> > > >
> > >> > >
> > >> > > This is a good point. I can add a section describing the differences.
> > >> > > The main ones I can think of are that: (1) Datasets are "pruneable": one
> > >> > > can select a subset of columns and apply a filter on rows to avoid IO
> > >> > > and (2) they are splittable and serializable, so that fragments can be
> > >> > > distributed amongst processes / workers.
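
As a rough sketch of how (2) works in practice today with PyArrow (the path,
columns, and filter below are placeholders; this relies on fragments being
picklable):

    import pickle
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # Split the dataset into fragments and serialize them, e.g. to hand out
    # to worker processes.
    payloads = [pickle.dumps(frag) for frag in dataset.get_fragments()]

    # On a worker: deserialize one fragment and scan it, still pruning
    # columns and rows before doing IO.
    fragment = pickle.loads(payloads[0])
    table = fragment.to_table(columns=["a"], filter=ds.field("a") > 0)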
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Will Jones
> > >> > >
> > >> > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org>
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks Will for this proposal!
> > >> > > >
> > >> > > > For anyone familiar with PyArrow, this idea has a clear intuitive
> > >> > > > logic to it. It provides an expedient solution to the current lack of
> > >> > > > a practical means for interchanging "unmaterialized dataframes"
> > >> > > > between different Python libraries.
> > >> > > >
> > >> > > > To elaborate on that: If you look at how people use the Arrow Dataset
> > >> > > > API—which is implemented in the Arrow C++ library [1] and has bindings
> > >> > > > not just for Python [2] but also for Java [3] and R [4]—you'll see
> > >> > > > that Dataset is often used simply as a "virtual" variant of Table. It
> > >> > > > is used in cases when the data is larger than memory or when it is
> > >> > > > desirable to defer reading (materializing) the data into memory.
> > >> > > >
> > >> > > > So we can think of a Table as a materialized dataframe and a Dataset
> > >> > > > as an unmaterialized dataframe. That aspect of Dataset is I think what
> > >> > > > makes it most attractive as a protocol for enabling interoperability:
> > >> > > > it allows libraries to easily "speak Arrow" in cases where
> > >> > > > materializing the full data in memory upfront is impossible or
> > >> > > > undesirable.
> > >> > > >
> > >> > > > The trouble is that Dataset was not designed to serve as a
> > >> > > > general-purpose unmaterialized dataframe. For example, the PyArrow
> > >> > > > Dataset constructor [5] exposes options for specifying a list of
> > >> > > > source files and a partitioning scheme, which are irrelevant for many
> > >> > > > of the applications that Will anticipates. And some work is needed to
> > >> > > > reconcile the methods of the PyArrow Dataset object [6] with the
> > >> > > > methods of the Table object. Some methods like filter() are exposed by
> > >> > > > both and behave lazily on Datasets and eagerly on Tables, as a user
> > >> > > > might expect. But many other Table methods are not implemented for
> > >> > > > Dataset though they potentially could be, and it is unclear where we
> > >> > > > should draw the line between adding methods to Dataset vs. encouraging
> > >> > > > new scanner implementations to expose options controlling what lazy
> > >> > > > operations should be performed as they see fit.
> > >> > > >
> > >> > > > Will, I see that you've already addressed this issue to some extent
> > >> > > > in your proposal. For example, you mention that we should initially
> > >> > > > define this protocol to include only a minimal subset of the Dataset
> > >> > > > API. I agree, but I think there are some loose ends we should be
> > >> > > > careful to tie up. I strongly agree with the comments made by David,
> > >> > > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > >> > > > expressions in this API. Expressions are an implementation detail of
> > >> > > > PyArrow, not a part of the Arrow standard. It would be much safer for
> > >> > > > the initial version of this protocol to not define *any*
> > >> > > > methods/arguments that take expressions. This will allow us to take
> > >> > > > some more time to finish up the Substrait expression implementation
> > >> > > > work that is underway [7][8], then introduce Substrait-based
> > >> > > > expressions in a later version of this protocol. This approach will
> > >> > > > better position this protocol to be implemented in other languages
> > >> > > > besides Python.
> > >> > > >
> > >> > > > Another concern I have is that we have not fully explained why we
> > >> > > > want to use Dataset instead of RecordBatchReader [9] as the basis of
> > >> > > > this protocol. I would like to see an explanation of why
> > >> > > > RecordBatchReader is not sufficient for this. RecordBatchReader seems
> > >> > > > like another possible way to represent "unmaterialized dataframes"
> > >> > > > and there are some parallels between RecordBatch/RecordBatchReader and
> > >> > > > Fragment/Dataset. We should help developers and users understand why
> > >> > > > Arrow needs both of these.
> > >> > > >
> > >> > > > Thanks Will for your thoughtful prose explanations about this
> > >> > > > proposed API. After we arrive at a decision about this, I think we
> > >> > > > should reproduce some of these explanations in docs, blog posts,
> > >> > > > cookbook recipes, etc. because there is some important nuance here
> > >> > > > that will be important for integrators of this API to understand.
> > >> > > >
> > >> > > > Ian
> > >> > > >
> > >> > > > [1] https://arrow.apache.org/docs/cpp/api/dataset.html
> > >> > > > [2] https://arrow.apache.org/docs/python/dataset.html
> > >> > > > [3] https://arrow.apache.org/docs/java/dataset.html
> > >> > > > [4] https://arrow.apache.org/docs/r/articles/dataset.html
> > >> > > > [5]
> > >> > > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
> > >> > > > [6]
> > >> > > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
> > >> > > > [7] https://github.com/apache/arrow/issues/33985
> > >> > > > [8] https://github.com/apache/arrow/issues/34252
> > >> > > > [9]
> > >> > > > https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
> > >> > > >
> > >> > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com>
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > Hello Arrow devs,
> > >> > > > >
> > >> > > > > I have drafted a PR defining an experimental protocol which would
> > >> > > > > allow third-party libraries to imitate the PyArrow Dataset API [5].
> > >> > > > > This protocol is intended to endorse an integration pattern that is
> > >> > > > > starting to be used in the Python ecosystem, where some libraries
> > >> > > > > are providing their own scanners with this API, while query engines
> > >> > > > > are accepting these as duck-typed objects.
> > >> > > > >
> > >> > > > > To give some background: back at the end of 2021, we collaborated
> > >> > > > > with DuckDB to be able to read datasets (an Arrow C++ concept),
> > >> > > > > supporting column selection and filter pushdown. This was
> > >> > > > > accomplished by having DuckDB manipulate Python (or R) objects to
> > >> > > > > get a RecordBatchReader and then export it over the C Stream
> > >> > > > > Interface.
> > >> > > > >
> > >> > > > > Since then, DataFusion [2] and Polars have both made similar
> > >> > > > > implementations for their Python bindings, allowing them to consume
> > >> > > > > PyArrow datasets. This has created an implicit protocol, whereby
> > >> > > > > arbitrary compute engines can push down queries into the PyArrow
> > >> > > > > dataset scanner.
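
A rough sketch of what that implicit protocol looks like from the engine side
(the helper and argument names below are hypothetical; any object that
duck-types the Dataset scanner method would work the same way):

    import pyarrow.dataset as ds

    def read_pushed_down(dataset_like, columns, predicate):
        # dataset_like may be a pyarrow.dataset.Dataset or any third-party
        # object exposing a compatible scanner() method.
        scanner = dataset_like.scanner(columns=columns, filter=predicate)
        return scanner.to_reader()  # stream of record batches for the engine

    reader = read_pushed_down(ds.dataset("data/"), ["a", "b"], ds.field("a") > 0)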
> > >> > > > >
> > >> > > > > Now, libraries supporting table formats including Delta Lake,
> > >> > > > > Lance, and Iceberg are looking to be able to support these engines,
> > >> > > > > while bringing their own scanners and metadata handling
> > >> > > > > implementations. One possible route is allowing them to imitate the
> > >> > > > > PyArrow datasets API.
> > >> > > > >
> > >> > > > > Bringing these use cases together, I'd like to propose an
> > >> > > > > experimental protocol, made out of the minimal subset of the PyArrow
> > >> > > > > Dataset API necessary to facilitate this kind of integration. This
> > >> > > > > would allow any library to produce a scanner implementation that
> > >> > > > > arbitrary query engines could call into. I've drafted a PR [3] and
> > >> > > > > there is some background research available in a google doc [4].
> > >> > > > >
> > >> > > > > I've already gotten some good feedback on both, and would welcome
> > >> > > > > more.
> > >> > > > >
> > >> > > > > One last point: I'd like for this to be a first step rather than a
> > >> > > > > comprehensive API. This PR focuses on making explicit a protocol
> > >> > > > > that is already in use in the ecosystem, but without much concrete
> > >> > > > > definition. Once this is established, we can use our experience from
> > >> > > > > this protocol to design something more permanent that takes
> > >> > > > > advantage of newer innovations in the Arrow ecosystem (such as the
> > >> > > > > PyCapsule for C Data Interface or Substrait for passing expressions
> > >> > > > > / scan plans). I am tracking such future improvements in [5].
> > >> > > > >
> > >> > > > > Best,
> > >> > > > >
> > >> > > > > Will Jones
> > >> > > > >
> > >> > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > >> > > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > >> > > > > [3] https://github.com/apache/arrow/pull/35568
> > >> > > > > [4]
> > >> > > > > https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > >> > > > > [5]
> > >> > > > > https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit
> > >> > > >
> > >> > >
> > >>
> >
>
