The Python Substrait package[1] is on PyPI[2] and currently has Python
wrappers for the Substrait protobuf objects. I think this will be a great
opportunity to identify helper features that users of this protocol would
like to see. I'll be keeping an eye out as this develops, but also feel
free to file feature requests in the project!


[1]https://github.com/substrait-io/substrait-python
[2]https://pypi.org/project/substrait/
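
For anyone who wants to experiment, here is a minimal sketch of
round-tripping a Substrait Plan protobuf through the package. I'm
assuming the generated modules live under substrait.gen.proto, so check
the package layout before relying on this:

    from substrait.gen.proto.plan_pb2 import Plan  # assumed module path

    plan = Plan()
    plan.version.minor_number = 31  # spec version this plan targets
    data = plan.SerializeToString()  # standard protobuf serialization
    assert Plan.FromString(data).version.minor_number == 31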


On Thu, Aug 31, 2023 at 10:05 PM Will Jones <will.jones...@gmail.com> wrote:

> Hello Arrow devs,
>
> We discussed this further in the Arrow community call on 2023-08-30 [1],
> and concluded we should create an entirely new protocol that uses Substrait
> expressions. I have created an issue [2] to track this and will start a PR
> soon.
>
> It does look like we might block this on creating a PyCapsule-based
> protocol for arrays, schemas, and streams. That is tracked here [3].
> Hopefully that isn't too ambitious :)
>
> Best,
>
> Will Jones
>
>
> [1] https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> [2] https://github.com/apache/arrow/issues/37504
> [3] https://github.com/apache/arrow/issues/35531
>
>
> On Tue, Aug 29, 2023 at 2:59 PM Ian Cook <ianmc...@apache.org> wrote:
>
> > An update about this:
> >
> > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last
> > week. This makes it possible to convert PyArrow expressions to/from
> > Substrait expressions.
> >
> > As Fokko previously noted, the PR does not change the PyArrow Dataset
> > interface at all. It simply enables a Substrait expression to be
> > converted to a PyArrow expression, which can then be used to
> > filter/project a Dataset.
> >
> > There is a basic example here demonstrating this:
> > https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
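> >
> > For reference, a minimal sketch of the round trip (assuming a PyArrow
> > build that includes that PR; the module is pyarrow.substrait):
> >
> >     import pyarrow as pa
> >     import pyarrow.compute as pc
> >     import pyarrow.substrait as pas
> >
> >     schema = pa.schema([("x", pa.int64())])
> >     expr = pc.field("x") > 42
> >     # serialize a named, schema-bound expression to Substrait bytes
> >     buf = pas.serialize_expressions([expr], ["filter"], schema)
> >     # deserialize back into PyArrow expressions via BoundExpressions
> >     bound = pas.deserialize_expressions(buf)
> >     restored = bound.expressions["filter"]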
> >
> > We might now consider whether to build upon this to create a Dataset
> > protocol that is independent of the PyArrow Expression implementation
> > and that could interoperate across languages.
> >
> > Ian
> >
> > On Mon, Jul 3, 2023 at 5:48 PM Will Jones <will.jones...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > After thinking about it, I think I understand the approach David Li and
> > > Ian are suggesting with respect to expressions. There will be some
> > > arguments that only PyArrow's own datasets support, but that aren't in
> > > the generic protocol. Passing PyArrow expressions to the filters
> > > argument should be considered one of those. DuckDB and others are
> > > currently passing them down, so they aren't yet using the protocol
> > > properly. But once we add support in the protocol for passing filters
> > > via Substrait expressions, we'll move DuckDB and others over to be
> > > fully compliant with the protocol.
> > >
> > > It's a bit of an awkward temporary state for now, but so would having
> > > PyArrow expressions in the protocol just to be deprecated in a few
> > > months. One caveat is that we'll need to provide DuckDB and other
> > > consumers with a way to tell whether the dataset supports passing
> > > filters as Substrait expressions or PyArrow ones, since I doubt they'll
> > > want to lose support for integrating with older PyArrow versions.
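> > >
> > > To sketch the kind of check I mean (the attribute name below is
> > > purely hypothetical, not defined by any protocol yet):
> > >
> > >     def pick_filter(dataset, pyarrow_expr, substrait_expr):
> > >         # hypothetical capability flag a dataset could advertise
> > >         if getattr(dataset, "supports_substrait_filter", False):
> > >             return substrait_expr
> > >         return pyarrow_expr  # fall back for older PyArrow versions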
> > >
> > > I've removed filters from the protocol for now, with the intention of
> > > bringing them back as soon as we can get Substrait support. I think we
> > > can do this in the 14.0.0 release.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > >
> > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong <fo...@apache.org> wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > Chiming in here from the PyIceberg side. I would love to see the
> > > > protocol as proposed in the PR. I did a small test
> > > > <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>,
> > > > and it seems to be quite straightforward to implement and it brings a
> > > > lot of potential. Unsurprisingly, I'm leaning toward the first option:
> > > >
> > > > > 1. We keep PyArrow expressions in the API initially, but once we
> > > > > have Substrait-based alternatives we deprecate the PyArrow
> > > > > expression support. This is what I intended with the current
> > > > > design, and I think it provides the most obvious migration paths
> > > > > for existing producers and consumers.
> > > >
> > > > Let me give my vision on some of the concerns raised.
> > > >
> > > > > Will, I see that you've already addressed this issue to some extent
> > > > > in your proposal. For example, you mention that we should initially
> > > > > define this protocol to include only a minimal subset of the
> > > > > Dataset API. I agree, but I think there are some loose ends we
> > > > > should be careful to tie up. I strongly agree with the comments
> > > > > made by David, Weston, and Dewey arguing that we should avoid any
> > > > > use of PyArrow expressions in this API. Expressions are an
> > > > > implementation detail of PyArrow, not a part of the Arrow standard.
> > > > > It would be much safer for the initial version of this protocol to
> > > > > not define *any* methods/arguments that take expressions. This will
> > > > > allow us to take some more time to finish up the Substrait
> > > > > expression implementation work that is underway [7][8], then
> > > > > introduce Substrait-based expressions in a later version of this
> > > > > protocol. This approach will better position this protocol to be
> > > > > implemented in other languages besides Python.
> > > >
> > > > I'm confused here. Looking at GH-33985
> > > > <https://github.com/apache/arrow/pull/34834/files> I don't see any
> > > > new primitives being introduced for composing an expression. As I
> > > > understand it, in PyArrow the expression as it exists today will
> > > > continue to exist. In the case of inter-process communication, it
> > > > goes to Substrait, and then it gets deserialized into the native
> > > > expression construct (in PyIceberg, a BoundPredicate). I would say
> > > > that the protocol and Substrait are complementary.
> > > >
> > > > > Another concern I have is that we have not fully explained why we
> > > > > want to use Dataset instead of RecordBatchReader [9] as the basis
> > > > > of this protocol. I would like to see an explanation of why
> > > > > RecordBatchReader is not sufficient for this. RecordBatchReader
> > > > > seems like another possible way to represent "unmaterialized
> > > > > dataframes" and there are some parallels between
> > > > > RecordBatch/RecordBatchReader and Fragment/Dataset. We should help
> > > > > developers and users understand why Arrow needs both of these.
> > > >
> > > > Just to clarify, I think there are different use cases. For example,
> > > > Lance provides its own readers, but PyIceberg does not have any
> > > > intent to provide its own Parquet readers. Iceberg will generate the
> > > > list of files that need to be read, and do the
> > > > filtering/projection/deletes/etc. This would make the Dataset a
> > > > better choice than the RecordBatchReader.
> > > >
> > > > > That wouldn't remove the feature from DuckDB, would it? It would
> > > > > just mean that we recognize that PyArrow expressions don't have
> > > > > well-defined semantics that we are committing to at this time. As
> > > > > long as we have `**kwargs` everywhere, we can in the future
> > > > > introduce a `substrait_filter_expression` or similar argument,
> > > > > while allowing current implementors to handle `filter` if possible.
> > > > > (As a compromise, we could reserve `filter` and existing arguments
> > > > > and note that PyArrow Expression semantics are subject to change
> > > > > without notice?)
> > > >
> > > > I think we can even re-use the existing filter argument. The
> > > > signature would evolve from pc.Expression to Union[pc.Expression,
> > > > pas.BoundExpressions]. If we get a PyArrow expression, we'll convert
> > > > it to Substrait.
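> > > >
> > > > As a rough sketch of that evolution (assuming a PyArrow build with
> > > > the new pyarrow.substrait expression support; the method shape is
> > > > illustrative, not the protocol's exact signature):
> > > >
> > > >     from typing import Union
> > > >
> > > >     import pyarrow.compute as pc
> > > >     import pyarrow.substrait as pas
> > > >
> > > >     def scanner(self, *, columns=None,
> > > >                 filter: Union[pc.Expression, pas.BoundExpressions,
> > > >                               None] = None,
> > > >                 **kwargs):
> > > >         if isinstance(filter, pc.Expression):
> > > >             # legacy path: serialize the PyArrow expression to
> > > >             # Substrait and re-bind it against our schema
> > > >             buf = pas.serialize_expressions(
> > > >                 [filter], ["filter"], self.schema)
> > > >             filter = pas.deserialize_expressions(buf)
> > > >         ...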
> > > >
> > > > Concluding, I think we can do things in parallel, and I don't think
> > > > they are conflicting. I'm happy to contribute to the PyArrow side to
> > > > make this happen.
> > > >
> > > > Kind regards,
> > > > Fokko
> > > >
> > > > On Wed, Jun 28, 2023 at 22:47, Will Jones <will.jones...@gmail.com> wrote:
> > > >
> > > > > > That wouldn't remove the feature from DuckDB, would it? It would
> > > > > > just mean that we recognize that PyArrow expressions don't have
> > > > > > well-defined semantics that we are committing to at this time.
> > > > >
> > > > > That's a fair point, David. I would be fine excluding it from the
> > > > > protocol initially, and keeping the existing integrations in
> > > > > DuckDB, Polars, and DataFusion "secret" or "not officially
> > > > > supported" for the time being. At the very least, documenting the
> > > > > pattern to get an Arrow C stream will be a step forward.
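> > > > >
> > > > > The pattern itself is short; a minimal sketch (the path is
> > > > > illustrative, and _export_to_c is a private PyArrow API today):
> > > > >
> > > > >     import pyarrow.dataset as ds
> > > > >
> > > > >     dataset = ds.dataset("data/", format="parquet")
> > > > >     # scan lazily and expose the result as a RecordBatchReader
> > > > >     reader = dataset.scanner(columns=["a"]).to_reader()
> > > > >     # consumers can then pull the reader across the C Stream
> > > > >     # Interface, e.g. reader._export_to_c(stream_ptr)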
> > > > >
> > > > > Best,
> > > > >
> > > > > Will Jones
> > > > >
> > > > > On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane <jke...@gmail.com> wrote:
> > > > >
> > > > > > > I would understand this objection more if DuckDB hadn't been
> > > > > > > relying on being able to pass PyArrow expressions for 18 months
> > > > > > > now [1]. Unless, do we just think this isn't widely used enough
> > > > > > > that we don't care?
> > > > > >
> > > > > > This isn't a pro or a con of specifically adopting the PyArrow
> > > > > > expression semantics as is / with a warning about changing / not
> > > > > > at all, but having some kind of standardization in this interface
> > > > > > would be very nice. This even came up while collaborating with
> > > > > > the DuckDB folks: using some of the expression bits here (and in
> > > > > > the R equivalents) was a little bit odd, and having something
> > > > > > like a proper API for that would have made things more natural
> > > > > > (and it likely would have been used had it existed 18 months ago
> > > > > > :))
> > > > > >
> > > > > > -Jon
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 1:17 PM David Li <lidav...@apache.org> wrote:
> > > > > >
> > > > > > > That wouldn't remove the feature from DuckDB, would it? It
> > > > > > > would just mean that we recognize that PyArrow expressions
> > > > > > > don't have well-defined semantics that we are committing to at
> > > > > > > this time. As long as we have `**kwargs` everywhere, we can in
> > > > > > > the future introduce a `substrait_filter_expression` or similar
> > > > > > > argument, while allowing current implementors to handle
> > > > > > > `filter` if possible. (As a compromise, we could reserve
> > > > > > > `filter` and existing arguments and note that PyArrow
> > > > > > > Expression semantics are subject to change without notice?)
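> > > > > > >
> > > > > > > Roughly like this (the substrait_filter_expression spelling is
> > > > > > > a placeholder, nothing settled):
> > > > > > >
> > > > > > >     def scanner(self, *, filter=None, **kwargs):
> > > > > > >         # today: filter is a pyarrow.compute.Expression with
> > > > > > >         # semantics we are not committing to
> > > > > > >         substrait_filter = kwargs.pop(
> > > > > > >             "substrait_filter_expression", None)
> > > > > > >         # a later revision can prefer the Substrait form when
> > > > > > >         # both are given, without breaking older callers
> > > > > > >         ...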
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > > > > > > Hi Ian,
> > > > > > > >
> > > > > > > >> I favor option 2 out of concern that option 1 could create a
> > > > > > > >> temptation for users of this protocol to depend on a feature
> > > > > > > >> that we intend to deprecate.
> > > > > > > >
> > > > > > > > I would understand this objection more if DuckDB hadn't been
> > > > > > > > relying on being able to pass PyArrow expressions for 18
> > > > > > > > months now [1]. Unless, do we just think this isn't widely
> > > > > > > > used enough that we don't care?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Will
> > > > > > > >
> > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > >
> > > > > > > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > >
> > > > > > > >> > I think there are three routes we can go here:
> > > > > > > >> >
> > > > > > > >> > 1. We keep PyArrow expressions in the API initially, but
> > > > > > > >> > once we have Substrait-based alternatives we deprecate the
> > > > > > > >> > PyArrow expression support. This is what I intended with
> > > > > > > >> > the current design, and I think it provides the most
> > > > > > > >> > obvious migration paths for existing producers and
> > > > > > > >> > consumers.
> > > > > > > >> > 2. We keep the overall dataset API, but don't introduce
> > > > > > > >> > the filter and projection arguments until we have
> > > > > > > >> > Substrait support. I'm not sure what the migration path
> > > > > > > >> > looks like for producers and consumers, but I think this
> > > > > > > >> > just implicitly becomes the same as (1), but with worse
> > > > > > > >> > documentation.
> > > > > > > >> > 3. We write a protocol completely from scratch, that
> > > > > > > >> > doesn't try to describe the existing dataset API.
> > > > > > > >> > Producers and consumers would then migrate to use the new
> > > > > > > >> > protocol and deprecate their existing dataset
> > > > > > > >> > integrations. We could introduce a dunder method in that
> > > > > > > >> > API (sort of like __arrow_array__) that would make the
> > > > > > > >> > migration seamless from the end-user perspective.
> > > > > > > >> >
> > > > > > > >> > *Which do you all think is the best path forward?*
> > > > > > > >>
> > > > > > > >> I favor option 2 out of concern that option 1 could create a
> > > > > > > >> temptation for users of this protocol to depend on a feature
> > > > > > > >> that we intend to deprecate. I think option 2 also creates a
> > > > > > > >> stronger motivation to complete the Substrait expression
> > > > > > > >> integration work, which is underway in
> > > > > > > >> https://github.com/apache/arrow/pull/34834.
> > > > > > > >>
> > > > > > > >> Ian
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > The trouble is that Dataset was not designed to serve
> > > > > > > >> > > as a general-purpose unmaterialized dataframe. For
> > > > > > > >> > > example, the PyArrow Dataset constructor [5] exposes
> > > > > > > >> > > options for specifying a list of source files and a
> > > > > > > >> > > partitioning scheme, which are irrelevant for many of
> > > > > > > >> > > the applications that Will anticipates. And some work is
> > > > > > > >> > > needed to reconcile the methods of the PyArrow Dataset
> > > > > > > >> > > object [6] with the methods of the Table object. Some
> > > > > > > >> > > methods like filter() are exposed by both and behave
> > > > > > > >> > > lazily on Datasets and eagerly on Tables, as a user
> > > > > > > >> > > might expect. But many other Table methods are not
> > > > > > > >> > > implemented for Dataset though they potentially could
> > > > > > > >> > > be, and it is unclear where we should draw the line
> > > > > > > >> > > between adding methods to Dataset vs. encouraging new
> > > > > > > >> > > scanner implementations to expose options controlling
> > > > > > > >> > > what lazy operations should be performed as they see
> > > > > > > >> > > fit.
> > > > > > > >> >
> > > > > > > >> > In my mind there is a distinction between the "compute
> > > > > > > >> > domain" (e.g. a pandas dataframe or something like ibis or
> > > > > > > >> > SQL) and the "data domain" (e.g. pyarrow datasets). I
> > > > > > > >> > think, in a perfect world, you could push any and all
> > > > > > > >> > compute up and down the chain as far as possible. However,
> > > > > > > >> > in practice, I think there is a healthy set of tools and
> > > > > > > >> > libraries that say "simple column projection and filtering
> > > > > > > >> > is good enough". I would argue that there is room for both
> > > > > > > >> > APIs, and while the temptation is always present to "shove
> > > > > > > >> > as much compute as you can", I think pyarrow datasets seem
> > > > > > > >> > to have found a balance between the two that users like.
> > > > > > > >> >
> > > > > > > >> > So I would argue that this protocol may never become a
> > > > > > > >> > general-purpose unmaterialized dataframe and that isn't
> > > > > > > >> > necessarily a bad thing.
> > > > > > > >> >
> > > > > > > >> > > they are splittable and serializable, so that fragments
> > > > > > > >> > > can be distributed amongst processes / workers.
> > > > > > > >> >
> > > > > > > >> > Just to clarify, the proposal currently only requires the
> > > > > > > >> > fragments to be serializable, correct?
> > > > > > > >> >
> > > > > > > >> > On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > Thanks Ian for your extensive feedback.
> > > > > > > >> > >
> > > > > > > >> > > > I strongly agree with the comments made by David,
> > > > > > > >> > > > Weston, and Dewey arguing that we should avoid any use
> > > > > > > >> > > > of PyArrow expressions in this API. Expressions are an
> > > > > > > >> > > > implementation detail of PyArrow, not a part of the
> > > > > > > >> > > > Arrow standard. It would be much safer for the initial
> > > > > > > >> > > > version of this protocol to not define *any*
> > > > > > > >> > > > methods/arguments that take expressions.
> > > > > > > >> > >
> > > > > > > >> > > I would agree with this point if we were starting from
> > > > > > > >> > > scratch. But one of my goals is for this protocol to be
> > > > > > > >> > > descriptive of the existing dataset integrations in the
> > > > > > > >> > > ecosystem, which all currently rely on PyArrow
> > > > > > > >> > > expressions. For example, you'll notice in the PR that
> > > > > > > >> > > there are unit tests to verify the current PyArrow
> > > > > > > >> > > Dataset classes conform to this protocol, without
> > > > > > > >> > > changes.
> > > > > > > >> > >
> > > > > > > >> > > I think there are three routes we can go here:
> > > > > > > >> > >
> > > > > > > >> > > 1. We keep PyArrow expressions in the API initially, but
> > > > > > > >> > > once we have Substrait-based alternatives we deprecate
> > > > > > > >> > > the PyArrow expression support. This is what I intended
> > > > > > > >> > > with the current design, and I think it provides the
> > > > > > > >> > > most obvious migration paths for existing producers and
> > > > > > > >> > > consumers.
> > > > > > > >> > > 2. We keep the overall dataset API, but don't introduce
> > > > > > > >> > > the filter and projection arguments until we have
> > > > > > > >> > > Substrait support. I'm not sure what the migration path
> > > > > > > >> > > looks like for producers and consumers, but I think this
> > > > > > > >> > > just implicitly becomes the same as (1), but with worse
> > > > > > > >> > > documentation.
> > > > > > > >> > > 3. We write a protocol completely from scratch, that
> > > > > > > >> > > doesn't try to describe the existing dataset API.
> > > > > > > >> > > Producers and consumers would then migrate to use the
> > > > > > > >> > > new protocol and deprecate their existing dataset
> > > > > > > >> > > integrations. We could introduce a dunder method in that
> > > > > > > >> > > API (sort of like __arrow_array__) that would make the
> > > > > > > >> > > migration seamless from the end-user perspective.
> > > > > > > >> > >
> > > > > > > >> > > *Which do you all think is the best path forward?*
> > > > > > > >> > >
> > > > > > > >> > > > Another concern I have is that we have not fully
> > > > > > > >> > > > explained why we want to use Dataset instead of
> > > > > > > >> > > > RecordBatchReader [9] as the basis of this protocol. I
> > > > > > > >> > > > would like to see an explanation of why
> > > > > > > >> > > > RecordBatchReader is not sufficient for this.
> > > > > > > >> > > > RecordBatchReader seems like another possible way to
> > > > > > > >> > > > represent "unmaterialized dataframes" and there are
> > > > > > > >> > > > some parallels between RecordBatch/RecordBatchReader
> > > > > > > >> > > > and Fragment/Dataset.
> > > > > > > >> > >
> > > > > > > >> > > This is a good point. I can add a section describing the
> > > > > > > >> > > differences. The main ones I can think of are that: (1)
> > > > > > > >> > > Datasets are "pruneable": one can select a subset of
> > > > > > > >> > > columns and apply a filter on rows to avoid IO; and (2)
> > > > > > > >> > > they are splittable and serializable, so that fragments
> > > > > > > >> > > can be distributed amongst processes / workers.
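> > > > > > > >> > >
> > > > > > > >> > > Both properties are visible in today's PyArrow API; a
> > > > > > > >> > > minimal sketch (the path is illustrative):
> > > > > > > >> > >
> > > > > > > >> > >     import pickle
> > > > > > > >> > >     import pyarrow.compute as pc
> > > > > > > >> > >     import pyarrow.dataset as ds
> > > > > > > >> > >
> > > > > > > >> > >     dataset = ds.dataset("data/", format="parquet")
> > > > > > > >> > >     # (1) prune columns and rows before any IO happens
> > > > > > > >> > >     scanner = dataset.scanner(
> > > > > > > >> > >         columns=["a"], filter=pc.field("b") > 0)
> > > > > > > >> > >     # (2) fragments can be pickled and shipped to workers
> > > > > > > >> > >     fragment = next(iter(dataset.get_fragments()))
> > > > > > > >> > >     payload = pickle.dumps(fragment)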
> > > > > > > >> > >
> > > > > > > >> > > Best,
> > > > > > > >> > >
> > > > > > > >> > > Will Jones
> > > > > > > >> > >
> > > > > > > >> > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Thanks Will for this proposal!
> > > > > > > >> > > >
> > > > > > > >> > > > For anyone familiar with PyArrow, this idea has a
> > > > > > > >> > > > clear intuitive logic to it. It provides an expedient
> > > > > > > >> > > > solution to the current lack of a practical means for
> > > > > > > >> > > > interchanging "unmaterialized dataframes" between
> > > > > > > >> > > > different Python libraries.
> > > > > > > >> > > >
> > > > > > > >> > > > To elaborate on that: If you look at how people use
> > > > > > > >> > > > the Arrow Dataset API, which is implemented in the
> > > > > > > >> > > > Arrow C++ library [1] and has bindings not just for
> > > > > > > >> > > > Python [2] but also for Java [3] and R [4], you'll see
> > > > > > > >> > > > that Dataset is often used simply as a "virtual"
> > > > > > > >> > > > variant of Table. It is used in cases when the data is
> > > > > > > >> > > > larger than memory or when it is desirable to defer
> > > > > > > >> > > > reading (materializing) the data into memory.
> > > > > > > >> > > >
> > > > > > > >> > > > So we can think of a Table as a materialized dataframe
> > > > > > > >> > > > and a Dataset as an unmaterialized dataframe. That
> > > > > > > >> > > > aspect of Dataset is I think what makes it most
> > > > > > > >> > > > attractive as a protocol for enabling
> > > > > > > >> > > > interoperability: it allows libraries to easily "speak
> > > > > > > >> > > > Arrow" in cases where materializing the full data in
> > > > > > > >> > > > memory upfront is impossible or undesirable.
> > > > > > > >> > > >
> > > > > > > >> > > > The trouble is that Dataset was not designed to serve
> > > > > > > >> > > > as a general-purpose unmaterialized dataframe. For
> > > > > > > >> > > > example, the PyArrow Dataset constructor [5] exposes
> > > > > > > >> > > > options for specifying a list of source files and a
> > > > > > > >> > > > partitioning scheme, which are irrelevant for many of
> > > > > > > >> > > > the applications that Will anticipates. And some work
> > > > > > > >> > > > is needed to reconcile the methods of the PyArrow
> > > > > > > >> > > > Dataset object [6] with the methods of the Table
> > > > > > > >> > > > object. Some methods like filter() are exposed by both
> > > > > > > >> > > > and behave lazily on Datasets and eagerly on Tables,
> > > > > > > >> > > > as a user might expect. But many other Table methods
> > > > > > > >> > > > are not implemented for Dataset though they
> > > > > > > >> > > > potentially could be, and it is unclear where we
> > > > > > > >> > > > should draw the line between adding methods to Dataset
> > > > > > > >> > > > vs. encouraging new scanner implementations to expose
> > > > > > > >> > > > options controlling what lazy operations should be
> > > > > > > >> > > > performed as they see fit.
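> > > > > > > >> > > >
> > > > > > > >> > > > A small illustration of that lazy/eager split (the
> > > > > > > >> > > > dataset path is illustrative):
> > > > > > > >> > > >
> > > > > > > >> > > >     import pyarrow as pa
> > > > > > > >> > > >     import pyarrow.compute as pc
> > > > > > > >> > > >     import pyarrow.dataset as ds
> > > > > > > >> > > >
> > > > > > > >> > > >     table = pa.table({"x": [1, 2, 3]})
> > > > > > > >> > > >     # eager: materializes a new Table right away
> > > > > > > >> > > >     small = table.filter(pc.field("x") > 1)
> > > > > > > >> > > >
> > > > > > > >> > > >     dataset = ds.dataset("data/", format="parquet")
> > > > > > > >> > > >     # lazy: returns a new Dataset; no IO until a scan
> > > > > > > >> > > >     filtered = dataset.filter(pc.field("x") > 1)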
> > > > > > > >> > > >
> > > > > > > >> > > > Will, I see that you've already addressed this issue
> > > > > > > >> > > > to some extent in your proposal. For example, you
> > > > > > > >> > > > mention that we should initially define this protocol
> > > > > > > >> > > > to include only a minimal subset of the Dataset API. I
> > > > > > > >> > > > agree, but I think there are some loose ends we should
> > > > > > > >> > > > be careful to tie up. I strongly agree with the
> > > > > > > >> > > > comments made by David, Weston, and Dewey arguing that
> > > > > > > >> > > > we should avoid any use of PyArrow expressions in this
> > > > > > > >> > > > API. Expressions are an implementation detail of
> > > > > > > >> > > > PyArrow, not a part of the Arrow standard. It would be
> > > > > > > >> > > > much safer for the initial version of this protocol to
> > > > > > > >> > > > not define *any* methods/arguments that take
> > > > > > > >> > > > expressions. This will allow us to take some more time
> > > > > > > >> > > > to finish up the Substrait expression implementation
> > > > > > > >> > > > work that is underway [7][8], then introduce
> > > > > > > >> > > > Substrait-based expressions in a later version of this
> > > > > > > >> > > > protocol. This approach will better position this
> > > > > > > >> > > > protocol to be implemented in other languages besides
> > > > > > > >> > > > Python.
> > > > > > > >> > > >
> > > > > > > >> > > > Another concern I have is that we have not fully
> > > > > > > >> > > > explained why we want to use Dataset instead of
> > > > > > > >> > > > RecordBatchReader [9] as the basis of this protocol. I
> > > > > > > >> > > > would like to see an explanation of why
> > > > > > > >> > > > RecordBatchReader is not sufficient for this.
> > > > > > > >> > > > RecordBatchReader seems like another possible way to
> > > > > > > >> > > > represent "unmaterialized dataframes" and there are
> > > > > > > >> > > > some parallels between RecordBatch/RecordBatchReader
> > > > > > > >> > > > and Fragment/Dataset. We should help developers and
> > > > > > > >> > > > users understand why Arrow needs both of these.
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks Will for your thoughtful prose explanations
> > > > > > > >> > > > about this proposed API. After we arrive at a decision
> > > > > > > >> > > > about this, I think we should reproduce some of these
> > > > > > > >> > > > explanations in docs, blog posts, cookbook recipes,
> > > > > > > >> > > > etc. because there is some important nuance here that
> > > > > > > >> > > > will be important for integrators of this API to
> > > > > > > >> > > > understand.
> > > > > > > >> > > >
> > > > > > > >> > > > Ian
> > > > > > > >> > > >
> > > > > > > >> > > > [1] https://arrow.apache.org/docs/cpp/api/dataset.html
> > > > > > > >> > > > [2] https://arrow.apache.org/docs/python/dataset.html
> > > > > > > >> > > > [3] https://arrow.apache.org/docs/java/dataset.html
> > > > > > > >> > > > [4] https://arrow.apache.org/docs/r/articles/dataset.html
> > > > > > > >> > > > [5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
> > > > > > > >> > > > [6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
> > > > > > > >> > > > [7] https://github.com/apache/arrow/issues/33985
> > > > > > > >> > > > [8] https://github.com/apache/arrow/issues/34252
> > > > > > > >> > > > [9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > Hello Arrow devs,
> > > > > > > >> > > > >
> > > > > > > >> > > > > I have drafted a PR defining an experimental
> > > > > > > >> > > > > protocol which would allow third-party libraries to
> > > > > > > >> > > > > imitate the PyArrow Dataset API [5]. This protocol
> > > > > > > >> > > > > is intended to endorse an integration pattern that
> > > > > > > >> > > > > is starting to be used in the Python ecosystem,
> > > > > > > >> > > > > where some libraries are providing their own
> > > > > > > >> > > > > scanners with this API, while query engines are
> > > > > > > >> > > > > accepting these as duck-typed objects.
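> > > > > > > >> > > > >
> > > > > > > >> > > > > In rough outline, the duck-typed shape looks like
> > > > > > > >> > > > > this (a hypothetical sketch, not the exact protocol
> > > > > > > >> > > > > in the PR):
> > > > > > > >> > > > >
> > > > > > > >> > > > >     import pyarrow as pa
> > > > > > > >> > > > >
> > > > > > > >> > > > >     class MyDataset:
> > > > > > > >> > > > >         @property
> > > > > > > >> > > > >         def schema(self) -> pa.Schema: ...
> > > > > > > >> > > > >
> > > > > > > >> > > > >         def scanner(self, columns=None, filter=None,
> > > > > > > >> > > > >                     **kwargs) -> "MyScanner": ...
> > > > > > > >> > > > >
> > > > > > > >> > > > >     class MyScanner:
> > > > > > > >> > > > >         # engines consume the scan as a C stream
> > > > > > > >> > > > >         def to_reader(self) -> pa.RecordBatchReader: ...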
> > > > > > > >> > > > >
> > > > > > > >> > > > > To give some background: back at the end of 2021,
> > > > > > > >> > > > > we collaborated with DuckDB to be able to read
> > > > > > > >> > > > > datasets (an Arrow C++ concept), supporting column
> > > > > > > >> > > > > selection and filter pushdown [1]. This was
> > > > > > > >> > > > > accomplished by having DuckDB manipulate Python (or
> > > > > > > >> > > > > R) objects to get a RecordBatchReader and then
> > > > > > > >> > > > > export it over the C Stream Interface.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Since then, DataFusion [2] and Polars have both made
> > > > > > > >> > > > > similar implementations for their Python bindings,
> > > > > > > >> > > > > allowing them to consume PyArrow datasets. This has
> > > > > > > >> > > > > created an implicit protocol, whereby arbitrary
> > > > > > > >> > > > > compute engines can push down queries into the
> > > > > > > >> > > > > PyArrow dataset scanner.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Now, libraries supporting table formats including
> > > > > > > >> > > > > Delta Lake, Lance, and Iceberg are looking to be
> > > > > > > >> > > > > able to support these engines, while bringing their
> > > > > > > >> > > > > own scanners and metadata handling implementations.
> > > > > > > >> > > > > One possible route is allowing them to imitate the
> > > > > > > >> > > > > PyArrow datasets API.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Bringing these use cases together, I'd like to
> > > > > > > >> > > > > propose an experimental protocol, made out of the
> > > > > > > >> > > > > minimal subset of the PyArrow Dataset API necessary
> > > > > > > >> > > > > to facilitate this kind of integration. This would
> > > > > > > >> > > > > allow any library to produce a scanner
> > > > > > > >> > > > > implementation that arbitrary query engines could
> > > > > > > >> > > > > call into. I've drafted a PR [3] and there is some
> > > > > > > >> > > > > background research available in a Google doc [4].
> > > > > > > >> > > > >
> > > > > > > >> > > > > I've already gotten some good feedback on both, and
> > > > > > > >> > > > > would welcome more.
> > > > > > > >> > > > >
> > > > > > > >> > > > > One last point: I'd like for this to be a first step
> > > > > > > >> > > > > rather than a comprehensive API. This PR focuses on
> > > > > > > >> > > > > making explicit a protocol that is already in use in
> > > > > > > >> > > > > the ecosystem, but without much concrete definition.
> > > > > > > >> > > > > Once this is established, we can use our experience
> > > > > > > >> > > > > from this protocol to design something more
> > > > > > > >> > > > > permanent that takes advantage of newer innovations
> > > > > > > >> > > > > in the Arrow ecosystem (such as the PyCapsule for
> > > > > > > >> > > > > the C Data Interface or Substrait for passing
> > > > > > > >> > > > > expressions / scan plans). I am tracking such future
> > > > > > > >> > > > > improvements in [5].
> > > > > > > >> > > > >
> > > > > > > >> > > > > Best,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Will Jones
> > > > > > > >> > > > >
> > > > > > > >> > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > >> > > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > > > > > > >> > > > > [3] https://github.com/apache/arrow/pull/35568
> > > > > > > >> > > > > [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > > > > > > >> > > > > [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit
