Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-09-01 Thread Will Jones
Thanks for pointing that out, Dane. I think that seems like an obvious
choice for Dask to be able to consume this protocol.

On Fri, Sep 1, 2023 at 10:13 AM Dane Pitkin wrote:

> The Python Substrait package[1] is on PyPI[2] and currently has Python
> wrappers for the Substrait protobuf objects. I think this will be a great
> opportunity to identify helper features that users of this protocol would
> like to see. I'll be keeping an eye out as this develops, but also feel
> free to file feature requests in the project!
>
>
> [1]https://github.com/substrait-io/substrait-python
> [2]https://pypi.org/project/substrait/
>
>
> On Thu, Aug 31, 2023 at 10:05 PM Will Jones wrote:
>
> > Hello Arrow devs,
> >
> > We discussed this further in the Arrow community call on 2023-08-30 [1],
> > and concluded we should create an entirely new protocol that uses Substrait
> > expressions. I have created an issue [2] to track this and will start a PR
> > soon.
> >
> > It does look like we might block this on creating a PyCapsule based
> > protocol for arrays, schemas, and streams. That is tracked here [3].
> > Hopefully that isn't too ambitious :)
> >
> > Best,
> >
> > Will Jones
> >
> >
> > [1]
> > https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> > [2] https://github.com/apache/arrow/issues/37504
> > [3] https://github.com/apache/arrow/issues/35531
> >
> >
> > On Tue, Aug 29, 2023 at 2:59 PM Ian Cook  wrote:
> >
> > > An update about this:
> > >
> > > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last
> > > week. This makes it possible to convert PyArrow expressions to/from
> > > Substrait expressions.
> > >
> > > As Fokko previously noted, the PR does not change the PyArrow Dataset
> > > interface at all. It simply enables a Substrait expression to be
> > > converted to a PyArrow expression, which can then be used to
> > > filter/project a Dataset.
> > >
> > > There is a basic example here demonstrating this:
> > > https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
> > >
> > > We might now consider whether to build upon this to create a Dataset
> > > protocol that is independent of the PyArrow Expression implementation
> > > and that could interoperate across languages.
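One hypothetical shape for such a protocol, expressed as a Python structural type (every name below is illustrative, not a proposed spec):

```python
from typing import Any, Iterator, Optional, Protocol, runtime_checkable


@runtime_checkable
class Scanner(Protocol):
    """Hypothetical: produces a stream of Arrow record batches."""

    def to_batches(self) -> Iterator[Any]: ...


@runtime_checkable
class Dataset(Protocol):
    """Hypothetical duck-typed dataset: any object with this shape
    would satisfy the protocol, regardless of its implementation."""

    @property
    def schema(self) -> Any: ...

    def scanner(
        self,
        columns: Optional[list] = None,
        # e.g. a serialized Substrait expression, not a PyArrow Expression
        filter: Optional[bytes] = None,
    ) -> Scanner: ...
```

A duck-typed protocol like this is what would let non-PyArrow datasets (PyIceberg, Delta Lake, etc.) be consumed by engines such as DuckDB without depending on PyArrow internals.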
> > >
> > > Ian
> > >
> > > On Mon, Jul 3, 2023 at 5:48 PM Will Jones wrote:
> > > >
> > > > Hello,
> > > >
> > > > After thinking about it, I think I understand the approach David Li and
> > > > Ian are suggesting with respect to expressions. There will be some
> > > > arguments that only PyArrow's own datasets support, but that aren't in
> > > > the generic protocol. Passing PyArrow expressions to the filters
> > > > argument should be considered one of those. DuckDB and others are
> > > > currently passing them down, so they aren't yet using the protocol
> > > > properly. But once we add support in the protocol for passing filters
> > > > via Substrait expressions, we'll move DuckDB and others over to be
> > > > fully compliant with the protocol.
> > > >
> > > > It's a bit of an awkward temporary state for now, but so would having
> > > > PyArrow expressions in the protocol just to be deprecated in a few
> > > > months. One caveat is that we'll need to provide DuckDB and other
> > > > consumers with a way to tell whether the dataset supports passing
> > > > filters as Substrait expressions or PyArrow ones, since I doubt they'll
> > > > want to lose support for integrating with older PyArrow versions.
> > > >
> > > > I've removed filters from the protocol for now, with the intention of
> > > > bringing them back as soon as we can get Substrait support. I think we
> > > > can do this in the 14.0.0 release.
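A consumer-side capability check could be as simple as sniffing for a marker attribute. This is purely an illustrative sketch (the attribute name and helpers are invented, not part of any proposal):

```python
def supports_substrait_filters(dataset) -> bool:
    """Hypothetical capability check: sniff for a protocol marker before
    deciding which flavor of filter to pass down."""
    return bool(getattr(dataset, "__arrow_dataset_substrait_filters__", False))


def choose_filter(dataset, substrait_filter, pyarrow_filter):
    # Fall back to legacy PyArrow expressions for older producers.
    if supports_substrait_filters(dataset):
        return substrait_filter
    return pyarrow_filter
```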
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > >
> > > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > Chiming in here from the PyIceberg side. I would love to see the
> > > > > protocol as proposed in the PR. I did a small test
> > > > > <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>,
> > > > > and it seems to be quite straightforward to implement and it brings a
> > > > > lot of potential. Unsurprisingly, I am leaning toward the first option:
> > > > >
> > > > > > 1. We keep PyArrow expressions in the API initially, but once we have
> > > > > > Substrait-based alternatives we deprecate the PyArrow expression
> > > > > > support. This is what I intended with the current design, and I think
> > > > > > it provides the most obvious migration paths for existing producers
> > > > > > and consumers.
> > > > >
> > > > > Let me give my vision on some of the concerns raised.
> > > > >
> > > > > > Will, I see that you've already addressed this issue to some extent
> > > > > > in your proposal. For example, you mention that we should initially
> > > > > > define th

Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-01 Thread Dewey Dunnington
Thank you for proposing this! I left a comment on the PR as well, but
I'm excited for this to standardize a few concepts that I have run
into whilst working on ADBC and GeoArrow:

- Properly returning an array with >1 dimension from the PostgreSQL ADBC driver
- As the basis for encoding raster tiles as rows in a table (e.g.,
http://www.geopackage.org/spec/#_tile_matrix_introduction )

Excited to see the PR progress!

-dewey

On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc  wrote:
>
> Hey all!
>
>
> Besides the recently added FixedShapeTensor [1] canonical extension type,
> there appears to be a need for the already proposed VariableShapeTensor [2].
> VariableShapeTensor would store tensors of variable shapes but a uniform
> number of dimensions, dimension names, and dimension permutations.
>
> There are examples of such types: Ray implements
> ArrowVariableShapedTensorType [3] and PyTorch implements torch.nested [4].
>
> I propose we discuss adding the below text to
> format/CanonicalExtensions.rst to read as [5] and a C++/Python
> implementation as proposed in [6]. A vote can be called after a discussion
> here.
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>     Data type of the list elements is uniform across the entire column
>     and also provided in metadata.
>   * **shape** is a ``FixedSizeList`` of the tensor shape where
>     the size of the list is equal to the number of dimensions of the
>     tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>   * **ndim** = the number of dimensions of the tensor.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names of the tensor dimensions,
>     as an array. Its length should be equal to the shape
>     length and equal to the number of dimensions.
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>   * **permutation** = indices of the desired ordering of the
>     original dimensions, defined as an array.
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical
>     tensor).
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>     When the logical and physical layouts are equal, the permutation will
>     always be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including the number of
>   dimensions of the contained tensors as an integer with key **"ndim"**,
>   plus optional dimension names with key **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example: ``{ "ndim": 2}``
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>     ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}``
>   - Example of a permuted 3-dimensional tensor:
>     ``{ "ndim": 3, "permutation": [2, 0, 1]}``
>
>     This permutation describes the physical layout; given an individual
>     tensor with physical shape [100, 200, 500], the shape of the logical
>     layout would be ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> [1] https://github.com/apache/arrow/issues/33924
>
> [2] https://github.com/apache/arrow/issues/24868
>
> [3]
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
>
> [4] https://pytorch.org/docs/stable/nested.html
>
> [5]
> https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor
>
> [6] https://github.com/apache/arrow/pull/37166
>
>
>
> Best,
>
> Rok


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-09-01 Thread Dane Pitkin
The Python Substrait package[1] is on PyPI[2] and currently has Python
wrappers for the Substrait protobuf objects. I think this will be a great
opportunity to identify helper features that users of this protocol would
like to see. I'll be keeping an eye out as this develops, but also feel
free to file feature requests in the project!


[1]https://github.com/substrait-io/substrait-python
[2]https://pypi.org/project/substrait/


On Thu, Aug 31, 2023 at 10:05 PM Will Jones  wrote:

> Hello Arrow devs,
>
> We discussed this further in the Arrow community call on 2023-08-30 [1],
> and concluded we should create an entirely new protocol that uses Substrait
> expressions. I have created an issue [2] to track this and will start a PR
> soon.
>
> It does look like we might block this on creating a PyCapsule based
> protocol for arrays, schemas, and streams. That is tracked here [3].
> Hopefully that isn't too ambitious :)
>
> Best,
>
> Will Jones
>
>
> [1]
> https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> [2] https://github.com/apache/arrow/issues/37504
> [3] https://github.com/apache/arrow/issues/35531
>
>
> On Tue, Aug 29, 2023 at 2:59 PM Ian Cook  wrote:
>
> > An update about this:
> >
> > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last
> > week. This makes it possible to convert PyArrow expressions to/from
> > Substrait expressions.
> >
> > As Fokko previously noted, the PR does not change the PyArrow Dataset
> > interface at all. It simply enables a Substrait expression to be
> > converted to a PyArrow expression, which can then be used to
> > filter/project a Dataset.
> >
> > There is a basic example here demonstrating this:
> > https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
> >
> > We might now consider whether to build upon this to create a Dataset
> > protocol that is independent of the PyArrow Expression implementation
> > and that could interoperate across languages.
> >
> > Ian
> >
> > On Mon, Jul 3, 2023 at 5:48 PM Will Jones wrote:
> > >
> > > Hello,
> > >
> > > After thinking about it, I think I understand the approach David Li and
> > > Ian are suggesting with respect to expressions. There will be some
> > > arguments that only PyArrow's own datasets support, but that aren't in
> > > the generic protocol. Passing PyArrow expressions to the filters argument
> > > should be considered one of those. DuckDB and others are currently
> > > passing them down, so they aren't yet using the protocol properly. But
> > > once we add support in the protocol for passing filters via Substrait
> > > expressions, we'll move DuckDB and others over to be fully compliant with
> > > the protocol.
> > >
> > > It's a bit of an awkward temporary state for now, but so would having
> > > PyArrow expressions in the protocol just to be deprecated in a few
> > > months. One caveat is that we'll need to provide DuckDB and other
> > > consumers with a way to tell whether the dataset supports passing filters
> > > as Substrait expressions or PyArrow ones, since I doubt they'll want to
> > > lose support for integrating with older PyArrow versions.
> > >
> > > I've removed filters from the protocol for now, with the intention of
> > > bringing them back as soon as we can get Substrait support. I think we
> > > can do this in the 14.0.0 release.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > >
> > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > Chiming in here from the PyIceberg side. I would love to see the
> > > > protocol as proposed in the PR. I did a small test
> > > > <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>,
> > > > and it seems to be quite straightforward to implement and it brings a
> > > > lot of potential. Unsurprisingly, I am leaning toward the first option:
> > > >
> > > > > 1. We keep PyArrow expressions in the API initially, but once we have
> > > > > Substrait-based alternatives we deprecate the PyArrow expression
> > > > > support. This is what I intended with the current design, and I think
> > > > > it provides the most obvious migration paths for existing producers
> > > > > and consumers.
> > > >
> > > > Let me give my vision on some of the concerns raised.
> > > >
> > > > > Will, I see that you've already addressed this issue to some extent in
> > > > > your proposal. For example, you mention that we should initially
> > > > > define this protocol to include only a minimal subset of the Dataset
> > > > > API. I agree, but I think there are some loose ends we should be
> > > > > careful to tie up. I strongly agree with the comments made by David,
> > > > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > > > > expressions in this API. Expressions are an implementation detail of
> > > > > PyArrow, not a part of the Arrow standard. It would be much safer fo