>
> I do not know if we voted on a naming convention, but we may want to
> reserve a namespace for us (e.g. "arrow").

+1 to calling out in docs that the arrow namespace should be reserved.
Maybe "apache.arrow" to lower the possibility of collisions with people who
already have extension types? (I don't feel too strongly about this.)
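
To make the reservation concrete, here is a minimal sketch of the check an
implementation could run when a user registers an extension type. The helper
name and the exact prefix are assumptions for illustration only; nothing here
has been voted on.

```python
# Hypothetical sketch: reject user-registered extension names that fall
# inside a namespace reserved for the Arrow project. The prefix is an
# assumption ("apache.arrow." is the alternative floated above).
RESERVED_PREFIXES = ("arrow.",)


def is_reserved(name, prefixes=RESERVED_PREFIXES):
    """Return True if `name` starts with one of the reserved prefixes."""
    # str.startswith accepts a tuple of candidate prefixes.
    return name.startswith(prefixes)


def check_user_extension_name(name):
    """Raise ValueError if a third-party name collides with the reservation."""
    if is_reserved(name):
        raise ValueError("extension name %r uses a reserved namespace" % name)
    return name
```

Matching on the trailing dot keeps a name such as "arrowish.tensor" from being
treated as reserved.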

> Note that we do not have tests on tensor arrays, so testing the extension
> type on these may be hindered by divergences between implementations. I do
> not think we even have json integration files for them.

Agreed, we'll likely need a little more thought on what it means to validate
extension types (is being able to parse the extension metadata sufficient?).
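
As a strawman for the weakest form of validation (parsing only), something
like the sketch below would do. It assumes the metadata value ends up being
JSON, which is still open, and the function name is hypothetical. The two
"ARROW:extension:*" keys are the ones from the example quoted further down.

```python
import json

# Field-level custom metadata keys used in the example quoted below.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def parse_extension_metadata(custom_metadata):
    """Return (name, params) from a field's custom metadata dict.

    Assumes the metadata value is a JSON object; raises ValueError otherwise.
    """
    name = custom_metadata.get(EXT_NAME_KEY)
    if not name:
        raise ValueError("missing " + EXT_NAME_KEY)
    raw = custom_metadata.get(EXT_METADATA_KEY, "")
    try:
        # An absent or empty value is treated as "no parameters".
        params = json.loads(raw) if raw else {}
    except json.JSONDecodeError as exc:
        raise ValueError("metadata is not valid JSON: %s" % exc) from exc
    if not isinstance(params, dict):
        raise ValueError("metadata must be a JSON object")
    return name, params


# Illustrative field metadata, using the parameters from the tensor example.
field_metadata = {
    EXT_NAME_KEY: "arrow.tensor",
    EXT_METADATA_KEY: json.dumps(
        {"type": "int64", "shape": [3, 3, 4], "strides": [12, 4, 1]}
    ),
}
name, params = parse_extension_metadata(field_metadata)
```

Note that a single-quoted value like the one in the example below would fail
this check, since JSON requires double-quoted strings. Parsing alone also says
nothing about whether the storage array matches the declared shape, which is
the part that probably needs the extra thought.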

> Also, note that Rust's arrow2 supports extension types (tested part of the
> IPC and c data interface*), and Polars relies on it to allow Python generic
> "object" in its machinery.

I think this is great for external verification of the specifications,
but for official Arrow types we should be focusing on
implementations that are under ASF governance.

On Tue, Feb 8, 2022 at 8:32 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Note that we do not have tests on tensor arrays, so testing the extension
> type on these may be hindered by divergences between implementations. I do
> not think we even have json integration files for them.
>
> If the focus is extension types, maybe it would be best to cover types
> whose physical representations are covered in e.g. IPC or c data interface
> tests.
>
> I do not know if we voted on a naming convention, but we may want to
> reserve a namespace for us (e.g. "arrow").
>
> Also, note that Rust's arrow2 supports extension types (tested part of the
> IPC and c data interface*), and Polars relies on it to allow Python generic
> "object" in its machinery.
>
> Best,
> Jorge
>
> * pending https://issues.apache.org/jira/browse/ARROW-15613
>
>
>
> On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > On Mon, 7 Feb 2022 at 21:02, Rok Mihevc <rok.mih...@gmail.com> wrote:
> >
> > > To follow up the discussion from the bi-weekly Arrow sync:
> > >
> > > - JSON seems the most suitable candidate for the extension metadata.
> > > E.g.: TensorArray
> > > {"key": "ARROW:extension:name", "value": "tensor<type=int64, shape=(3,
> > > 3, 4), strides=(12, 4, 1)>"},
> > > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> > >
> >
> > I will start a separate thread for the exact encoding of the metadata value
> > (i.e. JSON or something else) if that's OK. I already started writing one
> > last week anyway, and that keeps things a bit separated.
> >
> > For the name of the extension type:
> > - We might want to use something like "arrow.tensor" to follow the
> > recommendation at
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types to use a
> > namespace. And so for "well known" extension types that are defined in the
> > Arrow project itself, I think we can use the "arrow" namespace? (as
> > example, for the extension types defined in pandas, I used the "pandas."
> > namespace)
> > - In general, I think it's best to keep the name itself simple, and leave
> > any parametrization out of it (since this is included in the metadata). So
> > in this case that would be just "tensor" instead of "tensor<type=...,
> > shape=..., ..>".
> > - Specifically for this extension type, we might want to use something like
> > "fixed_size_tensor" instead of "tensor", to be able to differentiate in the
> > future between the tensor type with constant shape vs variable shape
> > (ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614> vs
> > ARROW-8714 <https://issues.apache.org/jira/browse/ARROW-8714>). But that's
> > something to discuss in the relevant JIRA issue / PR.
> >
> > > - We want to start with at least one integration test pair. Potential
> > > candidates are cpp, julia, go, rust.
> > >
> >
> > Rust does not yet seem to support extension types? (
> > https://github.com/apache/arrow-rs/issues/218)
> >
> >
> > > - First well known extension type candidate is TensorArray but other
> > > suggestions are welcome.
> > >
> >
> > Others that I am aware of that have been brought up in the past are UUID
> > (ARROW-2152 <https://issues.apache.org/jira/browse/ARROW-2152>), complex
> > numbers (ARROW-638 <https://issues.apache.org/jira/browse/ARROW-638>, this
> > has a PR) and 8-bit boolean values (ARROW-1674
> > <https://issues.apache.org/jira/browse/ARROW-1674>). But I think we should
> > mainly look at demand / someone wanting to implement this, and (for you)
> > this seems to be Tensors, so it's fine to focus on that.
> >
> > Joris
> >
> >
> > >
> > > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > >
> > > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > > >>
> > > > >> Thanks for the input Weston!
> > > > >>
> > > > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > > > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > > > > >> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
> > > > >>
> > > > >> Having machine readable definitions could perhaps be useful for
> > > > >> generating implementations in some cases.
> > > > >
> > > > > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > > > > Weston just below, I think this will mostly contain a *description* of
> > > > > > those different aspect (a specification of the extension type), and
> > > > > > there is no data that actually fits in a flatbuffer table? In that
> > > > > > case a plain text (eg markdown) file seems more fitting?
> > > >
> > > > > I agree this is mostly a plain text (or, rather, reST :-)) specification
> > > > > task.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
> >
>
