On Mon, 7 Feb 2022 at 21:02, Rok Mihevc <rok.mih...@gmail.com> wrote:

> To follow up the discussion from the bi-weekly Arrow sync:
>
> - JSON seems the most suitable candidate for the extension metadata.
> E.g.: TensorArray
> {"key": "ARROW:extension:name", "value": "tensor<type=int64, shape=(3,
> 3, 4), strides=(12, 4, 1)>"},
> {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
>

I will start a separate thread for the exact encoding of the metadata value
(i.e. JSON or something else) if that's OK. I already started writing one
last week anyway, and that keeps things a bit separated.

For the name of the extension type:
- We might want to use something like "arrow.tensor" to follow the
recommendation at
https://arrow.apache.org/docs/format/Columnar.html#extension-types to use a
namespace. And so for "well known" extension types that are defined in the
Arrow project itself, I think we can use the "arrow" namespace? (As an
example, for the extension types defined in pandas, I used the "pandas."
namespace.)
- In general, I think it's best to keep the name itself simple, and leave
any parametrization out of it (since this is included in the metadata). So
in this case that would be just "tensor" instead of "tensor<type=...,
shape=..., ..>".
- Specifically for this extension type, we might want to use something like
"fixed_size_tensor" instead of "tensor", to be able to differentiate in the
future between the tensor type with constant shape vs variable shape (
ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614> vs ARROW-8714
<https://issues.apache.org/jira/browse/ARROW-8714>). But that's something
to discuss in the relevant JIRA issue / PR.
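
To make the naming points above concrete, here is a minimal sketch in plain Python (standard library only). It keeps the extension name simple and namespaced, and puts all parametrization in the metadata value. The name "arrow.fixed_size_tensor", the helper functions, and the use of JSON are assumptions for illustration; neither the name nor the metadata encoding has been decided in this thread.

```python
import json

# Hypothetical name: "arrow." namespace plus a plain type name, with no
# parameters embedded in it. Not a finalized specification.
EXTENSION_NAME = "arrow.fixed_size_tensor"


def build_extension_entries(value_type, shape, strides):
    """Build the two custom-metadata entries for a field (illustrative helper)."""
    # All parametrization lives in the metadata value, here serialized as JSON
    # (an assumption; the encoding is still to be discussed).
    params = {"type": value_type, "shape": list(shape), "strides": list(strides)}
    return [
        {"key": "ARROW:extension:name", "value": EXTENSION_NAME},
        {"key": "ARROW:extension:metadata", "value": json.dumps(params)},
    ]


def parse_extension_entries(entries):
    """Recover the extension name and its parameters from the entries."""
    lookup = {entry["key"]: entry["value"] for entry in entries}
    name = lookup["ARROW:extension:name"]
    params = json.loads(lookup["ARROW:extension:metadata"])
    return name, params


entries = build_extension_entries("int64", (3, 3, 4), (12, 4, 1))
name, params = parse_extension_entries(entries)
print(name)
print(params)
```

With this split, a consumer can dispatch on the name with a plain string comparison and only needs to parse the metadata once it has recognized the type.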

> - We want to start with at least one integration test pair. Potential
> candidates are cpp, julia, go, rust.
>

Rust does not yet seem to support extension types? (
https://github.com/apache/arrow-rs/issues/218)


> - First well known extension type candidate is TensorArray but other
> suggestions are welcome.
>

Others that I am aware of that have been brought up in the past are UUID (
ARROW-2152 <https://issues.apache.org/jira/browse/ARROW-2152>), complex
numbers (ARROW-638 <https://issues.apache.org/jira/browse/ARROW-638>, this
has a PR) and 8-bit boolean values (ARROW-1674
<https://issues.apache.org/jira/browse/ARROW-1674>). But I think we should
mainly look at demand / someone wanting to implement this, and (for you)
this seems to be Tensors, so it's fine to focus on that.

Joris


>
> On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >
> > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > >>
> > >> Thanks for the input Weston!
> > >>
> > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > >> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
> > >>
> > >> Having machine readable definitions could perhaps be useful for
> > >> generating implementations in some cases.
> > >
> > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > Weston just below, I think this will mostly contain a *description* of
> > > those different aspects (a specification of the extension type), and
> > > there is no data that actually fits in a flatbuffer table? In that
> > > case a plain text (eg markdown) file seems more fitting?
> >
> > I agree this is mostly a plain text (or, rather, reST :-)) specification
> > task.
> >
> > Regards
> >
> > Antoine.
>
