Note that we do not have tests on tensor arrays, so testing the extension type on these may be hindered by divergences between implementations. I do not think we even have JSON integration files for them.

If the focus is extension types, maybe it would be best to cover types whose physical representations are covered in e.g. IPC or C data interface tests.

I do not know if we voted on a naming convention, but we may want to reserve a namespace for us (e.g. "arrow").

Also, note that Rust's arrow2 supports extension types (tested as part of the IPC and C data interface*), and Polars relies on it to allow Python generic "object" in its machinery.

Best,
Jorge

* pending https://issues.apache.org/jira/browse/ARROW-15613

On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> On Mon, 7 Feb 2022 at 21:02, Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> > To follow up the discussion from the bi-weekly Arrow sync:
> >
> > - JSON seems the most suitable candidate for the extension metadata.
> > E.g.: TensorArray
> > {"key": "ARROW:extension:name", "value": "tensor<type=int64, shape=(3, 3, 4), strides=(12, 4, 1)>"},
> > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64', 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
>
> I will start a separate thread for the exact encoding of the metadata value
> (i.e. JSON or something else) if that's OK. I already started writing one
> last week anyway, and that keeps things a bit separated.
>
> For the name of the extension type:
> - We might want to use something like "arrow.tensor" to follow the
> recommendation at
> https://arrow.apache.org/docs/format/Columnar.html#extension-types to use a
> namespace. And so for "well known" extension types that are defined in the
> Arrow project itself, I think we can use the "arrow" namespace? (as
> example, for the extension types defined in pandas, I used the "pandas."
> namespace)
> - In general, I think it's best to keep the name itself simple, and leave
> any parametrization out of it (since this is included in the metadata). So
> in this case that would be just "tensor" instead of "tensor<type=...,
> shape=..., ..>".
> - Specifically for this extension type, we might want to use something like
> "fixed_size_tensor" instead of "tensor", to be able to differentiate in the
> future between the tensor type with constant shape vs variable shape
> (ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614> vs
> ARROW-8714 <https://issues.apache.org/jira/browse/ARROW-8714>). But that's
> something to discuss in the relevant JIRA issue / PR.
>
> > - We want to start with at least one integration test pair. Potential
> > candidates are cpp, julia, go, rust.
>
> Rust does not yet seem to support extension types?
> (https://github.com/apache/arrow-rs/issues/218)
>
> > - First well known extension type candidate is TensorArray but other
> > suggestions are welcome.
>
> Others that I am aware of that have been brought up in the past are UUID
> (ARROW-2152 <https://issues.apache.org/jira/browse/ARROW-2152>), complex
> numbers (ARROW-638 <https://issues.apache.org/jira/browse/ARROW-638>, this
> has a PR) and 8-bit boolean values (ARROW-1674
> <https://issues.apache.org/jira/browse/ARROW-1674>). But I think we should
> mainly look at demand / someone wanting to implement this, and (for you)
> this seems to be Tensors, so it's fine to focus on that.
>
> Joris
>
> > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > On 25/01/2022 at 10:12, Joris Van den Bossche wrote:
> > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > >>
> > > >> Thanks for the input Weston!
> > > >>
> > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > > >> loosely arrow/<IMPLEMENTATION>/extensions for implementations?
> > > >>
> > > >> Having machine readable definitions could perhaps be useful for
> > > >> generating implementations in some cases.
> > > >
> > > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > > Weston just below, I think this will mostly contain a *description* of
> > > > those different aspects (a specification of the extension type), and
> > > > there is no data that actually fits in a flatbuffer table? In that
> > > > case a plain text (eg markdown) file seems more fitting?
> > >
> > > I agree this is mostly a plain text (or, rather, reST :-)) specification
> > > task.
> > >
> > > Regards
> > >
> > > Antoine.
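
As a rough illustration of the naming and metadata suggestions in the quoted thread (not an agreed convention), the sketch below builds the two key-value entries with a simple, namespaced name and the parameters carried only in the metadata value. The helper `extension_field_metadata` and the name "arrow.fixed_size_tensor" are assumptions made up for this example; JSON is used only because it was the candidate mentioned in the thread, and the exact encoding was still to be discussed separately.

```python
import json


def extension_field_metadata(name, params):
    """Hypothetical helper (not part of any Arrow library).

    Returns the two custom-metadata entries that mark a field as an
    extension type: a plain name and a serialized parameter blob.
    """
    return {
        "ARROW:extension:name": name,
        # Assumption: parameters are serialized as JSON.
        "ARROW:extension:metadata": json.dumps(params),
    }


meta = extension_field_metadata(
    # Assumed name: "arrow" namespace plus a simple, unparametrized type name.
    "arrow.fixed_size_tensor",
    {"type": "int64", "shape": [3, 3, 4], "strides": [12, 4, 1]},
)

# A consumer recovers the parameters from the metadata, not from the name.
decoded = json.loads(meta["ARROW:extension:metadata"])
print(meta["ARROW:extension:name"], decoded["shape"])
```

Note that `json.dumps` emits double-quoted strings; the single-quoted value shown in the quoted example would not parse as strict JSON, which is one more reason to settle the encoding explicitly.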