Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Paul Balança
If I may, I would be very interested in being kept in the loop as well. I
have been working on a small library that makes it easy to declare Python
types and automatically get them supported in PyArrow as extension types (and
then benefit from vectorized ops): https://github.com/balancap/arrowbic

The main feature at the moment is support for dataclasses, NumPy arrays
and enums, but I plan to extend it to as many standard Python patterns as
possible.

In short, for now I am storing the metadata serialized as JSON, but I would
be happy to move to any standard defined in PyArrow, and also to use the
standard representation for tensors / NumPy arrays.

Thank you!
Paul
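
A minimal sketch of what "metadata serialized as JSON" could look like for a
dataclass, using only the Python standard library. This is an illustration
under stated assumptions, not arrowbic's actual format: the `Point` class, the
`dataclass_metadata` helper and the layout of the JSON payload are all
hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Point:
    # Hypothetical user type that would be exposed as an extension type.
    x: float
    y: float


def dataclass_metadata(cls) -> str:
    """Serialize a dataclass's name and field names/types to a JSON string.

    The resulting string is the kind of value that could be stored under
    the ARROW:extension:metadata key (assumed layout, for illustration).
    """
    spec = {
        "class": cls.__name__,
        "fields": [
            {"name": f.name, "type": getattr(f.type, "__name__", str(f.type))}
            for f in fields(cls)
        ],
    }
    return json.dumps(spec, sort_keys=True)


metadata = dataclass_metadata(Point)
restored = json.loads(metadata)  # round-trips through any JSON parser
```

Because the payload is plain JSON, another implementation can read it back
without knowing anything about the Python class that produced it.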




On Tue, 8 Feb 2022, 17:57 Micah Kornfield wrote:

> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
>
> +1 to calling out in docs that the arrow namespace should be reserved.
> maybe "apache.arrow" to lower the possibility of collisions with people who
> already have extension types? (I don't feel too strongly about this).
>
> > Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I do
> > not think we even have json integration files for them.
>
> Agree, we'll likely need a little more thought on what it means to validate
> extension types (is being able to parse extension metadata sufficient?)
>
> > Also, note that Rust's arrow2 supports extension types (tested part of the
> > IPC and c data interface*), and Polars relies on it to allow Python generic
> > "object" in its machinery.
>
> I think this is great for having external verification of specifications,
> but I think for official arrow types, we should be focusing on
> implementations that are under ASF governance.
>
> On Tue, Feb 8, 2022 at 8:32 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Note that we do not have tests on tensor arrays, so testing the extension
> > type on these may be hindered by divergences between implementations. I
> do
> > not think we even have json integration files for them.
> >
> > If the focus is extension types, maybe it would be best to cover types
> > whose physical representations are covered in e.g. IPC or c data
> interface
> > tests.
> >
> > I do not know if we voted on a naming convention, but we may want to
> > reserve a namespace for us (e.g. "arrow").
> >
> > Also, note that Rust's arrow2 supports extension types (tested part of
> the
> > IPC and c data interface*), and Polars relies on it to allow Python
> generic
> > "object" in its machinery.
> >
> > Best,
> > Jorge
> >
> > * pending https://issues.apache.org/jira/browse/ARROW-15613
> >
> >
> >
> > On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > On Mon, 7 Feb 2022 at 21:02, Rok Mihevc  wrote:
> > >
> > > > To follow up the discussion from the bi-weekly Arrow sync:
> > > >
> > > > - JSON seems the most suitable candidate for the extension metadata.
> > > > E.g.: TensorArray
> > > > {"key": "ARROW:extension:name", "value": "tensor shape=(3,
> > > > 3, 4), strides=(12, 4, 1)>"},
> > > > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > > > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> > > >
> > >
> > > I will start a separate thread for the exact encoding of the metadata
> > value
> > > (i.e. JSON or something else) if that's OK. I already started writing
> one
> > > last week anyway, and that keeps things a bit separated.
> > >
> > > For the name of the extension type:
> > > - We might want to use something like "arrow.tensor" to follow the
> > > recommendation at
> > > https://arrow.apache.org/docs/format/Columnar.html#extension-types to
> > use
> > > a
> > > namespace. And so for "well known" extension types that are defined in
> > the
> > > Arrow project itself, I think we can use the "arrow" namespace? (as
> > > example, for the extension types defined in pandas, I used the
> "pandas."
> > > namespace)
> > > - In general, I think it's best to keep the name itself simple, and
> leave
> > > any parametrization out of it (since this is included in the metadata).
> > So
> > > in this case that would be just "tensor" instead of "tensor > > shape=..., ..>".
> > > - Specifically for this extension type, we might want to use something
> > like
> > > "fixed_size_tensor" instead of "tensor", to be able to differentiate in
> > the
> > > future between the tensor type with constant shape vs variable shape (
> > > ARROW-1614  vs
> > > ARROW-8714
> > > ). But that's
> > something
> > > to discuss in the relevant JIRA issue / PR.
> > >
> > > - We want to start with at least one integration test pair. Potential
> > > > candidates are cpp, julia, go, rust.
> > > >
> > >
> > > Rust does not yet seem to support extension types? (
> > > 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Micah Kornfield
> I do not know if we voted on a naming convention, but we may want to
> reserve a namespace for us (e.g. "arrow").

+1 to calling out in docs that the arrow namespace should be reserved.
maybe "apache.arrow" to lower the possibility of collisions with people who
already have extension types? (I don't feel too strongly about this).

> Note that we do not have tests on tensor arrays, so testing the extension
> type on these may be hindered by divergences between implementations. I do
> not think we even have json integration files for them.

Agree, we'll likely need a little more thought on what it means to validate
extension types (is being able to parse extension metadata sufficient?)

> Also, note that Rust's arrow2 supports extension types (tested part of the
> IPC and c data interface*), and Polars relies on it to allow Python generic
> "object" in its machinery.

I think this is great for having external verification of specifications,
but I think for official arrow types, we should be focusing on
implementations that are under ASF governance.

On Tue, Feb 8, 2022 at 8:32 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Note that we do not have tests on tensor arrays, so testing the extension
> type on these may be hindered by divergences between implementations. I do
> not think we even have json integration files for them.
>
> If the focus is extension types, maybe it would be best to cover types
> whose physical representations are covered in e.g. IPC or c data interface
> tests.
>
> I do not know if we voted on a naming convention, but we may want to
> reserve a namespace for us (e.g. "arrow").
>
> Also, note that Rust's arrow2 supports extension types (tested part of the
> IPC and c data interface*), and Polars relies on it to allow Python generic
> "object" in its machinery.
>
> Best,
> Jorge
>
> * pending https://issues.apache.org/jira/browse/ARROW-15613
>
>
>
> On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > On Mon, 7 Feb 2022 at 21:02, Rok Mihevc  wrote:
> >
> > > To follow up the discussion from the bi-weekly Arrow sync:
> > >
> > > - JSON seems the most suitable candidate for the extension metadata.
> > > E.g.: TensorArray
> > > {"key": "ARROW:extension:name", "value": "tensor > > 3, 4), strides=(12, 4, 1)>"},
> > > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> > >
> >
> > I will start a separate thread for the exact encoding of the metadata
> value
> > (i.e. JSON or something else) if that's OK. I already started writing one
> > last week anyway, and that keeps things a bit separated.
> >
> > For the name of the extension type:
> > - We might want to use something like "arrow.tensor" to follow the
> > recommendation at
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types to
> use
> > a
> > namespace. And so for "well known" extension types that are defined in
> the
> > Arrow project itself, I think we can use the "arrow" namespace? (as
> > example, for the extension types defined in pandas, I used the "pandas."
> > namespace)
> > - In general, I think it's best to keep the name itself simple, and leave
> > any parametrization out of it (since this is included in the metadata).
> So
> > in this case that would be just "tensor" instead of "tensor > shape=..., ..>".
> > - Specifically for this extension type, we might want to use something
> like
> > "fixed_size_tensor" instead of "tensor", to be able to differentiate in
> the
> > future between the tensor type with constant shape vs variable shape (
> > ARROW-1614  vs
> > ARROW-8714
> > ). But that's
> something
> > to discuss in the relevant JIRA issue / PR.
> >
> > - We want to start with at least one integration test pair. Potential
> > > candidates are cpp, julia, go, rust.
> > >
> >
> > Rust does not yet seem to support extension types? (
> > https://github.com/apache/arrow-rs/issues/218)
> >
> >
> > > - First well known extension type candidate is TensorArray but other
> > > suggestions are welcome.
> > >
> >
> > Others that I am aware of that have been brought up in the past are UUID
> (
> > ARROW-2152 ), complex
> > numbers (ARROW-638 ,
> this
> > has a PR) and 8-bit boolean values (ARROW-1674
> > ). But I think we
> should
> > mainly look at demand / someone wanting to implement this, and (for you)
> > this seems to be Tensors, so it's fine to focus on that.
> >
> > Joris
> >
> >
> > >
> > > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > >
> > > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc 
> > wrote:
> > > > >>
> > > > >> Thanks for the input Weston!

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Jorge Cardoso Leitão
Note that we do not have tests on tensor arrays, so testing the extension
type on these may be hindered by divergences between implementations. I do
not think we even have json integration files for them.

If the focus is extension types, maybe it would be best to cover types
whose physical representations are covered in e.g. IPC or c data interface
tests.

I do not know if we voted on a naming convention, but we may want to
reserve a namespace for us (e.g. "arrow").

Also, note that Rust's arrow2 supports extension types (tested part of the
IPC and c data interface*), and Polars relies on it to allow Python generic
"object" in its machinery.

Best,
Jorge

* pending https://issues.apache.org/jira/browse/ARROW-15613



On Tue, Feb 8, 2022, 13:52 Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> On Mon, 7 Feb 2022 at 21:02, Rok Mihevc  wrote:
>
> > To follow up the discussion from the bi-weekly Arrow sync:
> >
> > - JSON seems the most suitable candidate for the extension metadata.
> > E.g.: TensorArray
> > {"key": "ARROW:extension:name", "value": "tensor > 3, 4), strides=(12, 4, 1)>"},
> > {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> > 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
> >
>
> I will start a separate thread for the exact encoding of the metadata value
> (i.e. JSON or something else) if that's OK. I already started writing one
> last week anyway, and that keeps things a bit separated.
>
> For the name of the extension type:
> - We might want to use something like "arrow.tensor" to follow the
> recommendation at
> https://arrow.apache.org/docs/format/Columnar.html#extension-types to use
> a
> namespace. And so for "well known" extension types that are defined in the
> Arrow project itself, I think we can use the "arrow" namespace? (as
> example, for the extension types defined in pandas, I used the "pandas."
> namespace)
> - In general, I think it's best to keep the name itself simple, and leave
> any parametrization out of it (since this is included in the metadata). So
> in this case that would be just "tensor" instead of "tensor shape=..., ..>".
> - Specifically for this extension type, we might want to use something like
> "fixed_size_tensor" instead of "tensor", to be able to differentiate in the
> future between the tensor type with constant shape vs variable shape (
> ARROW-1614  vs
> ARROW-8714
> ). But that's something
> to discuss in the relevant JIRA issue / PR.
>
> - We want to start with at least one integration test pair. Potential
> > candidates are cpp, julia, go, rust.
> >
>
> Rust does not yet seem to support extension types? (
> https://github.com/apache/arrow-rs/issues/218)
>
>
> > - First well known extension type candidate is TensorArray but other
> > suggestions are welcome.
> >
>
> Others that I am aware of that have been brought up in the past are UUID (
> ARROW-2152 ), complex
> numbers (ARROW-638 , this
> has a PR) and 8-bit boolean values (ARROW-1674
> ). But I think we should
> mainly look at demand / someone wanting to implement this, and (for you)
> this seems to be Tensors, so it's fine to focus on that.
>
> Joris
>
>
> >
> > On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc 
> wrote:
> > > >>
> > > >> Thanks for the input Weston!
> > > >>
> > > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > > >> loosely arrow//extensions for implementations?
> > > >>
> > > >> Having machine readable definitions could perhaps be useful for
> > > >> generating implementations in some cases.
> > > >
> > > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > > Weston just below, I think this will mostly contain a *description*
> of
> > > > those different aspect (a specification of the extension type), and
> > > > there is no data that actually fits in a flatbuffer table? In that
> > > > case a plain text (eg markdown) file seems more fitting?
> > >
> > > I agree this is mostly a plain text (or, rather, reST :-))
> specification
> > > task.
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-08 Thread Joris Van den Bossche
On Mon, 7 Feb 2022 at 21:02, Rok Mihevc wrote:

> To follow up the discussion from the bi-weekly Arrow sync:
>
> - JSON seems the most suitable candidate for the extension metadata.
> E.g.: TensorArray
> {"key": "ARROW:extension:name", "value": "tensor<..., shape=(3, 3, 4), strides=(12, 4, 1)>"},
> {"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
> 'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}
>

I will start a separate thread for the exact encoding of the metadata value
(i.e. JSON or something else) if that's OK. I already started writing one
last week anyway, and that keeps things a bit separated.

For the name of the extension type:
- We might want to use something like "arrow.tensor" to follow the
recommendation at
https://arrow.apache.org/docs/format/Columnar.html#extension-types to use a
namespace. And so for "well known" extension types that are defined in the
Arrow project itself, I think we can use the "arrow" namespace? (As an
example, for the extension types defined in pandas, I used the "pandas."
namespace.)
- In general, I think it's best to keep the name itself simple, and leave
any parametrization out of it (since this is included in the metadata). So
in this case that would be just "tensor" instead of "tensor<..., shape=..., ..>".
- Specifically for this extension type, we might want to use something like
"fixed_size_tensor" instead of "tensor", to be able to differentiate in the
future between the tensor type with constant shape vs variable shape
(ARROW-1614 vs ARROW-8714). But that's something
to discuss in the relevant JIRA issue / PR.
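
To make the naming points above concrete, here is a small sketch in plain
Python. The helper names are hypothetical and not part of any Arrow library,
and treating "arrow" as the reserved namespace is an assumption, since the
exact prefix was still being discussed in this thread. The example names
passed to the helpers are illustrative only.

```python
RESERVED_NAMESPACE = "arrow"  # assumption: the prefix was not yet agreed on


def split_extension_name(name):
    """Split 'namespace.type_name' at the first dot.

    Returns (namespace, type_name); namespace is '' when the name has no dot.
    """
    head, sep, rest = name.partition(".")
    if not sep:
        return ("", name)
    return (head, rest)


def is_reserved(name):
    """Return True if the extension name lives in the reserved namespace."""
    namespace, _ = split_extension_name(name)
    return namespace == RESERVED_NAMESPACE


# Parameters such as shape stay out of the name and go into the metadata.
proposed = split_extension_name("arrow.fixed_size_tensor")
third_party = is_reserved("pandas.interval")
bare = is_reserved("tensor")
```

Keeping the name free of parameters means a registry can look a type up by
exact string match and leave shape or dtype handling to the metadata parser.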

> - We want to start with at least one integration test pair. Potential
> candidates are cpp, julia, go, rust.
>

Rust does not yet seem to support extension types? (
https://github.com/apache/arrow-rs/issues/218)


> - First well known extension type candidate is TensorArray but other
> suggestions are welcome.
>

Others that I am aware of that have been brought up in the past are UUID
(ARROW-2152), complex numbers (ARROW-638, this has a PR) and 8-bit boolean
values (ARROW-1674). But I think we should
mainly look at demand / someone wanting to implement this, and (for you)
this seems to be Tensors, so it's fine to focus on that.

Joris


>
> On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou 
> wrote:
> >
> >
> > Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc  wrote:
> > >>
> > >> Thanks for the input Weston!
> > >>
> > >> How about arrow/experimental/format/ExtensionTypes.fbs or
> > >> arrow/format/ExtensionTypes.fbs for language independent schema and
> > >> loosely arrow//extensions for implementations?
> > >>
> > >> Having machine readable definitions could perhaps be useful for
> > >> generating implementations in some cases.
> > >
> > > Is it useful to put this in a flatbuffer file? Based on the list from
> > > Weston just below, I think this will mostly contain a *description* of
> > > those different aspect (a specification of the extension type), and
> > > there is no data that actually fits in a flatbuffer table? In that
> > > case a plain text (eg markdown) file seems more fitting?
> >
> > I agree this is mostly a plain text (or, rather, reST :-)) specification
> > task.
> >
> > Regards
> >
> > Antoine.
>


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-02-07 Thread Rok Mihevc
To follow up the discussion from the bi-weekly Arrow sync:

- JSON seems the most suitable candidate for the extension metadata.
E.g.: TensorArray
{"key": "ARROW:extension:name", "value": "tensor"},
{"key": "ARROW:extension:metadata", "value": "{'type': 'int64',
'shape': [3, 3, 4], 'strides': [12, 4, 1]}"}

- We want to start with at least one integration test pair. Potential
candidates are cpp, julia, go, rust.

- First well known extension type candidate is TensorArray but other
suggestions are welcome.
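
A minimal sketch of the two key/value pairs above, using only the Python
standard library rather than any Arrow API. The helper names are hypothetical,
the extension name "tensor" is a placeholder, and the strides are assumed to be
row-major element counts, which is what reproduces [12, 4, 1] for shape
[3, 3, 4]. Note that `json.dumps` emits double-quoted strings, whereas the
example above uses single quotes, which a strict JSON parser would reject.

```python
import json


def row_major_strides(shape):
    """Return strides in elements (not bytes) for a C-ordered tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def tensor_extension_metadata(value_type, shape):
    """Build the two custom-metadata entries for a tensor extension field."""
    payload = {
        "type": value_type,
        "shape": list(shape),
        "strides": row_major_strides(shape),
    }
    return {
        "ARROW:extension:name": "tensor",  # placeholder name
        "ARROW:extension:metadata": json.dumps(payload),
    }


entries = tensor_extension_metadata("int64", (3, 3, 4))
decoded = json.loads(entries["ARROW:extension:metadata"])
```

Any implementation that can parse JSON can recover the type, shape and strides
from the metadata string without needing to parse the extension name.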

On Tue, Jan 25, 2022 at 10:34 AM Antoine Pitrou wrote:
>
>
> Le 25/01/2022 à 10:12, Joris Van den Bossche a écrit :
> > On Sat, 22 Jan 2022 at 20:27, Rok Mihevc  wrote:
> >>
> >> Thanks for the input Weston!
> >>
> >> How about arrow/experimental/format/ExtensionTypes.fbs or
> >> arrow/format/ExtensionTypes.fbs for language independent schema and
> >> loosely arrow//extensions for implementations?
> >>
> >> Having machine readable definitions could perhaps be useful for
> >> generating implementations in some cases.
> >
> > Is it useful to put this in a flatbuffer file? Based on the list from
> > Weston just below, I think this will mostly contain a *description* of
> > those different aspect (a specification of the extension type), and
> > there is no data that actually fits in a flatbuffer table? In that
> > case a plain text (eg markdown) file seems more fitting?
>
> I agree this is mostly a plain text (or, rather, reST :-)) specification
> task.
>
> Regards
>
> Antoine.


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-25 Thread Antoine Pitrou



On 25/01/2022 at 10:12, Joris Van den Bossche wrote:
> On Sat, 22 Jan 2022 at 20:27, Rok Mihevc wrote:
>> Thanks for the input Weston!
>>
>> How about arrow/experimental/format/ExtensionTypes.fbs or
>> arrow/format/ExtensionTypes.fbs for language independent schema and
>> loosely arrow//extensions for implementations?
>>
>> Having machine readable definitions could perhaps be useful for
>> generating implementations in some cases.
>
> Is it useful to put this in a flatbuffer file? Based on the list from
> Weston just below, I think this will mostly contain a *description* of
> those different aspect (a specification of the extension type), and
> there is no data that actually fits in a flatbuffer table? In that
> case a plain text (eg markdown) file seems more fitting?

I agree this is mostly a plain text (or, rather, reST :-)) specification
task.


Regards

Antoine.


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-25 Thread Joris Van den Bossche
On Sat, 22 Jan 2022 at 20:27, Rok Mihevc wrote:
>
> Thanks for the input Weston!
>
> How about arrow/experimental/format/ExtensionTypes.fbs or
> arrow/format/ExtensionTypes.fbs for language independent schema and
> loosely arrow//extensions for implementations?
>
> Having machine readable definitions could perhaps be useful for
> generating implementations in some cases.

Is it useful to put this in a flatbuffer file? Based on the list from
Weston just below, I think this will mostly contain a *description* of
those different aspects (a specification of the extension type), and
there is no data that actually fits in a flatbuffer table? In that
case a plain text (e.g. markdown) file seems more fitting?

>
> > * The name of the extension type (to go in ARROW:extension:name)
> > * A description of the extension type and how it should be used
> > * The storage type of the extension type
> > * The format and meaning of the content that will go into 
> > ARROW:extension:metadata
>
> These sound pretty complete!
>
> I'll wait for a couple of days to see if there's more input and then
> draft a PR. Do we need a vote on this?
>
>
> Best,
> Rok
>
> On Fri, Jan 21, 2022 at 3:07 AM Weston Pace  wrote:
> >
> > Those all seem to be C++ locations.  If we want to define
> > cross-implementation "Well Known Extension Types" then it seems we
> > would want to come up with some kind of language independent agreement
> > (could just be a markdown file but maybe there is some advantage to
> > having something programmatically consumable) describing:
> >
> > * The name of the extension type (to go in ARROW:extension:name)
> > * A description of the extension type and how it should be used
> > * The storage type of the extension type
> > * The format and meaning of the content that will go into
> > ARROW:extension:metadata
> >
> > I think (but am not sure) that, since these are metadata keys, we are
> > supposed to stick to printable ASCII for values (for backwards
> > compatibility).
> >
> > For example, in the docs, we currently have this little blurb about a
> > theoretical tensor extension type:
> >
> > > tensor (multidimensional array) stored as Binary values and
> > > having serialized metadata indicating the data type and shape
> > > of each value. This could be JSON like {'type': 'int8', 'shape':
> > > [4, 5]} for a 4x5 cell tensor.
> >
> > In my mind this file would be somewhat analogous to the way that
> > schema.fbs is the cross implementation "ground truth" for our logical
> > types.
> >
> > Then the C++ implementation would be free to put the implementation
> > (I'd vote for arrow/cpp/extensions but a separate repo is probably ok.
> > I'm -1 on arrow/extensions/...)
> >
> > On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc  wrote:
> > >
> > > To continue the ExtensionType part of this thread - I would like to
> > > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
> > > agreed on an "official" location for "Well Known Extension Types".
> > >
> > > Where could we put these? Some suggestions:
> > >
> > > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
> > > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
> > > * separate repo (e.g. 
> > > github.com/apache/arrow_extensions/cpp/tensor_array.h)
> > >
> > > I'd be happy to also gather other Well Known Extension Types into one
> > > location if this moves forward.
> > >
> > > Rok
> > >
> > > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
> > >
> > > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb  wrote:
> > > >
> > > > I agree with others on this thread. Thanks for writing this down Micah
> > > >
> > > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou  
> > > > wrote:
> > > >
> > > > >
> > > > > I concur with both what Wes and Micah said.
> > > > >
> > > > > As for temporal types, they have wide-spread use and their semantics
> > > > > require dedicated treatment for arithmetic and conversion, so it's
> > > > > helpful to define dedicated types for them, as opposed to mere 
> > > > > annotations.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > Le 30/04/2021 à 16:40, Wes McKinney a écrit :
> > > > > > I agree that the bar for adding new types to the Type union in 
> > > > > > Schema.fbs
> > > > > > should be quite high going forward. Using extension types 
> > > > > > increasingly
> > > > > for
> > > > > > adding specializations of built-in types will mean less burden for
> > > > > > implementations to simply "propagate forward" this data (by 
> > > > > > preserving
> > > > > the
> > > > > > extra metadata) even if they don't understand what it does. It 
> > > > > > would be
> > > > > > nice, therefore, to put us on a path to expanding our set of 
> > > > > > "official"
> > > > > > extension types (which would include things like JSON or UUID) 
> > > > > > since some
> > > > > > libraries may choose to implement convenience containers for these 
> > > > > > for
> > > > > 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-22 Thread Micah Kornfield
Sorry, I meant to add that I think the C++ implementation should go
wherever is most convenient to make it work well in the system (unless
the type requires heavy third-party dependencies).

On Sat, Jan 22, 2022 at 8:53 PM Micah Kornfield wrote:

>  Do we need a vote on this?
>
> I was imagining well known types would follow roughly the same process
> that new types follow (requiring two different language implementations and
> an integration test).  I don't think we need to stick to java as the second
> language though.
>
> On Sat, Jan 22, 2022 at 11:27 AM Rok Mihevc  wrote:
>
>> Thanks for the input Weston!
>>
>> How about arrow/experimental/format/ExtensionTypes.fbs or
>> arrow/format/ExtensionTypes.fbs for language independent schema and
>> loosely arrow//extensions for implementations?
>>
>> Having machine readable definitions could perhaps be useful for
>> generating implementations in some cases.
>>
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> ARROW:extension:metadata
>>
>> These sound pretty complete!
>>
>> I'll wait for a couple of days to see if there's more input and then
>> draft a PR. Do we need a vote on this?
>>
>>
>> Best,
>> Rok
>>
>> On Fri, Jan 21, 2022 at 3:07 AM Weston Pace 
>> wrote:
>> >
>> > Those all seem to be C++ locations.  If we want to define
>> > cross-implementation "Well Known Extension Types" then it seems we
>> > would want to come up with some kind of language independent agreement
>> > (could just be a markdown file but maybe there is some advantage to
>> > having something programmatically consumable) describing:
>> >
>> > * The name of the extension type (to go in ARROW:extension:name)
>> > * A description of the extension type and how it should be used
>> > * The storage type of the extension type
>> > * The format and meaning of the content that will go into
>> > ARROW:extension:metadata
>> >
>> > I think (but am not sure) that, since these are metadata keys, we are
>> > supposed to stick to printable ASCII for values (for backwards
>> > compatibility).
>> >
>> > For example, in the docs, we currently have this little blurb about a
>> > theoretical tensor extension type:
>> >
>> > > tensor (multidimensional array) stored as Binary values and
>> > > having serialized metadata indicating the data type and shape
>> > > of each value. This could be JSON like {'type': 'int8', 'shape':
>> > > [4, 5]} for a 4x5 cell tensor.
>> >
>> > In my mind this file would be somewhat analogous to the way that
>> > schema.fbs is the cross implementation "ground truth" for our logical
>> > types.
>> >
>> > Then the C++ implementation would be free to put the implementation
>> > (I'd vote for arrow/cpp/extensions but a separate repo is probably ok.
>> > I'm -1 on arrow/extensions/...)
>> >
>> > On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc 
>> wrote:
>> > >
>> > > To continue the ExtensionType part of this thread - I would like to
>> > > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
>> > > agreed on an "official" location for "Well Known Extension Types".
>> > >
>> > > Where could we put these? Some suggestions:
>> > >
>> > > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
>> > > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
>> > > * separate repo (e.g.
>> github.com/apache/arrow_extensions/cpp/tensor_array.h)
>> > >
>> > > I'd be happy to also gather other Well Known Extension Types into one
>> > > location if this moves forward.
>> > >
>> > > Rok
>> > >
>> > > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
>> > >
>> > > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb 
>> wrote:
>> > > >
>> > > > I agree with others on this thread. Thanks for writing this down
>> Micah
>> > > >
>> > > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou 
>> wrote:
>> > > >
>> > > > >
>> > > > > I concur with both what Wes and Micah said.
>> > > > >
>> > > > > As for temporal types, they have wide-spread use and their
>> semantics
>> > > > > require dedicated treatment for arithmetic and conversion, so it's
>> > > > > helpful to define dedicated types for them, as opposed to mere
>> annotations.
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > Le 30/04/2021 à 16:40, Wes McKinney a écrit :
>> > > > > > I agree that the bar for adding new types to the Type union in
>> Schema.fbs
>> > > > > > should be quite high going forward. Using extension types
>> increasingly
>> > > > > for
>> > > > > > adding specializations of built-in types will mean less burden
>> for
>> > > > > > implementations to simply "propagate forward" this data (by
>> preserving
>> > > > > the
>> > > > > > extra metadata) even if they don't understand what it does. It
>> would be
>> > > > > > nice, therefore, to put us 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-22 Thread Micah Kornfield
>
>  Do we need a vote on this?

I was imagining well known types would follow roughly the same process that
new types follow (requiring two different language implementations and an
integration test).  I don't think we need to stick to java as the second
language though.

On Sat, Jan 22, 2022 at 11:27 AM Rok Mihevc wrote:

> Thanks for the input Weston!
>
> How about arrow/experimental/format/ExtensionTypes.fbs or
> arrow/format/ExtensionTypes.fbs for language independent schema and
> loosely arrow//extensions for implementations?
>
> Having machine readable definitions could perhaps be useful for
> generating implementations in some cases.
>
> > * The name of the extension type (to go in ARROW:extension:name)
> > * A description of the extension type and how it should be used
> > * The storage type of the extension type
> > * The format and meaning of the content that will go into
> ARROW:extension:metadata
>
> These sound pretty complete!
>
> I'll wait for a couple of days to see if there's more input and then
> draft a PR. Do we need a vote on this?
>
>
> Best,
> Rok
>
> On Fri, Jan 21, 2022 at 3:07 AM Weston Pace  wrote:
> >
> > Those all seem to be C++ locations.  If we want to define
> > cross-implementation "Well Known Extension Types" then it seems we
> > would want to come up with some kind of language independent agreement
> > (could just be a markdown file but maybe there is some advantage to
> > having something programmatically consumable) describing:
> >
> > * The name of the extension type (to go in ARROW:extension:name)
> > * A description of the extension type and how it should be used
> > * The storage type of the extension type
> > * The format and meaning of the content that will go into
> > ARROW:extension:metadata
> >
> > I think (but am not sure) that, since these are metadata keys, we are
> > supposed to stick to printable ASCII for values (for backwards
> > compatibility).
> >
> > For example, in the docs, we currently have this little blurb about a
> > theoretical tensor extension type:
> >
> > > tensor (multidimensional array) stored as Binary values and
> > > having serialized metadata indicating the data type and shape
> > > of each value. This could be JSON like {'type': 'int8', 'shape':
> > > [4, 5]} for a 4x5 cell tensor.
> >
> > In my mind this file would be somewhat analogous to the way that
> > schema.fbs is the cross implementation "ground truth" for our logical
> > types.
> >
> > Then the C++ implementation would be free to put the implementation
> > (I'd vote for arrow/cpp/extensions but a separate repo is probably ok.
> > I'm -1 on arrow/extensions/...)
> >
> > On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc  wrote:
> > >
> > > To continue the ExtensionType part of this thread - I would like to
> > > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
> > > agreed on an "official" location for "Well Known Extension Types".
> > >
> > > Where could we put these? Some suggestions:
> > >
> > > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
> > > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
> > > * separate repo (e.g.
> github.com/apache/arrow_extensions/cpp/tensor_array.h)
> > >
> > > I'd be happy to also gather other Well Known Extension Types into one
> > > location if this moves forward.
> > >
> > > Rok
> > >
> > > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
> > >
> > > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb 
> wrote:
> > > >
> > > > I agree with others on this thread. Thanks for writing this down
> Micah
> > > >
> > > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou 
> wrote:
> > > >
> > > > >
> > > > > I concur with both what Wes and Micah said.
> > > > >
> > > > > As for temporal types, they have wide-spread use and their
> semantics
> > > > > require dedicated treatment for arithmetic and conversion, so it's
> > > > > helpful to define dedicated types for them, as opposed to mere
> annotations.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > > > > I agree that the bar for adding new types to the Type union in
> Schema.fbs
> > > > > > should be quite high going forward. Using extension types
> increasingly
> > > > > for
> > > > > > adding specializations of built-in types will mean less burden
> for
> > > > > > implementations to simply "propagate forward" this data (by
> preserving
> > > > > the
> > > > > > extra metadata) even if they don't understand what it does. It
> would be
> > > > > > nice, therefore, to put us on a path to expanding our set of
> "official"
> > > > > > extension types (which would include things like JSON or UUID)
> since some
> > > > > > libraries may choose to implement convenience containers for
> these for
> > > > > > usability.
> > > > > >
> > > > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette <
> bhule...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > >> 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-22 Thread Rok Mihevc
Thanks for the input Weston!

How about arrow/experimental/format/ExtensionTypes.fbs or
arrow/format/ExtensionTypes.fbs for a language-independent schema and,
loosely, arrow//extensions for implementations?

Having machine-readable definitions could perhaps be useful for
generating implementations in some cases.

> * The name of the extension type (to go in ARROW:extension:name)
> * A description of the extension type and how it should be used
> * The storage type of the extension type
> * The format and meaning of the content that will go into 
> ARROW:extension:metadata

These sound pretty complete!

I'll wait for a couple of days to see if there's more input and then
draft a PR. Do we need a vote on this?


Best,
Rok

On Fri, Jan 21, 2022 at 3:07 AM Weston Pace  wrote:
>
> Those all seem to be C++ locations.  If we want to define
> cross-implementation "Well Known Extension Types" then it seems we
> would want to come up with some kind of language independent agreement
> (could just be a markdown file but maybe there is some advantage to
> having something programmatically consumable) describing:
>
> * The name of the extension type (to go in ARROW:extension:name)
> * A description of the extension type and how it should be used
> * The storage type of the extension type
> * The format and meaning of the content that will go into
> ARROW:extension:metadata
>
> I think (but am not sure) that, since these are metadata keys, we are
> supposed to stick to printable ASCII for values (for backwards
> compatibility).
>
> For example, in the docs, we currently have this little blurb about a
> theoretical tensor extension type:
>
> > tensor (multidimensional array) stored as Binary values and
> > having serialized metadata indicating the data type and shape
> > of each value. This could be JSON like {'type': 'int8', 'shape':
> > [4, 5]} for a 4x5 cell tensor.
>
> In my mind this file would be somewhat analogous to the way that
> schema.fbs is the cross implementation "ground truth" for our logical
> types.
>
> Then the C++ implementation would be free to put the implementation
> (I'd vote for arrow/cpp/extensions but a separate repo is probably ok.
> I'm -1 on arrow/extensions/...)
>
> On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc  wrote:
> >
> > To continue the ExtensionType part of this thread - I would like to
> > add TensorArray [1] as an ExtensionType to Arrow but we have not yet
> > agreed on an "official" location for "Well Known Extension Types".
> >
> > Where could we put these? Some suggestions:
> >
> > * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
> > * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
> > * separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)
> >
> > I'd be happy to also gather other Well Known Extension Types into one
> > location if this moves forward.
> >
> > Rok
> >
> > [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
> >
> > On Sat, May 1, 2021 at 12:12 PM Andrew Lamb  wrote:
> > >
> > > I agree with others on this thread. Thanks for writing this down Micah
> > >
> > > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou  
> > > wrote:
> > >
> > > >
> > > > I concur with both what Wes and Micah said.
> > > >
> > > > As for temporal types, they have wide-spread use and their semantics
> > > > require dedicated treatment for arithmetic and conversion, so it's
> > > > helpful to define dedicated types for them, as opposed to mere 
> > > > annotations.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > > > I agree that the bar for adding new types to the Type union in 
> > > > > Schema.fbs
> > > > > should be quite high going forward. Using extension types increasingly
> > > > for
> > > > > adding specializations of built-in types will mean less burden for
> > > > > implementations to simply "propagate forward" this data (by preserving
> > > > the
> > > > > extra metadata) even if they don't understand what it does. It would 
> > > > > be
> > > > > nice, therefore, to put us on a path to expanding our set of 
> > > > > "official"
> > > > > extension types (which would include things like JSON or UUID) since 
> > > > > some
> > > > > libraries may choose to implement convenience containers for these for
> > > > > usability.
> > > > >
> > > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette 
> > > > wrote:
> > > > >
> > > > >> +1 this looks good to me.
> > > > >>
> > > > >> My only concern is with criteria #3 " Is the underlying encoding of 
> > > > >> the
> > > > >> type already semantically supported by a type?". I think this is a 
> > > > >> good
> > > > >> criteria, but it's inconsistent with the current spec. By that 
> > > > >> criteria
> > > > >> some existing types (Timestamp, Time, Duration, Date) should be well
> > > > known
> > > > >> extension types, right?
> > > > >>
> > > > >> Perhaps we should explicitly indicate these types are 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-20 Thread Weston Pace
Those all seem to be C++ locations.  If we want to define
cross-implementation "Well Known Extension Types" then it seems we
would want to come up with some kind of language-independent agreement
(could just be a markdown file, but maybe there is some advantage to
having something programmatically consumable) describing:

* The name of the extension type (to go in ARROW:extension:name)
* A description of the extension type and how it should be used
* The storage type of the extension type
* The format and meaning of the content that will go into
ARROW:extension:metadata

I think (but am not sure) that, since these are metadata keys, we are
supposed to stick to printable ASCII for values (for backwards
compatibility).
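
A minimal sketch of what one such entry could look like in programmatically
consumable form. Only the two key names (ARROW:extension:name and
ARROW:extension:metadata) come from the list above. The class, the helper
functions, the name "example.tensor" and the choice of JSON for the metadata
are assumptions made for illustration, and the printable-ASCII check simply
mirrors the tentative reading stated above rather than a confirmed rule.

```python
import json
from dataclasses import dataclass

# Key names taken from the list above; everything else is hypothetical.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class WellKnownExtensionType:
    """Hypothetical record for one entry in a language-independent registry."""

    name: str             # goes in ARROW:extension:name
    description: str      # what the type is and how it should be used
    storage_type: str     # name of the Arrow type used for storage
    metadata_format: str  # format and meaning of ARROW:extension:metadata


def is_printable_ascii(text: str) -> bool:
    """True if every character is in the printable ASCII range 0x20-0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


def field_metadata(ext: WellKnownExtensionType, params: dict) -> dict:
    """Build the two custom-metadata entries for a field of this type."""
    # json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
    serialized = json.dumps(params, sort_keys=True)
    if not (is_printable_ascii(ext.name) and is_printable_ascii(serialized)):
        raise ValueError("extension metadata must be printable ASCII")
    return {EXTENSION_NAME_KEY: ext.name, EXTENSION_METADATA_KEY: serialized}


tensor = WellKnownExtensionType(
    name="example.tensor",  # hypothetical name, not an agreed one
    description="Multidimensional array with a fixed shape per value",
    storage_type="binary",
    metadata_format="JSON object with 'type' and 'shape' keys",
)
meta = field_metadata(tensor, {"type": "int8", "shape": [4, 5]})
print(meta)
```

Keeping the four items as plain fields means the same record could be written
out as markdown, JSON or a Flatbuffers table without changing its content.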

For example, in the docs, we currently have this little blurb about a
theoretical tensor extension type:

> tensor (multidimensional array) stored as Binary values and
> having serialized metadata indicating the data type and shape
> of each value. This could be JSON like {'type': 'int8', 'shape':
> [4, 5]} for a 4x5 cell tensor.
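
To make the blurb concrete, here is a rough sketch of how a consumer might
decode one Binary value using that metadata. It is not how any Arrow
implementation does it: the function name, the type-code table, the
little-endian byte order and the row-major layout are all assumptions. Note
also that the blurb writes the metadata with single quotes, which a JSON
parser would reject, so the sketch uses double quotes.

```python
import json
import struct

# Hypothetical metadata string in the shape the docs blurb suggests.
metadata = '{"type": "int8", "shape": [4, 5]}'

# Assumed mapping from type name to struct format code; only a few shown.
STRUCT_CODES = {"int8": "b", "int16": "h", "int32": "i", "float64": "d"}


def decode_tensor(value: bytes, metadata_json: str) -> list:
    """Decode one Binary value into nested lists, assuming row-major order."""
    info = json.loads(metadata_json)
    shape = info["shape"]
    count = 1
    for dim in shape:
        count *= dim
    # "<" selects little-endian with no padding between elements.
    code = STRUCT_CODES[info["type"]]
    flat = list(struct.unpack(f"<{count}{code}", value))
    # Fold the flat list into nested rows, innermost dimension first.
    for dim in reversed(shape[1:]):
        flat = [flat[i:i + dim] for i in range(0, len(flat), dim)]
    return flat


raw = bytes(range(20))  # 4 * 5 int8 values: 0..19
cells = decode_tensor(raw, metadata)
print(len(cells), len(cells[0]))
```

struct.unpack raises struct.error if the value's length does not match the
shape, which gives a cheap consistency check between data and metadata.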

In my mind this file would be somewhat analogous to the way that
Schema.fbs is the cross-implementation "ground truth" for our logical
types.

Then the C++ implementation would be free to choose where to put its code
(I'd vote for arrow/cpp/extensions, but a separate repo is probably OK.
I'm -1 on arrow/extensions/...).

On Thu, Jan 20, 2022 at 3:20 PM Rok Mihevc  wrote:
>
> To continue the ExtensionType part of this thread - I would like to
> add TensorArray [1] as an ExtensionType to Arrow but we have not yet
> agreed on an "official" location for "Well Known Extension Types".
>
> Where could we put these? Some suggestions:
>
> * implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
> * extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
> * separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)
>
> I'd be happy to also gather other Well Known Extension Types into one
> location if this moves forward.
>
> Rok
>
> [1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389
>
> On Sat, May 1, 2021 at 12:12 PM Andrew Lamb  wrote:
> >
> > I agree with others on this thread. Thanks for writing this down Micah
> >
> > On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou  wrote:
> >
> > >
> > > I concur with both what Wes and Micah said.
> > >
> > > As for temporal types, they have wide-spread use and their semantics
> > > require dedicated treatment for arithmetic and conversion, so it's
> > > helpful to define dedicated types for them, as opposed to mere 
> > > annotations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > > I agree that the bar for adding new types to the Type union in 
> > > > Schema.fbs
> > > > should be quite high going forward. Using extension types increasingly
> > > for
> > > > adding specializations of built-in types will mean less burden for
> > > > implementations to simply "propagate forward" this data (by preserving
> > > the
> > > > extra metadata) even if they don't understand what it does. It would be
> > > > nice, therefore, to put us on a path to expanding our set of "official"
> > > > extension types (which would include things like JSON or UUID) since 
> > > > some
> > > > libraries may choose to implement convenience containers for these for
> > > > usability.
> > > >
> > > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette 
> > > wrote:
> > > >
> > > >> +1 this looks good to me.
> > > >>
> > > >> My only concern is with criteria #3 " Is the underlying encoding of the
> > > >> type already semantically supported by a type?". I think this is a good
> > > >> criteria, but it's inconsistent with the current spec. By that criteria
> > > >> some existing types (Timestamp, Time, Duration, Date) should be well
> > > known
> > > >> extension types, right?
> > > >>
> > > >> Perhaps we should explicitly indicate these types are grandfathered in
> > > [1]
> > > >> because they existed before extension types, to avoid tension with this
> > > >> criteria.
> > > >>
> > > >> Brian
> > > >>
> > > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
> > > >>
> > > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
> > > >> jorgecarlei...@gmail.com> wrote:
> > > >>
> > > >>> Thanks for writing this.
> > > >>>
> > > >>> I agree. That is a good decision tree. +1
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>>
> > > > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield wrote:
> > > >>>
> > >  The discussion around adding another interval type to the Schema.fbs
> > > >>> raises
> > >  the issue of when do we decide to add a new type to the Schema.fbs vs
> > > >>> using
> > >  other means (primarily extension types [1]).
> > > 
> > >  A few criteria come to mind that could help decide (feedback 
> > >  welcome):
> > > 
> > >  1.  Is the type a new parameterization of an existing type?
> > > 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2022-01-20 Thread Rok Mihevc
To continue the ExtensionType part of this thread - I would like to
add TensorArray [1] as an ExtensionType to Arrow but we have not yet
agreed on an "official" location for "Well Known Extension Types".

Where could we put these? Some suggestions:

* implementation folders (e.g. arrow/cpp/extensions/tensor_array.h)
* extensions folder (e.g. arrow/extensions/cpp/tensor_array.h)
* separate repo (e.g. github.com/apache/arrow_extensions/cpp/tensor_array.h)

I'd be happy to also gather other Well Known Extension Types into one
location if this moves forward.

Rok

[1] https://github.com/apache/arrow/pull/8510#issuecomment-991150389

On Sat, May 1, 2021 at 12:12 PM Andrew Lamb  wrote:
>
> I agree with others on this thread. Thanks for writing this down Micah
>
> On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou  wrote:
>
> >
> > I concur with both what Wes and Micah said.
> >
> > As for temporal types, they have wide-spread use and their semantics
> > require dedicated treatment for arithmetic and conversion, so it's
> > helpful to define dedicated types for them, as opposed to mere annotations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > > I agree that the bar for adding new types to the Type union in Schema.fbs
> > > should be quite high going forward. Using extension types increasingly
> > for
> > > adding specializations of built-in types will mean less burden for
> > > implementations to simply "propagate forward" this data (by preserving
> > the
> > > extra metadata) even if they don't understand what it does. It would be
> > > nice, therefore, to put us on a path to expanding our set of "official"
> > > extension types (which would include things like JSON or UUID) since some
> > > libraries may choose to implement convenience containers for these for
> > > usability.
> > >
> > > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette 
> > wrote:
> > >
> > >> +1 this looks good to me.
> > >>
> > >> My only concern is with criteria #3 " Is the underlying encoding of the
> > >> type already semantically supported by a type?". I think this is a good
> > >> criteria, but it's inconsistent with the current spec. By that criteria
> > >> some existing types (Timestamp, Time, Duration, Date) should be well
> > known
> > >> extension types, right?
> > >>
> > >> Perhaps we should explicitly indicate these types are grandfathered in
> > [1]
> > >> because they existed before extension types, to avoid tension with this
> > >> criteria.
> > >>
> > >> Brian
> > >>
> > >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
> > >>
> > >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
> > >> jorgecarlei...@gmail.com> wrote:
> > >>
> > >>> Thanks for writing this.
> > >>>
> > >>> I agree. That is a good decision tree. +1
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>>
> > > >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield wrote:
> > >>>
> >  The discussion around adding another interval type to the Schema.fbs
> > >>> raises
> >  the issue of when do we decide to add a new type to the Schema.fbs vs
> > >>> using
> >  other means (primarily extension types [1]).
> > 
> >  A few criteria come to mind that could help decide (feedback welcome):
> > 
> >  1.  Is the type a new parameterization of an existing type?
> >   - If Yes, and we believe the parameterization is useful and can
> > be
> > >>> done
> >  in a forward/backward compatible manner then we would update
> > >> Schema.fbs.
> > 
> >  2.  Does the type itself have its own specification for processing
> > >> (e.g.
> >  JSON, BSON, Thrift, Avro, Protobuf)?
> > - If yes, we would NOT add them to Schema.fbs.  I think this would
> >  potentially yield too many new types.
> > 
> >  3.  Is the underlying encoding of the type already semantically
> > >> supported
> >  by a type? (e.g. if we want to encode physical lengths like meters
> > >> these
> >  can be represented by an integer).
> >  - If yes, we would NOT update the specification.  This seems like
> > >> the
> >  exact use-case that extension types are meant for.
> > 
> >  * How does this apply to Interval? *
> >  Interval extends an existing type in the specification and multiple
> > >>> "packed
> >  fields" cannot be easily communicated with the current version of the
> >  specification.  Hence, I feel comfortable making the addition to
> > >>> Schema.fbs
> > 
> >  * What does this mean for other common types? *
> > 
> >  I think as types come up that are very common but we don't want to add
> > >> to
> >  the Schema.fbs we should invest in formalizing them as "Well Known"
> >  Extension types.  In this scenario, we would update the specification
> > >> to
> >  include how to specify the extension type metadata (and still require
> > >> at
> >  least two libraries support the Extension type before inclusion 

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-05-01 Thread Andrew Lamb
I agree with others on this thread. Thanks for writing this down, Micah.

On Fri, Apr 30, 2021 at 11:16 AM Antoine Pitrou  wrote:

>
> I concur with both what Wes and Micah said.
>
> As for temporal types, they have wide-spread use and their semantics
> require dedicated treatment for arithmetic and conversion, so it's
> helpful to define dedicated types for them, as opposed to mere annotations.
>
> Regards
>
> Antoine.
>
>
> > On 30/04/2021 at 16:40, Wes McKinney wrote:
> > I agree that the bar for adding new types to the Type union in Schema.fbs
> > should be quite high going forward. Using extension types increasingly
> for
> > adding specializations of built-in types will mean less burden for
> > implementations to simply "propagate forward" this data (by preserving
> the
> > extra metadata) even if they don't understand what it does. It would be
> > nice, therefore, to put us on a path to expanding our set of "official"
> > extension types (which would include things like JSON or UUID) since some
> > libraries may choose to implement convenience containers for these for
> > usability.
> >
> > On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette 
> wrote:
> >
> >> +1 this looks good to me.
> >>
> >> My only concern is with criteria #3 " Is the underlying encoding of the
> >> type already semantically supported by a type?". I think this is a good
> >> criteria, but it's inconsistent with the current spec. By that criteria
> >> some existing types (Timestamp, Time, Duration, Date) should be well
> known
> >> extension types, right?
> >>
> >> Perhaps we should explicitly indicate these types are grandfathered in
> [1]
> >> because they existed before extension types, to avoid tension with this
> >> criteria.
> >>
> >> Brian
> >>
> >> [1] https://en.wikipedia.org/wiki/Grandfather_clause
> >>
> >> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
> >> jorgecarlei...@gmail.com> wrote:
> >>
> >>> Thanks for writing this.
> >>>
> >>> I agree. That is a good decision tree. +1
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>>
> >>> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield wrote:
> >>>
>  The discussion around adding another interval type to the Schema.fbs
> >>> raises
>  the issue of when do we decide to add a new type to the Schema.fbs vs
> >>> using
>  other means (primarily extension types [1]).
> 
>  A few criteria come to mind that could help decide (feedback welcome):
> 
>  1.  Is the type a new parameterization of an existing type?
>   - If Yes, and we believe the parameterization is useful and can
> be
> >>> done
>  in a forward/backward compatible manner then we would update
> >> Schema.fbs.
> 
>  2.  Does the type itself have its own specification for processing
> >> (e.g.
>  JSON, BSON, Thrift, Avro, Protobuf)?
> - If yes, we would NOT add them to Schema.fbs.  I think this would
>  potentially yield too many new types.
> 
>  3.  Is the underlying encoding of the type already semantically
> >> supported
>  by a type? (e.g. if we want to encode physical lengths like meters
> >> these
>  can be represented by an integer).
>  - If yes, we would NOT update the specification.  This seems like
> >> the
>  exact use-case that extension types are meant for.
> 
>  * How does this apply to Interval? *
>  Interval extends an existing type in the specification and multiple
> >>> "packed
>  fields" cannot be easily communicated with the current version of the
>  specification.  Hence, I feel comfortable making the addition to
> >>> Schema.fbs
> 
>  * What does this mean for other common types? *
> 
>  I think as types come up that are very common but we don't want to add
> >> to
>  the Schema.fbs we should invest in formalizing them as "Well Known"
>  Extension types.  In this scenario, we would update the specification
> >> to
>  include how to specify the extension type metadata (and still require
> >> at
>  least two libraries support the Extension type before inclusion as
> >> "Well
>  Known").
> 
>  * Practical implications *
> 
>  I think this means the type system in Schema.fbs is mostly closed
> (i.e.
>  there is a high bar for adding new types). One potentially useful type
> >> to
>  have would be a "packed struct" that supports something similar to
> >> python
>  struct library [2].  I think this would likely cover many extension
> >> type
>  use-cases.
> 
>  Thoughts?
> 
>  -Micah
> 
>  [1]
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>  [2] https://docs.python.org/3/library/struct.html
> 
> >>>
> >>
> >
>
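
For readers unfamiliar with the Python struct library mentioned in the quoted
proposal, a small sketch of the "packed fields" idea follows. The three-field
layout (two 32-bit integers and one 64-bit integer) and the field names are
chosen purely for illustration and are not taken from the proposal.

```python
import struct

# Hypothetical layout: months (int32), days (int32), nanoseconds (int64),
# little-endian with no padding between fields.
LAYOUT = struct.Struct("<iiq")

# Pack three fields into one fixed-width binary value.
packed = LAYOUT.pack(1, 15, 500)
print(LAYOUT.size)  # 4 + 4 + 8 bytes

# Unpack them again on the consuming side.
months, days, nanos = LAYOUT.unpack(packed)
print(months, days, nanos)
```

A "packed struct" type in the sense of the proposal would presumably carry a
format description like "<iiq" alongside fixed-width storage, so a consumer
could split each value into its fields without a dedicated logical type.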


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-30 Thread Antoine Pitrou



I concur with what both Wes and Micah said.

As for temporal types, they have widespread use and their semantics
require dedicated treatment for arithmetic and conversion, so it's
helpful to define dedicated types for them, as opposed to mere annotations.


Regards

Antoine.


On 30/04/2021 at 16:40, Wes McKinney wrote:

I agree that the bar for adding new types to the Type union in Schema.fbs
should be quite high going forward. Using extension types increasingly for
adding specializations of built-in types will mean less burden for
implementations to simply "propagate forward" this data (by preserving the
extra metadata) even if they don't understand what it does. It would be
nice, therefore, to put us on a path to expanding our set of "official"
extension types (which would include things like JSON or UUID) since some
libraries may choose to implement convenience containers for these for
usability.

On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette  wrote:


+1 this looks good to me.

My only concern is with criteria #3 " Is the underlying encoding of the
type already semantically supported by a type?". I think this is a good
criteria, but it's inconsistent with the current spec. By that criteria
some existing types (Timestamp, Time, Duration, Date) should be well known
extension types, right?

Perhaps we should explicitly indicate these types are grandfathered in [1]
because they existed before extension types, to avoid tension with this
criteria.

Brian

[1] https://en.wikipedia.org/wiki/Grandfather_clause

On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:


Thanks for writing this.

I agree. That is a good decision tree. +1

Best,
Jorge


On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield 
wrote:


The discussion around adding another interval type to the Schema.fbs raises
the issue of when do we decide to add a new type to the Schema.fbs vs using
other means (primarily extension types [1]).

A few criteria come to mind that could help decide (feedback welcome):

1.  Is the type a new parameterization of an existing type?
    - If Yes, and we believe the parameterization is useful and can be done
in a forward/backward compatible manner then we would update Schema.fbs.

2.  Does the type itself have its own specification for processing (e.g.
JSON, BSON, Thrift, Avro, Protobuf)?
    - If yes, we would NOT add them to Schema.fbs.  I think this would
potentially yield too many new types.

3.  Is the underlying encoding of the type already semantically supported
by a type? (e.g. if we want to encode physical lengths like meters these
can be represented by an integer).
    - If yes, we would NOT update the specification.  This seems like the
exact use-case that extension types are meant for.

* How does this apply to Interval? *
Interval extends an existing type in the specification and multiple "packed
fields" cannot be easily communicated with the current version of the
specification.  Hence, I feel comfortable making the addition to Schema.fbs

* What does this mean for other common types? *

I think as types come up that are very common but we don't want to add to
the Schema.fbs we should invest in formalizing them as "Well Known"
Extension types.  In this scenario, we would update the specification to
include how to specify the extension type metadata (and still require at
least two libraries support the Extension type before inclusion as "Well
Known").

* Practical implications *

I think this means the type system in Schema.fbs is mostly closed (i.e.
there is a high bar for adding new types). One potentially useful type to
have would be a "packed struct" that supports something similar to python
struct library [2].  I think this would likely cover many extension type
use-cases.

Thoughts?

-Micah

[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[2] https://docs.python.org/3/library/struct.html









Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-30 Thread Wes McKinney
I agree that the bar for adding new types to the Type union in Schema.fbs
should be quite high going forward. Increasingly using extension types to
add specializations of built-in types will mean less burden for
implementations, which can simply "propagate forward" this data (by preserving
the extra metadata) even if they don't understand what it does. It would be
nice, therefore, to put us on a path to expanding our set of "official"
extension types (which would include things like JSON or UUID) since some
libraries may choose to implement convenience containers for these for
usability.
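
As a rough illustration of "propagating forward", the sketch below passes a
field through an operation without interpreting its extension type. The Field
class, the function names and the extension names "example.uuid" and
"example.json" are hypothetical stand-ins, not any library's API. Only the
two metadata key names are taken from the Arrow format.

```python
from dataclasses import dataclass, field

EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"

# Extension names this hypothetical implementation can interpret.
KNOWN_EXTENSIONS = {"example.uuid"}


@dataclass
class Field:
    """Simplified, hypothetical stand-in for an Arrow field."""

    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def rename_field(f: Field, new_name: str) -> Field:
    """An operation that does not care about the extension type.

    The metadata is copied untouched, so an unknown extension survives.
    """
    return Field(new_name, f.storage_type, dict(f.metadata))


def describe(f: Field) -> str:
    """Report how this implementation sees the field."""
    ext = f.metadata.get(EXTENSION_NAME_KEY)
    if ext is None:
        return f"{f.name}: {f.storage_type}"
    if ext in KNOWN_EXTENSIONS:
        return f"{f.name}: extension {ext}"
    # Unknown extension: fall back to the storage type, keep the metadata.
    return f"{f.name}: {f.storage_type} (unrecognized extension {ext})"


incoming = Field(
    "payload",
    "binary",
    {EXTENSION_NAME_KEY: "example.json", EXTENSION_METADATA_KEY: "{}"},
)
outgoing = rename_field(incoming, "body")
print(describe(outgoing))
```

The point is that the unknown extension costs the implementation nothing
beyond copying a dictionary, whereas a new entry in the Type union would
have to be handled explicitly.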

On Fri, Apr 30, 2021 at 9:22 AM Brian Hulette  wrote:

> +1 this looks good to me.
>
> My only concern is with criteria #3 " Is the underlying encoding of the
> type already semantically supported by a type?". I think this is a good
> criteria, but it's inconsistent with the current spec. By that criteria
> some existing types (Timestamp, Time, Duration, Date) should be well known
> extension types, right?
>
> Perhaps we should explicitly indicate these types are grandfathered in [1]
> because they existed before extension types, to avoid tension with this
> criteria.
>
> Brian
>
> [1] https://en.wikipedia.org/wiki/Grandfather_clause
>
> On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Thanks for writing this.
> >
> > I agree. That is a good decision tree. +1
> >
> > Best,
> > Jorge
> >
> >
> > On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield 
> > wrote:
> >
> > > The discussion around adding another interval type to the Schema.fbs
> > raises
> > > the issue of when do we decide to add a new type to the Schema.fbs vs
> > using
> > > other means (primarily extension types [1]).
> > >
> > > A few criteria come to mind that could help decide (feedback welcome):
> > >
> > > 1.  Is the type a new parameterization of an existing type?
> > > - If Yes, and we believe the parameterization is useful and can be
> > done
> > > in a forward/backward compatible manner then we would update
> Schema.fbs.
> > >
> > > 2.  Does the type itself have its own specification for processing
> (e.g.
> > > JSON, BSON, Thrift, Avro, Protobuf)?
> > >   - If yes, we would NOT add them to Schema.fbs.  I think this would
> > > potentially yield too many new types.
> > >
> > > 3.  Is the underlying encoding of the type already semantically
> supported
> > > by a type? (e.g. if we want to encode physical lengths like meters
> these
> > > can be represented by an integer).
> > >- If yes, we would NOT update the specification.  This seems like
> the
> > > exact use-case that extension types are meant for.
> > >
> > > * How does this apply to Interval? *
> > > Interval extends an existing type in the specification and multiple
> > "packed
> > > fields" cannot be easily communicated with the current version of the
> > > specification.  Hence, I feel comfortable making the addition to
> > Schema.fbs
> > >
> > > * What does this mean for other common types? *
> > >
> > > I think as types come up that are very common but we don't want to add
> to
> > > the Schema.fbs we should invest in formalizing them as "Well Known"
> > > Extension types.  In this scenario, we would update the specification
> to
> > > include how to specify the extension type metadata (and still require
> at
> > > least two libraries support the Extension type before inclusion as
> "Well
> > > Known").
> > >
> > > * Practical implications *
> > >
> > > I think this means the type system in Schema.fbs is mostly closed (i.e.
> > > there is a high bar for adding new types). One potentially useful type
> to
> > > have would be a "packed struct" that supports something similar to
> python
> > > struct library [2].  I think this would likely cover many extension
> type
> > > use-cases.
> > >
> > > Thoughts?
> > >
> > > -Micah
> > >
> > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > > [2] https://docs.python.org/3/library/struct.html
> > >
> >
>


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-30 Thread Brian Hulette
+1 this looks good to me.

My only concern is with criterion #3 " Is the underlying encoding of the
type already semantically supported by a type?". I think this is a good
criterion, but it's inconsistent with the current spec. By that criterion
some existing types (Timestamp, Time, Duration, Date) should be well-known
extension types, right?

Perhaps we should explicitly indicate these types are grandfathered in [1]
because they existed before extension types, to avoid tension with this
criterion.

Brian

[1] https://en.wikipedia.org/wiki/Grandfather_clause
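
The three criteria quoted below can be read as a small decision function; a
sketch follows. The order of the checks, the field names, the UNDECIDED
fallback and the mapping of "NOT add to Schema.fbs" onto "use an extension
type" are assumptions, since the proposal only names extension types as the
primary alternative.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    UPDATE_SCHEMA_FBS = "update Schema.fbs"
    EXTENSION_TYPE = "use an extension type"
    UNDECIDED = "not covered by the criteria"


@dataclass
class Proposal:
    new_parameterization: bool        # criterion 1
    useful_and_compatible: bool       # criterion 1, the "if yes" conditions
    has_own_specification: bool       # criterion 2 (JSON, BSON, Avro, ...)
    encoding_already_supported: bool  # criterion 3


def decide(p: Proposal) -> Decision:
    """Apply the three criteria in the order they are listed."""
    if p.new_parameterization and p.useful_and_compatible:
        return Decision.UPDATE_SCHEMA_FBS
    if p.has_own_specification:
        return Decision.EXTENSION_TYPE
    if p.encoding_already_supported:
        return Decision.EXTENSION_TYPE
    return Decision.UNDECIDED


# Interval, as argued in the proposal: extends an existing type.
interval = Proposal(True, True, False, False)
# A length in meters stored as an integer: the criterion 3 example.
meters = Proposal(False, False, False, True)
# Timestamp, under the reading above: stored as an integer, so criterion 3
# would send it to an extension type if it were proposed today.
timestamp = Proposal(False, False, False, True)

print(decide(interval).value)
print(decide(meters).value)
print(decide(timestamp).value)
```

The timestamp case is the tension described above: applied literally,
criterion 3 classifies an existing built-in type as an extension type, which
is why an explicit grandfather note would help.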

On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thanks for writing this.
>
> I agree. That is a good decision tree. +1
>
> Best,
> Jorge
>
>
> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield 
> wrote:
>
> > The discussion around adding another interval type to the Schema.fbs
> > raises the issue of when we decide to add a new type to the Schema.fbs
> > vs using other means (primarily extension types [1]).
> >
> > A few criteria come to mind that could help decide (feedback welcome):
> >
> > 1.  Is the type a new parameterization of an existing type?
> >     - If yes, and we believe the parameterization is useful and can be
> >       done in a forward/backward compatible manner, then we would update
> >       Schema.fbs.
> >
> > 2.  Does the type itself have its own specification for processing (e.g.
> >     JSON, BSON, Thrift, Avro, Protobuf)?
> >     - If yes, we would NOT add them to Schema.fbs.  I think this would
> >       potentially yield too many new types.
> >
> > 3.  Is the underlying encoding of the type already semantically supported
> >     by a type? (e.g. if we want to encode physical lengths like meters
> >     these can be represented by an integer).
> >     - If yes, we would NOT update the specification.  This seems like the
> >       exact use-case that extension types are meant for.
> >
> > * How does this apply to Interval? *
> > Interval extends an existing type in the specification and multiple
> > "packed fields" cannot be easily communicated with the current version
> > of the specification.  Hence, I feel comfortable making the addition to
> > Schema.fbs
> >
> > * What does this mean for other common types? *
> >
> > I think as types come up that are very common but we don't want to add
> > to the Schema.fbs we should invest in formalizing them as "Well Known"
> > Extension types.  In this scenario, we would update the specification
> > to include how to specify the extension type metadata (and still
> > require at least two libraries support the Extension type before
> > inclusion as "Well Known").
> >
> > * Practical implications *
> >
> > I think this means the type system in Schema.fbs is mostly closed (i.e.
> > there is a high bar for adding new types). One potentially useful type
> > to have would be a "packed struct" that supports something similar to
> > python struct library [2].  I think this would likely cover many
> > extension type use-cases.
> >
> > Thoughts?
> >
> > -Micah
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > [2] https://docs.python.org/3/library/struct.html
> >
>
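
The three criteria in the quoted proposal can be read as a small decision
procedure. The sketch below is one possible reading, not an agreed rule:
the TypeProposal structure, its field names, the order in which the
criteria are checked, and the "needs discussion" fallback are all
assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class TypeProposal:
    """Answers to the three questions for a proposed type (hypothetical)."""

    name: str
    parameterizes_existing_type: bool = False
    useful_and_compatible: bool = False  # forward/backward compatible change
    has_own_processing_spec: bool = False  # e.g. JSON, BSON, Thrift, Avro
    encoding_already_supported: bool = False  # e.g. meters stored as an integer


def decide(proposal: TypeProposal) -> str:
    """Apply the criteria in the order they are listed in the proposal."""
    # Criterion 1: a useful, compatible parameterization goes into Schema.fbs.
    if proposal.parameterizes_existing_type and proposal.useful_and_compatible:
        return "update Schema.fbs"
    # Criterion 2: types with their own processing spec stay out of Schema.fbs.
    if proposal.has_own_processing_spec:
        return "extension type"
    # Criterion 3: an existing type already covers the encoding.
    if proposal.encoding_already_supported:
        return "extension type"
    # The proposal does not say what happens otherwise; this is a placeholder.
    return "needs discussion"


interval = TypeProposal(
    "interval", parameterizes_existing_type=True, useful_and_compatible=True
)
meters = TypeProposal("length_in_meters", encoding_already_supported=True)
```

Under this reading, decide(interval) selects the Schema.fbs route and
decide(meters) selects an extension type, matching the two examples given
in the proposal.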


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-29 Thread Jorge Cardoso Leitão
Thanks for writing this.

I agree. That is a good decision tree. +1

Best,
Jorge


On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield 
wrote:

> The discussion around adding another interval type to the Schema.fbs raises
> the issue of when we decide to add a new type to the Schema.fbs vs using
> other means (primarily extension types [1]).
>
> A few criteria come to mind that could help decide (feedback welcome):
>
> 1.  Is the type a new parameterization of an existing type?
>     - If yes, and we believe the parameterization is useful and can be done
>       in a forward/backward compatible manner, then we would update
>       Schema.fbs.
>
> 2.  Does the type itself have its own specification for processing (e.g.
>     JSON, BSON, Thrift, Avro, Protobuf)?
>     - If yes, we would NOT add them to Schema.fbs.  I think this would
>       potentially yield too many new types.
>
> 3.  Is the underlying encoding of the type already semantically supported
>     by a type? (e.g. if we want to encode physical lengths like meters these
>     can be represented by an integer).
>     - If yes, we would NOT update the specification.  This seems like the
>       exact use-case that extension types are meant for.
>
> * How does this apply to Interval? *
> Interval extends an existing type in the specification and multiple "packed
> fields" cannot be easily communicated with the current version of the
> specification.  Hence, I feel comfortable making the addition to Schema.fbs
>
> * What does this mean for other common types? *
>
> I think as types come up that are very common but we don't want to add to
> the Schema.fbs we should invest in formalizing them as "Well Known"
> Extension types.  In this scenario, we would update the specification to
> include how to specify the extension type metadata (and still require at
> least two libraries support the Extension type before inclusion as "Well
> Known").
>
> * Practical implications *
>
> I think this means the type system in Schema.fbs is mostly closed (i.e.
> there is a high bar for adding new types). One potentially useful type to
> have would be a "packed struct" that supports something similar to python
> struct library [2].  I think this would likely cover many extension type
> use-cases.
>
> Thoughts?
>
> -Micah
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> [2] https://docs.python.org/3/library/struct.html
>