>
>  It might help this discussion and future discussions like it if we could
> define how it is determined whether a type should be part of the Arrow
> format, an extension type (and what does it mean to say there is a
> "canonical" extension type), or just something that a language
> implementation or downstream library builds for itself with metadata. I
> feel like this has come up before but I don't recall a resolution.


There seemed to be consensus, but I guess we never formally voted on the
decision points here:
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E

Applying the criteria to complex types:
1.  Is the type a new parameterization of an existing type?  No

2.  Does the type itself have its own specification for processing (e.g.
JSON, BSON, Thrift, Avro, Protobuf)? No

3.  Is the underlying encoding of the type already semantically supported
by a type?  Yes.  Two have been mentioned in this thread and I would also
support adding a new packed struct type, but it appears it isn't necessary
for this. Note that FixedSizeLists have some limitations with regard to
Parquet compatibility around nullability, and there might be a few other
sharp edges.

So if we use these criteria we would lean towards an extension type.
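
To make the two existing encodings mentioned in this thread concrete, here
is a minimal sketch in plain Python (standard library only, not the Arrow
API) of Struct[re, im] versus FixedSizeList[2] buffers. The sample values
and helper names are made up for illustration, and validity bitmaps are
left out entirely.

```python
from array import array

# Hypothetical sample data: three complex values as (re, im) pairs.
values = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Struct[re, im]: one child buffer per field.
struct_re = array("d", (re for re, _ in values))
struct_im = array("d", (im for _, im in values))

# FixedSizeList[2]: a single interleaved child buffer [x0, y0, x1, y1, ...].
fixed_list = array("d", (x for pair in values for x in pair))


def real_part_struct(re_buf):
    # Column-wise access: the real part is simply the existing child buffer.
    return re_buf


def real_part_fixed_list(buf):
    # The real parts must be copied out of the interleaved buffer.
    return buf[0::2]


def take_struct(re_buf, im_buf, indices):
    # Row selection needs one pass per child buffer.
    return (
        array("d", (re_buf[i] for i in indices)),
        array("d", (im_buf[i] for i in indices)),
    )


def take_fixed_list(buf, indices):
    # Row selection is a single pass; each row is one contiguous slice.
    out = array("d")
    for i in indices:
        out.extend(buf[2 * i : 2 * i + 2])
    return out
```

Either layout holds the same values; the difference is only in which
operations need a copy or an extra pass.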

We never converged on a standard for "canonical" extension types.  I would
propose it roughly be the same criteria as a first class type:
1.  Specification/document update PR that describes the representation
2.  Implementation showing working integration tests across two languages
(for canonical types I think this can be any 2 languages instead of C++ and
Java)
3.  Formal vote accepting the canonical type.

Thanks,
Micah



On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Isn't an array of complexes represented by what arrow already supports? In
> particular, I see at least two valid in-memory representations to use, that
> depend on what we are going to do with it:
>
> * Struct[re, im]
> * FixedList[2]
>
> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], in
> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>
> The first representation is useful for column-based operations (e.g. taking
> the real part in case 1 is trivial; requires a copy in the second case),
> the second representation is useful for row-based operations (e.g. "take"
> and "filter" require a single pass over buffer 1). Case 2 does not support
> Re and Im of different physical types (arguably an issue). Both cases
> support nullability of individual items or combined.
>
> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>
> The case for the complex interval type recently proposed [1] is more
> compelling to me because complex ops over intervals usually require all
> parts of the interval (and thus the "FixedList" representation is more
> compelling), but each part has a different type. I.e. it is like a
> "FixedTypedList[int32, int32, int64]", which we do not natively support.
>
> [1] https://github.com/apache/arrow/pull/10177
>
> Best,
> Jorge
>
>
>
> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
> neal.p.richard...@gmail.com>
> wrote:
>
> >  It might help this discussion and future discussions like it if we could
> > define how it is determined whether a type should be part of the Arrow
> > format, an extension type (and what does it mean to say there is a
> > "canonical" extension type), or just something that a language
> > implementation or downstream library builds for itself with metadata. I
> > feel like this has come up before but I don't recall a resolution.
> >
> > Examples might also help: are there examples of "canonical extension
> > types"?
> >
> > Neal
> >
> > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > >
> > >
> > > If this is a maintainable strategy for Canonical types it sounds good
> > > to me.
> > >
> > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > > >
> > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.p...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > While dedicated types are not strictly required, compute functions
> > > > > > would be much easier to add for a first-class dedicated complex
> > > > > > datatype rather than for an extension type.
> > > > > @pitrou
> > > > >
> > > > > This is perhaps a naive question (and admittedly, I'm not up to speed
> > > > > on my compute kernels) but why is this the case?  For example, if
> > > > > adding a complex addition kernel it seems we would be talking about...
> > > > >
> > > > > dest_scalar.real = scalar1.real + scalar2.real;
> > > > > dest_scalar.im = scalar1.im + scalar2.im;
> > > > >
> > > > > vs...
> > > > >
> > > > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > > > >
> > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > I'd be supportive of starting with this as a "canonical" extension
> > > > > > type so that all implementations are not expected to support complex
> > > > > > types; this would encourage us to build sufficient integration, e.g.
> > > > > > with NumPy, to get things working end-to-end with the on-wire
> > > > > > representation being an extension type. We could certainly choose to
> > > > > > treat the type as "first class" in the C++ library without it being
> > > > > > "top level" in the Type union in Flatbuffers.
> > > > > >
> > > > > > I agree that the use cases are more specialized, and the fact that we
> > > > > > haven't needed it until now (or at least, its absence suggests this)
> > > > > > shows that this is the case.
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <emkornfi...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that first-class types seem to be the way to go
> > > > > > > > and I'm happy to take this approach.
> > > > > > >
> > > > > > > I agree from an implementation effort it is simpler, but I'm still
> > > > > > > not convinced that we should be adding this as a first class type.
> > > > > > > As noted in the survey below it appears Complex numbers are not a
> > > > > > > core concept in many general purpose coding languages and it doesn't
> > > > > > > appear to be a common type in SQL systems either.
> > > > > > >
> > > > > > > The reason why I am being nit-picky here is I think that having a
> > > > > > > first class type indicates that it should eventually be supported by
> > > > > > > all reference implementations.  A "well known" extension type I
> > > > > > > think offers fewer guarantees, which makes it seem more suitable for
> > > > > > > niche types.
> > > > > > >
> > > > > > > > > I don't immediately see a Packed Struct type. Would this need to
> > > > > > > > > be implemented?
> > > > > > > > Not necessarily (*).  But before thinking about implementation,
> > > > > > > > this proposal must be accepted into the format.
> > > > > > >
> > > > > > >
> > > > > > > Yes, this is a type that has been proposed in the past and I think
> > > > > > > it handles a lot of types not yet in Arrow that have been requested
> > > > > > > (e.g. IP addresses, geo coordinates).
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <simon.perk...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <anto...@python.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On 09/06/2021 at 17:52, Micah Kornfield wrote:
> > > > > > > > > >
> > > > > > > > > > Adding a new first-class type in Arrow requires working
> > > > > > > > > > integration tests between C++ and Java libraries (once the idea
> > > > > > > > > > is informally agreed upon) and then a final vote for approval.
> > > > > > > > > > We haven't formalized extension types but I imagine a similar
> > > > > > > > > > cross language requirement would be agreed upon.
> > > > > > > > > > Implementation of computation wouldn't be required for adding a
> > > > > > > > > > new type.
> > > > > > > > > > Different language bindings have taken different approaches on
> > > > > > > > > > how much additional computational elements are packaged in them.
> > > > > > > > >
> > > > > > > > > While dedicated types are not strictly required, compute
> > > > > > > > > functions would be much easier to add for a first-class dedicated
> > > > > > > > > complex datatype rather than for an extension type.
> > > > > > > > >
> > > > > > > > > Since complex numbers are quite common in some domains, and
> > > > > > > > > since they are conceptually simple, IMHO it would make sense to
> > > > > > > > > add them to the native Arrow datatypes (at least COMPLEX64 and
> > > > > > > > > COMPLEX128).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that first-class types seem to be the way to go
> > > > > > > > and I'm happy to take this approach.
> > > > > > > > Regarding compute functions, it looks like the standard set of
> > > > > > > > scalar arithmetic and reduction functionality
> > > > > > > > is desirable for complex numbers:
> > > > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > > > Perhaps it would be better to split the addition of the Types and
> > > > > > > > the addition of Compute functionality into separate PRs?
> > > > > > > >
> > > > > > > > Regarding the process for managing this PR, it sounds like a
> > > > > > > > proposal must be voted on?
> > > > > > > > i.e. is this proposal still in this phase:
> > > > > > > > http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Simon
> > > > > > > >
