Thanks Micah. Those criteria seem reasonable (and that discussion was recent enough that my memory of it should have been sharper). I've created https://issues.apache.org/jira/browse/ARROW-13055 so that we can document this decision. IMO we don't need a vote on these criteria--seems like there was consensus before.
Neal On Thu, Jun 10, 2021 at 9:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > > > > It might help this discussion and future discussions like it if we could > > define how it is determined whether a type should be part of the Arrow > > format, an extension type (and what does it mean to say there is a > > "canonical" extension type), or just something that a language > > implementation or downstream library builds for itself with metadata. I > > feel like this has come up before but I don't recall a resolution. > > > There seemed to be consensus, but I guess we never formally voted on the > decision points here: > > https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E > > Applying the criteria to complex types: > 1. Is the type a new parameterization of an existing type? No > > 2. Does the type itself have its own specification for processing (e.g. > JSON, BSON, Thrift, Avro, Protobuf)? No > > 3. Is the underlying encoding of the type already semantically supported > by a type? Yes. Two have been mentioned in this thread and I would also > support adding a new packed struct type, but it appears isn't necessary for > this. Note that FixedSizeLists have some limitations in regards to parquet > compatibility around nullability, there might be a few other sharp edges. > > So if we use this criteria we would lean towards an extension type. > > We never converged on a standard for "canonical" extension types. I would > propose it roughly be the same criteria as a first class type: > 1. Specification/document update PR that describes the representation > 2. Implementation showing working integration tests across two languages > (for canonical types I think this can be any 2 languages instead of C++ and > Java) > 3. Formal vote accepting the canonical type. > > Thanks, > Micah > > > > On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Isn't an array of complexes represented by what arrow already supports? > In > > particular, I see at least two valid in-memory representations to use, > that > > depend on what we are going to do with it: > > > > * Struct[re, im] > > * FixedList[2] > > > > In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], > in > > the second case we have 1 buffer, [x0, y0, x1, y1, ...]. > > > > The first representation is useful for column-based operations (e.g. > taking > > the real part in case 1 is trivial; requires a copy in the second case), > > the second representation is useful for row-base operations (e.g. "take" > > and "filter" require a single pass over buffer 1). Case 2 does not > support > > Re and Im of different physical types (arguably an issue). Both cases > > support nullability of individual items or combined. > > > > What I conclude is that this does not seem to be a problem about a base > > in-memory representation, but rather on whether we agree on a > > representation that justifies adding associated metadata to the spec. > > > > The case for the complex interval type recently proposed [1] is more > > compelling to me because a complex ops over intervals usually required > all > > parts of the interval (and thus the "FixedList" representation is more > > compelling), but each part has a different type. I.e. it is like a > > "FixedTypedList[int32, int32, int64]", which we do not natively support. > > > > [1] https://github.com/apache/arrow/pull/10177 > > > > Best, > > Jorge > > > > > > > > On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson < > > neal.p.richard...@gmail.com> > > wrote: > > > > > It might help this discussion and future discussions like it if we > could > > > define how it is determined whether a type should be part of the Arrow > > > format, an extension type (and what does it mean to say there is a > > > "canonical" extension type), or just something that a language > > > implementation or downstream library builds for itself with metadata. I > > > feel like this has come up before but I don't recall a resolution. > > > > > > Examples might also help: are there examples of "canonical extension > > > types"? > > > > > > Neal > > > > > > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com > > > > > wrote: > > > > > > > > > > > > > My understanding is that it means having COMPLEX as an entry in the > > > > > arrow/type_fwd.h Type enum. I agree this would make implementation > > > > > work in the C++ library much more straightforward. > > > > > > > > One idea I proposed would be to do that, and implement the > > > > > serialization of the complex metadata using Extension types. > > > > > > > > > > > > If this is a maintainable strategy for Canonical types it sounds good > > to > > > > me. > > > > > > > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > My understanding is that it means having COMPLEX as an entry in the > > > > > arrow/type_fwd.h Type enum. I agree this would make implementation > > > > > work in the C++ library much more straightforward. > > > > > > > > > > One idea I proposed would be to do that, and implement the > > > > > serialization of the complex metadata using Extension types. > > > > > > > > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.p...@gmail.com > > > > > > wrote: > > > > > > > > > > > > > While dedicated types are not strictly required, compute > > functions > > > > > would > > > > > > > be much easier to add for a first-class dedicated complex > > datatype > > > > > > > rather than for an extension type. > > > > > > @pitrou > > > > > > > > > > > > This is perhaps a naive question (and admittedly, I'm not up to > > speed > > > > > > on my compute kernels) but why is this the case? For example, if > > > > > > adding a complex addition kernel it seems we would be talking > > > about... > > > > > > > > > > > > dest_scalar.real = scalar1.real + scalar2.real; > > > > > > dest_scalar.im = scalar1.im + scalar2.im; > > > > > > > > > > > > vs... > > > > > > > > > > > > dest_scalar[0] = scalar1[0] + scalar2[0]; > > > > > > dest_scalar[1] = scalar1[1] + scalar2[1]; > > > > > > > > > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney < > wesmck...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > I'd be supportive of starting with this as a "canonical" > > extension > > > > > > > type so that all implementations are not expected to support > > > complex > > > > > > > types — this would encourage us to build sufficient integration > > > e.g. > > > > > > > with NumPy to get things working end-to-end with the on-wire > > > > > > > representation being an extension type. We could certainly > choose > > > to > > > > > > > treat the type as "first class" in the C++ library without it > > being > > > > > > > "top level" in the Type union in Flatbuffers. > > > > > > > > > > > > > > I agree that the use cases are more specialized, and the fact > > that > > > we > > > > > > > haven't needed it until now (or at least, its absence suggests > > > this) > > > > > > > shows that this is the case. > > > > > > > > > > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield < > > > > emkornfi...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > I'm convinced now that first-class types seem to be the > way > > to > > > > go > > > > > and I'm > > > > > > > > > happy to take this approach. > > > > > > > > > > > > > > > > I agree from an implementation effort it is simpler, but I'm > > > still > > > > > not > > > > > > > > convinced that we should be adding this as a first class > type. > > > As > > > > > noted in > > > > > > > > the survey below it appears Complex numbers are not a core > > > concept > > > > > in many > > > > > > > > general purpose coding languages and it doesn't appear to be > a > > > > > common type > > > > > > > > in SQL systems either. > > > > > > > > > > > > > > > > The reason why I am being nit-picky here is I think that > > having a > > > > > first > > > > > > > > class type indicates that it should eventually be supported > by > > > all > > > > > > > > reference implementations. An "well known" extension type I > > > think > > > > > offers > > > > > > > > less guarantees which makes it seem more suitable for niche > > > types. > > > > > > > > > > > > > > > > > I don't immediately see a Packed Struct type. Would this > need > > > to > > > > be > > > > > > > > > > implemented? > > > > > > > > > Not necessarily (*). But before thinking about > > implementation, > > > > > this > > > > > > > > > proposal must be accepted into the format. > > > > > > > > > > > > > > > > > > > > > > > > Yes, this is a type that has been proposed in the past and I > > > think > > > > > handles > > > > > > > > a lot of types not yet in Arrow but have been requested > (e.g. > > IP > > > > > > > > Addresses, Geo coordinates), etc. > > > > > > > > > > > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins < > > > > > simon.perk...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou < > > > > anto...@python.org> > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit : > > > > > > > > > > > > > > > > > > > > > > Adding a new first-class type in Arrow requires working > > > > > integration > > > > > > > > > tests > > > > > > > > > > > between C++ and Java libraries (once the idea is > > informally > > > > > agreed > > > > > > > > > upon) > > > > > > > > > > > and then a final vote for approval. We haven't > > formalized > > > > > extension > > > > > > > > > > types > > > > > > > > > > > but I imagine a similar cross language requirement > would > > be > > > > > agreed > > > > > > > > > upon. > > > > > > > > > > > Implementation of computation wouldn't be required for > > > adding > > > > > a new > > > > > > > > > type. > > > > > > > > > > > Different language bindings have taken different > > approaches > > > > on > > > > > how much > > > > > > > > > > > additional computational elements are packaged in them. > > > > > > > > > > > > > > > > > > > > While dedicated types are not strictly required, compute > > > > > functions would > > > > > > > > > > be much easier to add for a first-class dedicated complex > > > > > datatype > > > > > > > > > > rather than for an extension type. > > > > > > > > > > > > > > > > > > > > Since complex numbers are quite common in some domains, > and > > > > > since they > > > > > > > > > > are conceptually simply, IMHO it would make sense to add > > them > > > > to > > > > > the > > > > > > > > > > native Arrow datatypes (at least COMPLEX64 and > COMPLEX128). > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm convinced now that first-class types seem to be the > way > > to > > > > go > > > > > and I'm > > > > > > > > > happy to take this approach. > > > > > > > > > Regarding compute functions, it looks like the standard set > > of > > > > > scalar > > > > > > > > > arithmetic and reduction functionality > > > > > > > > > is desirable for complex numbers: > > > > > > > > > https://arrow.apache.org/docs/cpp/compute.html# > > > > > > > > > Perhaps it would be better to split the addition of the > Types > > > and > > > > > addition > > > > > > > > > Compute functionality into separate PRs? > > > > > > > > > > > > > > > > > > Regarding the process for managing this PR, it sounds like > a > > > > > proposal must > > > > > > > > > be voted on? > > > > > > > > > i.e. is this proposal still in this phase > > > > > > > > > > > > > > > > > > > > > > > > http://arrow.apache.org/docs/developers/contributing.html#before-starting > > > > > > > > > Regards > > > > > > > > > > > > > > > > > > Simon > > > > > > > > > > > > > > > > > > > > > > > >