Re: [DISCUSS] Format changes: process and requirements

Paul Taylor Sun, 17 Mar 2019 22:52:07 -0700

Hi Jacques,

I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals.

Agree 100%. We may already be in this situation with the DictionaryBatch "isDelta" flag. I haven't checked the C++ in a while so I may be mistaken, but I think JS is the only impl with support for interleaved Dictionary/RecordBatches. It'd be good to put a process in place that helps avoid this in the future.

I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio).

No argument here either, though I should mention with the exception of Tensor messages the JS version is also feature-complete from the standpoint of the format.

It's still early in terms of adoption, but we've seen some interest from the Vega, Jupyter, and Uber Deck.gl projects in either contributing to or integrating with ArrowJS.

So while we're certainly not at the level of Spark or Pandas, we may be poised for wider adoption, and I'd request we take the JS implementation into account when making format changes. I'm happy to implement new features and update the integration tests as necessary.

Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?

The thing that springs to mind is anything to do with 64-bit indexing, as recently discussed in the sparse matrix thread. IIRC none of the JS engines presently allow allocating buffers greater than 2GiB. Limitations in JS shouldn't block other implementations from moving ahead, but it would be good for the community to come to a consensus on guidance or workarounds for JS interop when we are in that sort of situation.
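
To make the 64-bit indexing concern concrete, here is a rough sketch of the kind of guard a JS reader might need. It is only an illustration under stated assumptions: the helper name toIndexableOffset and the constant ASSUMED_MAX_BUFFER_BYTES are hypothetical and not part of ArrowJS, the 64-bit offset is assumed to arrive as two 32-bit words, and the cap is simply the 2GiB figure mentioned above.

```typescript
// Hypothetical helper, not an ArrowJS API. The cap below is an assumption
// taken from the ~2GiB per-buffer limit discussed in this thread.
const ASSUMED_MAX_BUFFER_BYTES = 0x7fffffff;

// Takes a 64-bit offset split into two 32-bit words (low, high) and returns
// it as a plain number if it can index a single buffer under the assumed
// cap, or null if it cannot (negative, or too large).
function toIndexableOffset(low: number, high: number): number | null {
  const lo = low >>> 0; // reinterpret the low word as unsigned 32-bit
  const hi = high | 0;  // reinterpret the high word as signed 32-bit
  if (hi !== 0 || lo > ASSUMED_MAX_BUFFER_BYTES) {
    return null;        // caller must chunk, reject, or otherwise work around
  }
  return lo;
}
```

A reader could call this on each incoming offset and fall back to whatever workaround the community agrees on (chunking, rejecting the message, etc.) when it returns null.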


Thanks,

Paul


On 3/17/19 6:07 PM, Jacques Nadeau wrote:

How about "at least two native implementations" instead of
"Java and C++"? Now, we have multiple native
implementations:

I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals. I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio). This is a compromise between setting a high barrier for
creation of new features and making sure that we have validated things
across impls.

Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?
There have been some changes related to serializing tensor metadata
that are clearly marked as experimental, and they also do not interact
with the columnar format.

There are several things we've introduced over time that suffered this
problem. Alignment changes, dictionary encoding, union behavior, interval
behavior, tensors, unsigned integers, etc. that we've failed to make
sure we have integration tests for. I've meant to send this email for
months but saw a couple of recent proposed changes which made me feel like
we should discuss further.
