Hi Jacques,
I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals.
Agree 100%. We may already be in this situation with the DictionaryBatch
"isDelta" flag. I haven't checked the C++ in a while so I may be
mistaken, but I think JS is the only impl with support for interleaved
Dictionary/RecordBatches. It'd be good to put a process in place that
helps avoid this in the future.
I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio).
No argument here either, though I should mention that, with the exception
of Tensor messages, the JS version is also feature-complete from the
standpoint of the format.
It's still early in terms of adoption, but we've seen some interest from
the Vega, Jupyter, and Uber Deck.gl projects in either contributing to
or integrating with ArrowJS.
So while we're certainly not at the level of Spark or Pandas, we may be
poised for wider adoption, and I'd request we take the JS implementation
into account when making format changes. I'm happy to implement new
features and update the integration tests as necessary.
Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?
The thing that springs to mind is anything to do with 64-bit indexing,
as recently discussed in the sparse matrix thread. IIRC none of the JS
engines presently allow allocating buffers greater than 2GiB.
Limitations in JS shouldn't block other implementations from moving
ahead, but it would be good for the community to come to a consensus on
guidance or workarounds for JS interop when we are in that sort of
situation.
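
To make the 64-bit concern concrete, here is a minimal sketch of the kind of
guard a JS reader might apply when it meets a 64-bit length. Everything in it
is an assumption for illustration: the Int64Parts shape (two 32-bit halves),
the checkedLength name, and the limit constant, which simply encodes the 2GiB
figure mentioned above rather than a value taken from any engine or from
ArrowJS.

```typescript
// Hypothetical representation of a 64-bit length as two 32-bit halves.
interface Int64Parts {
  low: number;  // lower 32 bits
  high: number; // upper 32 bits
}

// Assumed ceiling of 2GiB - 1 bytes, taken from the figure quoted above.
// Real engine limits differ and should be checked, not trusted from here.
const ASSUMED_MAX_BUFFER_BYTES = 0x7fffffff;

// Hypothetical helper: returns the length as a plain number when it fits
// under the limit, and throws a RangeError otherwise.
function checkedLength(
  length: Int64Parts,
  limit: number = ASSUMED_MAX_BUFFER_BYTES
): number {
  // Reinterpret the low half as an unsigned 32-bit value.
  const low = length.low >>> 0;
  // Any non-zero high half means the value is negative or at least 4GiB.
  if (length.high !== 0 || low > limit) {
    throw new RangeError(
      "length (high=" + length.high + ", low=" + low +
      ") exceeds the assumed limit of " + limit + " bytes"
    );
  }
  return low;
}
```

A reader built this way fails loudly on data it cannot represent instead of
silently truncating the length, which seems like the minimum any interop
guidance would want to guarantee.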
Thanks,
Paul
On 3/17/19 6:07 PM, Jacques Nadeau wrote:
How about "at least two native implementations" instead of
"Java and C++"? Now, we have multiple native
implementations:
I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals. I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio). This is a compromise between setting a high barrier for
creation of new features and making sure that we have validated things
across impls.
Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?
There have been some changes related to serializing tensor metadata
that are clearly marked as experimental, and they also do not interact
with the columnar format.
Several things we've introduced over time have suffered from this problem:
alignment changes, dictionary encoding, union behavior, interval behavior,
tensors, unsigned integers, etc. In each case we failed to make sure we had
integration tests. I've been meaning to send this email for months, and a
couple of recently proposed changes made me feel we should discuss this
further.