I like this proposal. I think it strikes a pragmatic balance between preserving interoperability and allowing new ideas to be incorporated into the standard. Thank you for writing this up.

On 13/07/2023 10:22, Matt Topol wrote:
I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
<ja...@voltrondata.com.invalid> wrote:

Hello Everyone,

Thanks for this comprehensive but concise write-up, Neal! I think this
proposal is a good way to avoid both fragmentation of the Arrow ecosystem
and its obsolescence. In my opinion, obsolescence is the bigger of these
two problems because, as mentioned in the proposal, Arrow is already
(close to) being relegated to the sidelines in ecosystem-defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts, even
      if they wouldn't have been primary layouts initially (e.g. large
      list)
  * A new layout, if it is semantically equivalent to another, is
    considered an alternative layout
  * An alternative layout still has the same requirements for adoption
    (two implementations and a vote)
    * An implementation should not feel pressured to rush and implement
      the new layout. It would be good if they contribute in the
      discussion and consider the layout and vote if they feel it would be
      an acceptable design.
  * We can define and vote and approve as many canonical alternative
    layouts as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for
      algorithm X
  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative, however:
    * An Arrow implementation MUST first support the primary layout
    * An Arrow implementation MUST support conversion to/from the primary
      and canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
      alternative layout if it is explicitly asked for (e.g. schema
      inference should prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a canonical
    alternative), but in these votes we don't only consider the value
    (e.g. performance) of the layout but also the interoperability. In
    other words, a layout can only become a primary layout if there is
    significant evidence that most implementations plan to adopt it.
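
As a rough illustration of the MUST/MAY rules above (not part of the
proposal itself), here is a minimal Python sketch. Everything in it is
hypothetical: `PrimaryList`, `AltList`, `to_primary`, `to_alternative` and
`read_column` are made-up names, not Arrow APIs, and the offsets-plus-sizes
layout only stands in for "some semantically equivalent alternative". It
shows conversion in both directions and an API that hands out the
alternative layout only when the caller explicitly asks for it.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class PrimaryList:
    """Simplified primary layout: n + 1 increasing offsets into values."""
    offsets: List[int]
    values: List[int]


@dataclass
class AltList:
    """Hypothetical alternative layout: one (offset, size) pair per element."""
    offsets: List[int]
    sizes: List[int]
    values: List[int]


def to_alternative(arr: PrimaryList) -> AltList:
    """Primary -> alternative: derive each size from adjacent offsets."""
    starts = arr.offsets[:-1]
    sizes = [end - start for start, end in zip(arr.offsets, arr.offsets[1:])]
    return AltList(starts, sizes, list(arr.values))


def to_primary(arr: AltList) -> PrimaryList:
    """Alternative -> primary: re-pack values contiguously, in element order."""
    offsets, values = [0], []
    for start, size in zip(arr.offsets, arr.sizes):
        values.extend(arr.values[start:start + size])
        offsets.append(len(values))
    return PrimaryList(offsets, values)


def to_pylist(arr: Union[PrimaryList, AltList]) -> List[List[int]]:
    """Logical contents, independent of the physical layout."""
    if isinstance(arr, AltList):
        arr = to_primary(arr)
    return [arr.values[s:e] for s, e in zip(arr.offsets, arr.offsets[1:])]


def read_column(stored: Union[PrimaryList, AltList],
                prefer_alternative: bool = False):
    """Return the primary layout unless the caller explicitly opts in."""
    if prefer_alternative:
        return stored if isinstance(stored, AltList) else to_alternative(stored)
    return stored if isinstance(stored, PrimaryList) else to_primary(stored)
```

With this sketch, a consumer that knows nothing about the alternative
layout calls `read_column(col)` and always receives a `PrimaryList`, while
a consumer that wants to avoid the conversion passes
`prefer_alternative=True`.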


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
implementations to support them.

These discussions, both about several proposed new types and about this
layout proposal, get to the core of Arrow, as is well expressed in the
comments on the previous thread by Raphael [3] and Pedro [4]. Raphael asks:
"what matters to people more, interoperability or best-in-class
performance?" And Pedro notes that the overhead of converting these
not-yet-Arrow types to the Arrow C ABI is high enough that they've
considered abandoning Arrow as their interchange format. So: on the one
hand, we're kinda choosing which quality we're optimizing for, but on the
other, interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to the specification, but less than there would be if three
new types were added, and still meets the needs of projects like DuckDB,
Velox, and Gluten and gets them fully Arrow native.

Neal


[1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
[2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
[3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
[4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
