Hello,

I'm trying to reason about the advantages and drawbacks of this proposal, but it seems to me that it lacks definition.

I would welcome a draft PR showcasing the changes necessary in the IPC format definition, and in the C Data Interface specification (no need to actually implement them for now :-)).


As it is, it seems that this proposal would allow us to switch from:

"""We'd like to add a more efficient physical data representation, so we'll introduce a new Arrow data type. Implementations may or may not support it, but we will progressively try to bring reference implementations to parity.""" (1)

to:

"""We'd like to add a more efficient physical data representation, so we'll introduce a new alternative layout for an existing Arrow data type. Implementations may or may not support it, but we will progressively try to bring reference implementations to parity.""" (2)

The expected advantage of (2) over (1) seems to be mainly a difference in how new format features are communicated: there are mainline features, and there are experimental / provisional features.

Regards

Antoine.



On 13/07/2023 at 00:01, Neal Richardson wrote:
Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts, even if
      they wouldn't have been primary layouts initially (e.g. large list)
  * A new layout, if it is semantically equivalent to another, is considered
    an alternative layout
  * An alternative layout still has the same requirements for adoption (two
    implementations and a vote)
    * An implementation should not feel pressured to rush and implement the
      new layout. It would be good if they contribute in the discussion and
      consider the layout and vote if they feel it would be an acceptable
      design.
  * We can define and vote and approve as many canonical alternative layouts
    as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for algorithm X
  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative, however:
    * An Arrow implementation MUST first support the primary layout
    * An Arrow implementation MUST support conversion to/from the primary and
      canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
      alternative layout if it is explicitly asked for (e.g. schema inference
      should prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a canonical
    alternative) but, in these votes we don't only consider the value (e.g.
    performance) of the layout but also the interoperability. In other words,
    a layout can only become a primary layout if there is significant
    evidence that most implementations plan to adopt it.
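
To make the MUST/MAY rules above concrete, here is a minimal sketch in plain
Python. It is not taken from any Arrow implementation or from the proposal
text: the class names, the `export` function, and the two toy string layouts
(offsets into a buffer as the "primary", start/length views as the
"alternative") are all assumptions chosen only to illustrate the rules that
conversion must exist in both directions and that the alternative layout is
only handed out when explicitly requested.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# All names below are hypothetical and exist only for illustration.


@dataclass
class PrimaryStrings:
    """Stand-in 'primary' layout: offsets into one contiguous data buffer."""
    offsets: List[int]
    data: bytes


@dataclass
class ViewStrings:
    """Stand-in 'alternative' layout: (start, length) views into a buffer."""
    views: List[Tuple[int, int]]
    data: bytes


def to_alternative(arr: PrimaryStrings) -> ViewStrings:
    # Each pair of consecutive offsets becomes one (start, length) view.
    views = [(s, e - s) for s, e in zip(arr.offsets, arr.offsets[1:])]
    return ViewStrings(views, arr.data)


def to_primary(arr: ViewStrings) -> PrimaryStrings:
    # Views may point anywhere in the buffer, so rebuild a contiguous one.
    offsets, chunks = [0], []
    for start, length in arr.views:
        chunks.append(arr.data[start:start + length])
        offsets.append(offsets[-1] + length)
    return PrimaryStrings(offsets, b"".join(chunks))


def export(arr, layout: Optional[str] = None):
    """Return the primary layout unless the alternative is explicitly requested."""
    if layout == "alternative":
        return arr if isinstance(arr, ViewStrings) else to_alternative(arr)
    if layout in (None, "primary"):
        return arr if isinstance(arr, PrimaryStrings) else to_primary(arr)
    raise ValueError(f"unknown layout: {layout!r}")
```

In this sketch a consumer that never passes `layout="alternative"` only ever
sees `PrimaryStrings`, which is the interoperability guarantee the proposal
appears to be aiming for.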


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that the overhead of converting these not-yet-Arrow types to the
Arrow C ABI is high enough that they've considered abandoning Arrow as their
interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to the specification, but less than there would be if three
new types were added, and still meets the needs of projects like DuckDB,
Velox, and Gluten and gets them fully Arrow native.

Neal


[1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
[2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
[3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
[4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
