Hello Everyone,

Thanks for this comprehensive but concise write-up, Neal! I think this
proposal is a good way to avoid both fragmentation of the Arrow ecosystem
and its obsolescence. In my opinion, obsolescence is the bigger of the two
problems, since (as mentioned in the proposal) Arrow is already close to
being relegated to the sidelines in ecosystem-defining projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> As was previously raised in [1] and surfaced again in [2], there is a
> proposal for representing alternative layouts. The intent, as I understand
> it, is to be able to support memory layouts that some (but perhaps not all)
> applications of Arrow find valuable, so that these nearly Arrow systems can
> be fully Arrow-native.
>
> I wanted to start a more focused discussion on it because I think it's
> worth being considered on its own merits, but I also think this gets to the
> core of what the Arrow project is and should be, and I don't want us to
> lose sight of that.
>
> To restate the proposal from [1]:
>
>  * There are one or more primary layouts
>    * Existing layouts are automatically considered primary layouts, even
>      if they wouldn't have been primary layouts initially (e.g. large list)
>  * A new layout, if it is semantically equivalent to another, is
>    considered an alternative layout
>  * An alternative layout still has the same requirements for adoption
>    (two implementations and a vote)
>    * An implementation should not feel pressured to rush and implement
>      the new layout. It would be good if they contribute in the
>      discussion and consider the layout and vote if they feel it would be
>      an acceptable design.
>  * We can define and vote and approve as many canonical alternative
>    layouts as we want:
>    * A canonical alternative layout should, at a minimum, have some
>      reasonable justification, such as improved performance for
>      algorithm X
>  * Arrow implementations MUST support the primary layouts
>  * An Arrow implementation MAY support a canonical alternative, however:
>    * An Arrow implementation MUST first support the primary layout
>    * An Arrow implementation MUST support conversion to/from the primary
>      and canonical layout
>    * An Arrow implementation's APIs MUST only provide data in the
>      alternative layout if it is explicitly asked for (e.g. schema
>      inference should prefer the primary layout).
>  * We can still vote for new primary layouts (e.g. promoting a canonical
>    alternative) but, in these votes we don't only consider the value
>    (e.g. performance) of the layout but also the interoperability. In
>    other words, a layout can only become a primary layout if there is
>    significant evidence that most implementations plan to adopt it.
>
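
To make the MUST/MAY rules above concrete, here is a minimal sketch in plain
Python. It is only an illustration under stated assumptions, not an Arrow API:
LayoutRegistry, the "offsets" and "views" layout names, and both converter
functions are hypothetical, and the toy string layouts are far simpler than
anything in the real specification. The sketch registers an alternative only
together with converters in both directions, and hands out the alternative
layout only when the caller asks for it by name.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple

Converter = Callable[[Any], Any]


@dataclass
class LayoutRegistry:
    """Hypothetical registry for one logical type (not part of any Arrow library)."""
    primary: str
    # alternative layout name -> (to_primary, from_primary)
    alternatives: Dict[str, Tuple[Converter, Converter]] = field(default_factory=dict)

    def register_alternative(self, name: str, to_primary: Converter,
                             from_primary: Converter) -> None:
        # An alternative is accepted only with conversion in both directions.
        self.alternatives[name] = (to_primary, from_primary)

    def export(self, data: Any, layout: str, requested: Optional[str] = None):
        """Return (layout_name, data); default to the primary layout."""
        target = requested or self.primary
        if target == layout:
            return layout, data
        # Route every conversion through the primary layout.
        primary_data = data if layout == self.primary else self.alternatives[layout][0](data)
        if target == self.primary:
            return target, primary_data
        return target, self.alternatives[target][1](primary_data)


def views_to_offsets(data):
    """Toy converter: (start, length) views over a buffer -> offsets + packed buffer."""
    pairs, buf = data
    offsets, parts = [0], []
    for start, length in pairs:
        parts.append(buf[start:start + length])
        offsets.append(offsets[-1] + length)
    return offsets, "".join(parts)


def offsets_to_views(data):
    """Toy converter: offsets + packed buffer -> (start, length) views."""
    offsets, buf = data
    pairs = [(offsets[i], offsets[i + 1] - offsets[i]) for i in range(len(offsets) - 1)]
    return pairs, buf


reg = LayoutRegistry(primary="offsets")
reg.register_alternative("views", views_to_offsets, offsets_to_views)

views = ([(3, 2), (0, 3)], "abcde")                       # the strings "de" and "abc"
default_export = reg.export(views, "views")                # converted to primary
explicit_export = reg.export(views, "views", requested="views")  # kept as-is
```

In this sketch a consumer that knows nothing about "views" always receives
the primary layout, while a consumer that does understand it can opt in and
skip the conversion.
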
>
> To summarize some of the arguments against the proposal from the previous
> threads, there are concerns about increasing the complexity of the Arrow
> specification and the cost/burden of updating all of the Arrow
> implementations to support them.
>
> Where these discussions, both about several proposed new types and this
> layout proposal, get to the core of Arrow is well expressed in the comments
> on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
> matters to people more, interoperability or best-in-class performance?" And
> Pedro notes that the overhead of converting these not-yet-Arrow types to
> the Arrow C ABI is high enough that they've considered abandoning
> Arrow as their interchange format. So: on the one hand, we're kinda
> choosing which quality we're optimizing for, but on the other,
> interoperability and performance are dependent on each other.
>
> What I see that we're trying to do here is find a way to expand the Arrow
> specification just enough so that Arrow becomes or remains the in-memory
> standard everywhere, but not so much that it creates too much complexity or
> burden to implement. Expand too much and you get a fragmented ecosystem
> where everyone is writing subsets of the Arrow standard and so nothing is
> fully compatible and the whole premise is undermined. But expand too little
> and projects will abandon the standard and we've also failed.
>
> I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
> and see if this helps us reason about the various proposals on the table. I
> wonder if the alternative layout proposal is the happy medium that adds
> some complexity to the specification, but less than there would be if three
> new types were added, and still meets the needs of projects like DuckDB,
> Velox, and Gluten and gets them fully Arrow native.
>
> Neal
>
>
> [1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> [2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> [3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> [4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
>
