Re: [DISCUSS] Canonical alternative layout proposal

Ian Cook Thu, 13 Jul 2023 08:59:08 -0700

Thank you, Weston, for proposing this solution, and Neal, for describing
its context and implications. I agree with the other replies here: this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.


Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.

However, I think there is still some ambiguity about exactly how an
Arrow implementation on one side of an exchange (consuming or producing
data) would negotiate with the Arrow implementation or other component
on the other side to determine whether an alternative layout is
supported. This was discussed briefly in [5], but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.
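
To make the question concrete, here is a minimal Python sketch of one
way such a negotiation could work, assuming the consumer advertises the
alternative layouts it can read and the producer falls back to the
primary layout otherwise. Everything in it is hypothetical: the names
Producer, choose_layout, and PRIMARY, and the layout label
"string_view", are for illustration only and do not correspond to any
existing Arrow API or to anything specified in the proposal.

```python
from dataclasses import dataclass

PRIMARY = "primary"  # hypothetical label for the primary layout


@dataclass(frozen=True)
class Producer:
    # Hypothetical: the layout this producer holds its data in.
    native_layout: str = PRIMARY

    def choose_layout(self, accepted_alternatives: frozenset = frozenset()) -> str:
        """Pick the layout to send, given what the consumer asked for."""
        # Send the alternative layout only if the consumer explicitly
        # listed it; otherwise convert to the primary layout.
        if (self.native_layout != PRIMARY
                and self.native_layout in accepted_alternatives):
            return self.native_layout
        return PRIMARY


producer = Producer(native_layout="string_view")
# Consumer advertised nothing, so the producer must convert.
layout_for_plain_consumer = producer.choose_layout()
# Consumer explicitly opted in to the alternative layout.
layout_for_opted_in_consumer = producer.choose_layout(frozenset({"string_view"}))
```

The fallback mirrors the rule that an alternative layout is provided
only when explicitly asked for. How the consumer's list would actually
be carried in the C data interface, IPC, or Flight is exactly the open
question.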

Ian

[5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2


On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
>
> I like this proposal, I think it strikes a pragmatic balance between
> preserving interoperability whilst still allowing new ideas to be
> incorporated into the standard. Thank you for writing this up.
>
> On 13/07/2023 10:22, Matt Topol wrote:
> > I don't have much to add but I do want to second Jacob's comments. I agree
> > that this is a good way to avoid the fragmentation while keeping Arrow
> > relevant, and likely something we need to do so that we can ensure Arrow
> > remains the way to do this data integration and interoperability.
> >
> > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > <ja...@voltrondata.com.invalid> wrote:
> >
> >> Hello Everyone,
> >>
> > >> Thanks for this comprehensive but concise write-up, Neal! I think this
> > >> proposal is a good way to avoid both fragmentation of the Arrow ecosystem
> > >> and its obsolescence. In my opinion, of these two problems, obsolescence
> > >> is the bigger issue because, as mentioned in the proposal, Arrow is
> > >> already (close to) being relegated to the sidelines in ecosystem-defining
> > >> projects.
> >>
> >> Jacob
> >>
> >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> >> neal.p.richard...@gmail.com> wrote:
> >>
> >>> Hi all,
> > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > >>> proposal for representing alternative layouts. The intent, as I
> > >>> understand it, is to be able to support memory layouts that some (but
> > >>> perhaps not all) applications of Arrow find valuable, so that these
> > >>> nearly-Arrow systems can be fully Arrow-native.
> >>>
> > >>> I wanted to start a more focused discussion on it because I think it's
> > >>> worth being considered on its own merits, but I also think this gets to
> > >>> the core of what the Arrow project is and should be, and I don't want us
> > >>> to lose sight of that.
> >>>
> >>> To restate the proposal from [1]:
> >>>
> > >>>   * There are one or more primary layouts
> > >>>     * Existing layouts are automatically considered primary layouts,
> > >>>       even if they wouldn't have been primary layouts initially
> > >>>       (e.g. large list)
> > >>>   * A new layout, if it is semantically equivalent to another, is
> > >>>     considered an alternative layout
> > >>>   * An alternative layout still has the same requirements for adoption
> > >>>     (two implementations and a vote)
> > >>>     * An implementation should not feel pressured to rush and implement
> > >>>       the new layout. It would be good if they contribute in the
> > >>>       discussion and consider the layout and vote if they feel it would
> > >>>       be an acceptable design.
> > >>>   * We can define and vote and approve as many canonical alternative
> > >>>     layouts as we want:
> > >>>     * A canonical alternative layout should, at a minimum, have some
> > >>>       reasonable justification, such as improved performance for
> > >>>       algorithm X
> > >>>   * Arrow implementations MUST support the primary layouts
> > >>>   * An Arrow implementation MAY support a canonical alternative, however:
> > >>>     * An Arrow implementation MUST first support the primary layout
> > >>>     * An Arrow implementation MUST support conversion to/from the primary
> > >>>       and canonical layout
> > >>>     * An Arrow implementation's APIs MUST only provide data in the
> > >>>       alternative layout if it is explicitly asked for (e.g. schema
> > >>>       inference should prefer the primary layout).
> > >>>   * We can still vote for new primary layouts (e.g. promoting a
> > >>>     canonical alternative), but in these votes we don't only consider
> > >>>     the value (e.g. performance) of the layout but also the
> > >>>     interoperability. In other words, a layout can only become a
> > >>>     primary layout if there is significant evidence that most
> > >>>     implementations plan to adopt it.
> >>>
> >>>
> > >>> To summarize some of the arguments against the proposal from the previous
> > >>> threads, there are concerns about increasing the complexity of the Arrow
> > >>> specification and the cost/burden of updating all of the Arrow
> > >>> implementations to support them.
> >>>
> > >>> Where these discussions, both about several proposed new types and this
> > >>> layout proposal, get to the core of Arrow is well expressed in the
> > >>> comments on the previous thread by Raphael [3] and Pedro [4]. Raphael
> > >>> asks: "what matters to people more, interoperability or best-in-class
> > >>> performance?" And Pedro notes that the overhead of converting these
> > >>> not-yet-Arrow types to the Arrow C ABI is high enough that they've
> > >>> considered abandoning Arrow as their interchange format. So: on the one
> > >>> hand, we're kinda choosing which quality we're optimizing for, but on
> > >>> the other, interoperability and performance are dependent on each other.
> >>>
> > >>> What I see that we're trying to do here is find a way to expand the Arrow
> > >>> specification just enough so that Arrow becomes or remains the in-memory
> > >>> standard everywhere, but not so much that it creates too much complexity
> > >>> or burden to implement. Expand too much and you get a fragmented
> > >>> ecosystem where everyone is writing subsets of the Arrow standard and so
> > >>> nothing is fully compatible and the whole premise is undermined. But
> > >>> expand too little and projects will abandon the standard and we've also
> > >>> failed.
> >>>
> > >>> I don't have a tidy answer, but I wanted to acknowledge the bigger
> > >>> issues, and see if this helps us reason about the various proposals on
> > >>> the table. I wonder if the alternative layout proposal is the happy
> > >>> medium that adds some complexity to the specification, but less than
> > >>> there would be if three new types were added, and still meets the needs
> > >>> of projects like DuckDB, Velox, and Gluten and gets them fully Arrow
> > >>> native.
> >>>
> >>> Neal
> >>>
> >>>
> >>> [1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> >>> [2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> >>> [3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> >>> [4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
> >>>
