Thank you Neal for writing this summary and everyone whose thoughts went
into the discussions -- I think the proposal, as summarized, offers a great
path forward by allowing the various Arrow communities to specialize when
advantageous but remain compatible.


On Thu, Jul 13, 2023 at 11:59 AM Ian Cook <ianmc...@apache.org> wrote:

> Thank you Weston for proposing this solution and Neal for describing
> its context and implications. I agree with the other replies here—this
> seems like an elegant solution to a growing need that could, if left
> unaddressed, increase the fragmentation of the ecosystem and reduce
> the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside
> of Arrow or outside of Arrow is up to us. I think we all would like to
> see it happen inside of Arrow. This proposal allows for that, while
> striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an
> Arrow implementation that is consuming/producing data would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported. This was discussed briefly in [5] but I am interested to
> see how this negotiation would be implemented in practice in the C
> data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > I like this proposal, I think it strikes a pragmatic balance between
> > preserving interoperability whilst still allowing new ideas to be
> > incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > > I don't have much to add but I do want to second Jacob's comments. I
> > > > agree that this is a good way to avoid the fragmentation while keeping
> > > > Arrow relevant, and likely something we need to do so that we can ensure
> > > > Arrow remains the way to do this data integration and interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > <ja...@voltrondata.com.invalid> wrote:
> > >
> > >> Hello Everyone,
> > >>
> > > >> Thanks for this comprehensive but concise write-up, Neal! I think this
> > > >> proposal is a good way to avoid both fragmentation of the Arrow
> > > >> ecosystem and its obsolescence. In my opinion, of these two problems,
> > > >> obsolescence is the bigger issue, since (as mentioned in the proposal)
> > > >> Arrow is already (close to) being relegated to the sidelines in
> > > >> ecosystem-defining projects.
> > >>
> > >> Jacob
> > >>
> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > >> neal.p.richard...@gmail.com> wrote:
> > >>
> > >>> Hi all,
> > > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > > >>> proposal for representing alternative layouts. The intent, as I
> > > >>> understand it, is to be able to support memory layouts that some (but
> > > >>> perhaps not all) applications of Arrow find valuable, so that these
> > > >>> nearly Arrow systems can be fully Arrow-native.
> > >>>
> > > >>> I wanted to start a more focused discussion on it because I think it's
> > > >>> worth being considered on its own merits, but I also think this gets to
> > > >>> the core of what the Arrow project is and should be, and I don't want us
> > > >>> to lose sight of that.
> > >>>
> > >>> To restate the proposal from [1]:
> > >>>
> > > >>>   * There are one or more primary layouts
> > > >>>     * Existing layouts are automatically considered primary layouts,
> > > >>>       even if they wouldn't have been primary layouts initially
> > > >>>       (e.g. large list)
> > > >>>   * A new layout, if it is semantically equivalent to another, is
> > > >>>     considered an alternative layout
> > > >>>   * An alternative layout still has the same requirements for adoption
> > > >>>     (two implementations and a vote)
> > > >>>     * An implementation should not feel pressured to rush and implement
> > > >>>       the new layout. It would be good if they contribute in the
> > > >>>       discussion and consider the layout and vote if they feel it would
> > > >>>       be an acceptable design.
> > > >>>   * We can define and vote and approve as many canonical alternative
> > > >>>     layouts as we want:
> > > >>>     * A canonical alternative layout should, at a minimum, have some
> > > >>>       reasonable justification, such as improved performance for
> > > >>>       algorithm X
> > > >>>   * Arrow implementations MUST support the primary layouts
> > > >>>   * An Arrow implementation MAY support a canonical alternative, however:
> > > >>>     * An Arrow implementation MUST first support the primary layout
> > > >>>     * An Arrow implementation MUST support conversion to/from the primary
> > > >>>       and canonical layout
> > > >>>     * An Arrow implementation's APIs MUST only provide data in the
> > > >>>       alternative layout if it is explicitly asked for (e.g. schema
> > > >>>       inference should prefer the primary layout).
> > > >>>   * We can still vote for new primary layouts (e.g. promoting a
> > > >>>     canonical alternative) but, in these votes we don't only consider the
> > > >>>     value (e.g. performance) of the layout but also the interoperability.
> > > >>>     In other words, a layout can only become a primary layout if there is
> > > >>>     significant evidence that most implementations plan to adopt it.
> > >>>
> > >>>
> > > >>> To summarize some of the arguments against the proposal from the
> > > >>> previous threads, there are concerns about increasing the complexity of
> > > >>> the Arrow specification and the cost/burden of updating all of the Arrow
> > > >>> specifications to support them.
> > >>>
> > > >>> Where these discussions, both about several proposed new types and this
> > > >>> layout proposal, get to the core of Arrow is well expressed in the
> > > >>> comments on the previous thread by Raphael [3] and Pedro [4]. Raphael
> > > >>> asks: "what matters to people more, interoperability or best-in-class
> > > >>> performance?" And Pedro notes that the overhead of converting these
> > > >>> not-yet-Arrow types to the Arrow C ABI is high enough that they've
> > > >>> considered abandoning Arrow as their interchange format. So: on the one
> > > >>> hand, we're kinda choosing which quality we're optimizing for, but on
> > > >>> the other, interoperability and performance are dependent on each other.
> > >>>
> > > >>> What I see that we're trying to do here is find a way to expand the
> > > >>> Arrow specification just enough so that Arrow becomes or remains the
> > > >>> in-memory standard everywhere, but not so much that it creates too much
> > > >>> complexity or burden to implement. Expand too much and you get a
> > > >>> fragmented ecosystem where everyone is writing subsets of the Arrow
> > > >>> standard and so nothing is fully compatible and the whole premise is
> > > >>> undermined. But expand too little and projects will abandon the standard
> > > >>> and we've also failed.
> > >>>
> > > >>> I don't have a tidy answer, but I wanted to acknowledge the bigger
> > > >>> issues, and see if this helps us reason about the various proposals on
> > > >>> the table. I wonder if the alternative layout proposal is the happy
> > > >>> medium that adds some complexity to the specification, but less than
> > > >>> there would be if three new types were added, and still meets the needs
> > > >>> of projects like DuckDB, Velox, and Gluten and gets them fully Arrow
> > > >>> native.
> > >>>
> > >>> Neal
> > >>>
> > >>>
> > >>> [1]:
> https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> > >>> [2]:
> https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> > >>> [3]:
> https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> > >>> [4]:
> https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
> > >>>
>
