Thank you Neal for writing this summary and everyone whose thoughts went into the discussions -- I think the proposal, as summarized, offers a great path forward by allowing the various Arrow communities to specialize when advantageous but remain compatible.

On Thu, Jul 13, 2023 at 11:59 AM Ian Cook <ianmc...@apache.org> wrote:

> Thank you Weston for proposing this solution and Neal for describing
> its context and implications. I agree with the other replies here—this
> seems like an elegant solution to a growing need that could, if left
> unaddressed, increase the fragmentation of the ecosystem and reduce
> the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside
> of Arrow or outside of Arrow is up to us. I think we all would like to
> see it happen inside of Arrow. This proposal allows for that, while
> striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an
> Arrow implementation that is consuming/producing data would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported. This was discussed briefly in [5] but I am interested to
> see how this negotiation would be implemented in practice in the C
> data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > I like this proposal, I think it strikes a pragmatic balance between
> > preserving interoperability whilst still allowing new ideas to be
> > incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > I don't have much to add but I do want to second Jacob's comments. I agree
> > > that this is a good way to avoid the fragmentation while keeping Arrow
> > > relevant, and likely something we need to do so that we can ensure Arrow
> > > remains the way to do this data integration and interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > <ja...@voltrondata.com.invalid> wrote:
> > >
> > >> Hello Everyone,
> > >>
> > >> Thanks for this comprehensive but concise write up Neal! I think this
> > >> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> > >> as well as its obsolescence. In my opinion of these two problems the
> > >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> > >> already (close to) being relegated to the sidelines in eco-system defining
> > >> projects.
> > >>
> > >> Jacob
> > >>
> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > >> neal.p.richard...@gmail.com> wrote:
> > >>
> > >>> Hi all,
> > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > >>> proposal for representing alternative layouts. The intent, as I understand
> > >>> it, is to be able to support memory layouts that some (but perhaps not all)
> > >>> applications of Arrow find valuable, so that these nearly Arrow systems can
> > >>> be fully Arrow-native.
> > >>>
> > >>> I wanted to start a more focused discussion on it because I think it's
> > >>> worth being considered on its own merits, but I also think this gets to the
> > >>> core of what the Arrow project is and should be, and I don't want us to
> > >>> lose sight of that.
> > >>>
> > >>> To restate the proposal from [1]:
> > >>>
> > >>> * There are one or more primary layouts
> > >>>   * Existing layouts are automatically considered primary layouts, even if
> > >>>     they wouldn't have been primary layouts initially (e.g. large list)
> > >>> * A new layout, if it is semantically equivalent to another, is considered
> > >>>   an alternative layout
> > >>> * An alternative layout still has the same requirements for adoption (two
> > >>>   implementations and a vote)
> > >>>   * An implementation should not feel pressured to rush and implement the
> > >>>     new layout. It would be good if they contribute in the discussion and
> > >>>     consider the layout and vote if they feel it would be an acceptable
> > >>>     design.
> > >>> * We can define and vote and approve as many canonical alternative layouts
> > >>>   as we want:
> > >>>   * A canonical alternative layout should, at a minimum, have some
> > >>>     reasonable justification, such as improved performance for algorithm X
> > >>> * Arrow implementations MUST support the primary layouts
> > >>> * An Arrow implementation MAY support a canonical alternative, however:
> > >>>   * An Arrow implementation MUST first support the primary layout
> > >>>   * An Arrow implementation MUST support conversion to/from the primary and
> > >>>     canonical layout
> > >>>   * An Arrow implementation's APIs MUST only provide data in the
> > >>>     alternative layout if it is explicitly asked for (e.g. schema inference
> > >>>     should prefer the primary layout).
> > >>> * We can still vote for new primary layouts (e.g. promoting a canonical
> > >>>   alternative) but, in these votes we don't only consider the value (e.g.
> > >>>   performance) of the layout but also the interoperability. In other words,
> > >>>   a layout can only become a primary layout if there is significant
> > >>>   evidence that most implementations plan to adopt it.
> > >>>
> > >>>
> > >>> To summarize some of the arguments against the proposal from the previous
> > >>> threads, there are concerns about increasing the complexity of the Arrow
> > >>> specification and the cost/burden of updating all of the Arrow
> > >>> specifications to support them.
> > >>>
> > >>> Where these discussions, both about several proposed new types and this
> > >>> layout proposal, get to the core of Arrow is well expressed in the comments
> > >>> on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
> > >>> matters to people more, interoperability or best-in-class performance?" And
> > >>> Pedro notes that because of the overhead of converting these not-yet-Arrow
> > >>> types to the Arrow C ABI is high enough that they've considered abandoning
> > >>> Arrow as their interchange format. So: on the one hand, we're kinda
> > >>> choosing which quality we're optimizing for, but on the other,
> > >>> interoperability and performance are dependent on each other.
> > >>>
> > >>> What I see that we're trying to do here is find a way to expand the Arrow
> > >>> specification just enough so that Arrow becomes or remains the in-memory
> > >>> standard everywhere, but not so much that it creates too much complexity or
> > >>> burden to implement. Expand too much and you get a fragmented ecosystem
> > >>> where everyone is writing subsets of the Arrow standard and so nothing is
> > >>> fully compatible and the whole premise is undermined. But expand too little
> > >>> and projects will abandon the standard and we've also failed.
> > >>>
> > >>> I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
> > >>> and see if this helps us reason about the various proposals on the table. I
> > >>> wonder if the alternative layout proposal is the happy medium that adds
> > >>> some complexity to the specification, but less than there would be if three
> > >>> new types were added, and still meets the needs of projects like DuckDB,
> > >>> Velox, and Gluten and gets them fully Arrow native.
> > >>>
> > >>> Neal
> > >>>
> > >>>
> > >>> [1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> > >>> [2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> > >>> [3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> > >>> [4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
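
For anyone trying to picture the MUST/MAY rules in the quoted proposal, here is a rough sketch in Python. It is only an illustration, not an Arrow API: `Producer`, `export`, `negotiate`, and the layout names are all made up for this example. The thread explicitly leaves the negotiation step open, so `negotiate` shows just one possible shape, and the fallback to the primary layout on an unsupported request is my own assumption.

```python
from dataclasses import dataclass

# Hypothetical names throughout; nothing here is part of any Arrow library.
PRIMARY = "primary"


@dataclass
class Producer:
    """An implementation that supports the primary layout and, optionally,
    some canonical alternative layouts."""
    alternatives: frozenset = frozenset()

    def supported(self):
        # Rule: the primary layout MUST always be supported.
        return {PRIMARY} | set(self.alternatives)

    def export(self, requested=None):
        """Return the name of the layout the data would be provided in."""
        # Rule: an alternative layout is provided only if explicitly asked for.
        if requested is None:
            return PRIMARY
        if requested in self.supported():
            return requested
        # Assumption: an unsupported request falls back to the primary
        # layout, which every implementation can produce.
        return PRIMARY


def negotiate(producer, consumer_accepts):
    """One possible negotiation: take the first layout in the consumer's
    preference order that the producer supports, else use the primary."""
    for layout in consumer_accepts:
        if layout in producer.supported():
            return producer.export(requested=layout)
    return producer.export()
```

With this sketch, a producer created as `Producer(alternatives=frozenset({"alt_x"}))` hands out the primary layout from a plain `export()` call and switches to `"alt_x"` only when a caller names it, which is the opt-in behaviour the proposal asks for.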