Re: [DISCUSS] Canonical alternative layout proposal

Dane Pitkin Thu, 13 Jul 2023 10:49:30 -0700

I am in favor of this proposal. IMO the Arrow project is the right place to
standardize both the interoperability *and operability* of columnar data
layouts. Data engines are a core component of the Arrow ecosystem and the
project should be able to grow with these data engines as they converge on
new layouts. Since columnar data is ubiquitous in analytical workloads, we
are seeing a natural progression into optimizing those workloads. This
includes new lossless compression schemes for columnar data that allow
engines to operate directly on the compressed data (e.g. RLE). If we can't
reliably support the growing needs of the broader data engine ecosystem in
a timely manner, then I also fear Arrow might lose relevancy over time.
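
To make "operate directly on the compressed data" concrete, here is a minimal
sketch in plain Python. It is an illustration only, not Arrow's layout or any
Arrow API: the function names (`run_length_encode`, `rle_sum`,
`rle_count_equal`) are hypothetical, and it stores run lengths for simplicity.
The point is that aggregates can be computed per run, without expanding the
data back to one value per row.

```python
from typing import List, Sequence, Tuple


def run_length_encode(values: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (run_values, run_lengths) for a sequence of integers."""
    run_values: List[int] = []
    run_lengths: List[int] = []
    for v in values:
        if run_values and run_values[-1] == v:
            # Same value as the current run: extend it.
            run_lengths[-1] += 1
        else:
            # Value changed (or first element): start a new run.
            run_values.append(v)
            run_lengths.append(1)
    return run_values, run_lengths


def rle_sum(run_values: Sequence[int], run_lengths: Sequence[int]) -> int:
    """Sum the logical values without decoding: one multiply per run."""
    return sum(v * n for v, n in zip(run_values, run_lengths))


def rle_count_equal(
    run_values: Sequence[int], run_lengths: Sequence[int], needle: int
) -> int:
    """Count logical rows equal to `needle` by inspecting runs only."""
    return sum(n for v, n in zip(run_values, run_lengths) if v == needle)


values = [7, 7, 7, 2, 2, 9]
run_values, run_lengths = run_length_encode(values)
total = rle_sum(run_values, run_lengths)
sevens = rle_count_equal(run_values, run_lengths, 7)
```

The work done by `rle_sum` and `rle_count_equal` scales with the number of
runs rather than the number of rows, which is the kind of benefit an engine
gives up if it has to decode to a primary layout first.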


On Thu, Jul 13, 2023 at 11:59 AM Ian Cook <[email protected]> wrote:

> Thank you Weston for proposing this solution and Neal for describing
> its context and implications. I agree with the other replies here—this
> seems like an elegant solution to a growing need that could, if left
> unaddressed, increase the fragmentation of the ecosystem and reduce
> the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside
> of Arrow or outside of Arrow is up to us. I think we all would like to
> see it happen inside of Arrow. This proposal allows for that, while
> striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an
> Arrow implementation that is consuming/producing data would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported. This was discussed briefly in [5] but I am interested to
> see how this negotiation would be implemented in practice in the C
> data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> <[email protected]> wrote:
> >
> > I like this proposal, I think it strikes a pragmatic balance between
> > preserving interoperability whilst still allowing new ideas to be
> > incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > > I don't have much to add but I do want to second Jacob's comments. I
> > > > agree that this is a good way to avoid the fragmentation while keeping
> > > > Arrow relevant, and likely something we need to do so that we can
> > > > ensure Arrow remains the way to do this data integration and
> > > > interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > <[email protected]> wrote:
> > >
> > >> Hello Everyone,
> > >>
> > > >> Thanks for this comprehensive but concise write up Neal! I think this
> > > >> proposal is a good way to avoid both fragmentation of the Arrow
> > > >> ecosystem as well as its obsolescence. In my opinion, of these two
> > > >> problems obsolescence is the bigger issue because, as mentioned in the
> > > >> proposal, Arrow is already (close to) being relegated to the sidelines
> > > >> in ecosystem-defining projects.
> > >>
> > >> Jacob
> > >>
> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > >> [email protected]> wrote:
> > >>
> > >>> Hi all,
> > > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > > >>> proposal for representing alternative layouts. The intent, as I
> > > >>> understand it, is to be able to support memory layouts that some (but
> > > >>> perhaps not all) applications of Arrow find valuable, so that these
> > > >>> nearly Arrow systems can be fully Arrow-native.
> > >>>
> > > >>> I wanted to start a more focused discussion on it because I think it's
> > > >>> worth being considered on its own merits, but I also think this gets
> > > >>> to the core of what the Arrow project is and should be, and I don't
> > > >>> want us to lose sight of that.
> > >>>
> > >>> To restate the proposal from [1]:
> > >>>
> > > >>>   * There are one or more primary layouts
> > > >>>     * Existing layouts are automatically considered primary layouts,
> > > >>>       even if they wouldn't have been primary layouts initially
> > > >>>       (e.g. large list)
> > > >>>   * A new layout, if it is semantically equivalent to another, is
> > > >>>     considered an alternative layout
> > > >>>   * An alternative layout still has the same requirements for adoption
> > > >>>     (two implementations and a vote)
> > > >>>     * An implementation should not feel pressured to rush and implement
> > > >>>       the new layout. It would be good if they contribute in the
> > > >>>       discussion and consider the layout and vote if they feel it would
> > > >>>       be an acceptable design.
> > > >>>   * We can define and vote and approve as many canonical alternative
> > > >>>     layouts as we want:
> > > >>>     * A canonical alternative layout should, at a minimum, have some
> > > >>>       reasonable justification, such as improved performance for
> > > >>>       algorithm X
> > > >>>   * Arrow implementations MUST support the primary layouts
> > > >>>   * An Arrow implementation MAY support a canonical alternative, however:
> > > >>>     * An Arrow implementation MUST first support the primary layout
> > > >>>     * An Arrow implementation MUST support conversion to/from the
> > > >>>       primary and canonical layout
> > > >>>     * An Arrow implementation's APIs MUST only provide data in the
> > > >>>       alternative layout if it is explicitly asked for (e.g. schema
> > > >>>       inference should prefer the primary layout).
> > > >>>   * We can still vote for new primary layouts (e.g. promoting a
> > > >>>     canonical alternative) but, in these votes we don't only consider
> > > >>>     the value (e.g. performance) of the layout but also the
> > > >>>     interoperability. In other words, a layout can only become a
> > > >>>     primary layout if there is significant evidence that most
> > > >>>     implementations plan to adopt it.
> > >>>
> > >>>
> > > >>> To summarize some of the arguments against the proposal from the
> > > >>> previous threads, there are concerns about increasing the complexity of
> > > >>> the Arrow specification and the cost/burden of updating all of the
> > > >>> Arrow specifications to support them.
> > >>>
> > > >>> Where these discussions, both about several proposed new types and this
> > > >>> layout proposal, get to the core of Arrow is well expressed in the
> > > >>> comments on the previous thread by Raphael [3] and Pedro [4]. Raphael
> > > >>> asks: "what matters to people more, interoperability or best-in-class
> > > >>> performance?" And Pedro notes that the overhead of converting these
> > > >>> not-yet-Arrow types to the Arrow C ABI is high enough that they've
> > > >>> considered abandoning Arrow as their interchange format. So: on the one
> > > >>> hand, we're kinda choosing which quality we're optimizing for, but on
> > > >>> the other, interoperability and performance are dependent on each other.
> > >>>
> > > >>> What I see that we're trying to do here is find a way to expand the
> > > >>> Arrow specification just enough so that Arrow becomes or remains the
> > > >>> in-memory standard everywhere, but not so much that it creates too much
> > > >>> complexity or burden to implement. Expand too much and you get a
> > > >>> fragmented ecosystem where everyone is writing subsets of the Arrow
> > > >>> standard and so nothing is fully compatible and the whole premise is
> > > >>> undermined. But expand too little and projects will abandon the
> > > >>> standard and we've also failed.
> > >>>
> > > >>> I don't have a tidy answer, but I wanted to acknowledge the bigger
> > > >>> issues, and see if this helps us reason about the various proposals on
> > > >>> the table. I wonder if the alternative layout proposal is the happy
> > > >>> medium that adds some complexity to the specification, but less than
> > > >>> there would be if three new types were added, and still meets the needs
> > > >>> of projects like DuckDB, Velox, and Gluten and gets them fully Arrow
> > > >>> native.
> > >>>
> > >>> Neal
> > >>>
> > >>>
> > >>> [1]:
> https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> > >>> [2]:
> https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> > >>> [3]:
> https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> > >>> [4]:
> https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
> > >>>
>
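
For readers trying to picture the MUST/MAY rules in the restated proposal
quoted above, here is one possible reading as a small Python sketch. Everything
in it is hypothetical: `PrimaryColumn`, `RunLengthColumn`, `export`, and the
`want_alternative` flag are invented for illustration and do not correspond to
any Arrow API, and it says nothing about how negotiation would work in the C
data interface, IPC, or Flight. It only shows the three constraints together:
the primary layout is always supported, conversion exists in both directions,
and the alternative layout is handed out only when explicitly requested.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class PrimaryColumn:
    """Hypothetical primary layout: one value per row."""
    values: List[int]


@dataclass
class RunLengthColumn:
    """Hypothetical alternative layout: (value, run length) pairs."""
    runs: List[Tuple[int, int]]


Column = Union[PrimaryColumn, RunLengthColumn]


def to_alternative(col: PrimaryColumn) -> RunLengthColumn:
    """Primary -> alternative conversion (required if the alternative is supported)."""
    runs: List[Tuple[int, int]] = []
    for v in col.values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return RunLengthColumn(runs)


def to_primary(col: RunLengthColumn) -> PrimaryColumn:
    """Alternative -> primary conversion (required if the alternative is supported)."""
    return PrimaryColumn([v for v, n in col.runs for _ in range(n)])


def export(col: Column, want_alternative: bool = False) -> Column:
    """Hand data to a consumer; primary layout unless explicitly asked otherwise."""
    if want_alternative:
        return col if isinstance(col, RunLengthColumn) else to_alternative(col)
    return col if isinstance(col, PrimaryColumn) else to_primary(col)


# An engine may hold data internally in the alternative layout...
stored = to_alternative(PrimaryColumn([1, 1, 2]))
# ...but a consumer that does not ask for it still receives the primary layout.
default_view = export(stored)
explicit_view = export(stored, want_alternative=True)
```

In this sketch a consumer that knows nothing about the alternative layout
never sees it, which is the interoperability guarantee the proposal is trying
to preserve.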
