Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context, which can 
be found at [1].

I think alternative layouts sound like a nice first approach to allowing new
layouts that can be supported lazily (implemented when it is beneficial) by
various implementations of the Arrow Columnar Format. But I do think that it's
just a (practical) formalization of saying which layouts are required and which
ones are optional.
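
As a rough illustration of the required/optional split, here is a minimal
Python sketch. It is not taken from any Arrow library: `PRIMARY`,
`RUN_END_ENCODED`, `PlainColumn`, `RunEndColumn`, `from_primary`, `to_primary`,
and `export` are all hypothetical names, and a run-end style encoding only
stands in for some optional layout. It assumes the rules restated in the
proposal quoted below: an implementation that opts into an alternative layout
can still convert to and from the primary one, and hands out the alternative
only when explicitly asked.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical layout identifiers, for illustration only.
PRIMARY = "primary"        # layout every implementation must support
RUN_END_ENCODED = "ree"    # stand-in for an optional, alternative layout


@dataclass
class PlainColumn:
    """Primary layout: one value per row."""
    values: List[int]


@dataclass
class RunEndColumn:
    """Alternative layout: run_ends[i] is the exclusive end row of run i."""
    run_ends: List[int]
    values: List[int]


Column = Union[PlainColumn, RunEndColumn]


def from_primary(col: PlainColumn) -> RunEndColumn:
    """Convert the primary layout to the alternative layout."""
    run_ends: List[int] = []
    values: List[int] = []
    for row, value in enumerate(col.values):
        if values and values[-1] == value:
            run_ends[-1] = row + 1      # extend the current run
        else:
            run_ends.append(row + 1)    # start a new run
            values.append(value)
    return RunEndColumn(run_ends, values)


def to_primary(col: RunEndColumn) -> PlainColumn:
    """Convert the alternative layout back to the primary layout."""
    out: List[int] = []
    start = 0
    for end, value in zip(col.run_ends, col.values):
        out.extend([value] * (end - start))
        start = end
    return PlainColumn(out)


def export(col: Column, layout: str = PRIMARY) -> Column:
    """Return the primary layout unless the caller explicitly asks otherwise."""
    if layout == PRIMARY:
        return to_primary(col) if isinstance(col, RunEndColumn) else col
    if layout == RUN_END_ENCODED:
        return col if isinstance(col, RunEndColumn) else from_primary(col)
    raise ValueError(f"unsupported layout: {layout}")
```

A consumer that knows nothing about the optional layout would only ever call
`export(col)` and receive a `PlainColumn`; only a caller that passes
`layout=RUN_END_ENCODED` sees the alternative representation.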

From the making of the diagram, I also decided that the discussion isn't
limited to performance, since there are several reasons new physical layouts
may be proposed (or, at least, there are many aspects of performance). Even if
it's not "canonical alternative layouts," I think it is important that there
be some process for developers that use Arrow to propose extensions to the
columnar format without having to prove out the benefits for libraries that
use a different tech stack (e.g. Rust vs C++ vs Go).


[1]: https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


------- Original Message -------
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin 
<d...@voltrondata.com.INVALID> wrote:


> I am in favor of this proposal. IMO the Arrow project is the right place to
> standardize both the interoperability and operability of columnar data
> layouts. Data engines are a core component of the Arrow ecosystem and the
> project should be able to grow with these data engines as they converge on
> new layouts. Since columnar data is ubiquitous in analytical workloads, we
> are seeing a natural progression into optimizing those workloads. This
> includes new lossless compression schemes for columnar data that allow
> engines to operate directly on the compressed data (e.g. RLE). If we can't
> reliably support the growing needs of the broader data engine ecosystem in
> a timely manner, then I also fear Arrow might lose relevancy over time.
> 

> On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:
> 

> > Thank you Weston for proposing this solution and Neal for describing
> > its context and implications. I agree with the other replies here—this
> > seems like an elegant solution to a growing need that could, if left
> > unaddressed, increase the fragmentation of the ecosystem and reduce
> > the centrality of the Arrow format.
> > 

> > Greater diversity of layouts is happening. Whether it happens inside
> > of Arrow or outside of Arrow is up to us. I think we all would like to
> > see it happen inside of Arrow. This proposal allows for that, while
> > striking a balance as Raphael describes.
> > 

> > However I think there is still some ambiguity about exactly how an
> > Arrow implementation that is consuming/producing data would negotiate
> > with an Arrow implementation or other component that is
> > producing/consuming data to determine whether an alternative layout is
> > supported. This was discussed briefly in [5] but I am interested to
> > see how this negotiation would be implemented in practice in the C
> > data interface, IPC, Flight, etc.
> > 

> > Ian
> > 

> > [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
> > 

> > On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> > r.taylordav...@googlemail.com.invalid wrote:
> > 

> > > I like this proposal, I think it strikes a pragmatic balance between
> > > preserving interoperability whilst still allowing new ideas to be
> > > incorporated into the standard. Thank you for writing this up.
> > > 

> > > On 13/07/2023 10:22, Matt Topol wrote:
> > > 

> > > > I don't have much to add but I do want to second Jacob's comments. I
> > > > agree that this is a good way to avoid the fragmentation while keeping
> > > > Arrow relevant, and likely something we need to do so that we can ensure
> > > > Arrow remains the way to do this data integration and interoperability.
> > > > 

> > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > > ja...@voltrondata.com.invalid wrote:
> > > > 

> > > > > Hello Everyone,
> > > > > 

> > > > > Thanks for this comprehensive but concise write up Neal! I think this
> > > > > proposal is a good way to avoid both fragmentation of the Arrow
> > > > > ecosystem as well as its obsolescence. In my opinion, of these two
> > > > > problems the obsolescence is the bigger issue as (as mentioned in the
> > > > > proposal) Arrow is already (close to) being relegated to the sidelines
> > > > > in ecosystem-defining projects.
> > > > > 

> > > > > Jacob
> > > > > 

> > > > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > > > > neal.p.richard...@gmail.com> wrote:
> > > > > 

> > > > > > Hi all,
> > > > > > > As was previously raised in [1] and surfaced again in [2], there is a
> > > > > > > proposal for representing alternative layouts. The intent, as I
> > > > > > > understand it, is to be able to support memory layouts that some (but
> > > > > > > perhaps not all) applications of Arrow find valuable, so that these
> > > > > > > nearly Arrow systems can be fully Arrow-native.
> > > > > > 

> > > > > > > I wanted to start a more focused discussion on it because I think it's
> > > > > > > worth being considered on its own merits, but I also think this gets
> > > > > > > to the core of what the Arrow project is and should be, and I don't
> > > > > > > want us to lose sight of that.
> > > > > > 

> > > > > > > To restate the proposal from [1]:
> > > > > > 

> > > > > > > * There are one or more primary layouts
> > > > > > > * Existing layouts are automatically considered primary layouts, even
> > > > > > >   if they wouldn't have been primary layouts initially (e.g. large
> > > > > > >   list)
> > > > > > > * A new layout, if it is semantically equivalent to another, is
> > > > > > >   considered an alternative layout
> > > > > > > * An alternative layout still has the same requirements for adoption
> > > > > > >   (two implementations and a vote)
> > > > > > > * An implementation should not feel pressured to rush and implement
> > > > > > >   the new layout. It would be good if they contribute in the
> > > > > > >   discussion and consider the layout and vote if they feel it would
> > > > > > >   be an acceptable design.
> > > > > > > * We can define and vote and approve as many canonical alternative
> > > > > > >   layouts as we want:
> > > > > > >   * A canonical alternative layout should, at a minimum, have some
> > > > > > >     reasonable justification, such as improved performance for
> > > > > > >     algorithm X
> > > > > > > * Arrow implementations MUST support the primary layouts
> > > > > > > * An Arrow implementation MAY support a canonical alternative,
> > > > > > >   however:
> > > > > > >   * An Arrow implementation MUST first support the primary layout
> > > > > > >   * An Arrow implementation MUST support conversion to/from the
> > > > > > >     primary and canonical layout
> > > > > > >   * An Arrow implementation's APIs MUST only provide data in the
> > > > > > >     alternative layout if it is explicitly asked for (e.g. schema
> > > > > > >     inference should prefer the primary layout).
> > > > > > > * We can still vote for new primary layouts (e.g. promoting a
> > > > > > >   canonical alternative) but, in these votes we don't only consider
> > > > > > >   the value (e.g. performance) of the layout but also the
> > > > > > >   interoperability. In other words, a layout can only become a
> > > > > > >   primary layout if there is significant evidence that most
> > > > > > >   implementations plan to adopt it.
> > > > > > 

> > > > > > > To summarize some of the arguments against the proposal from the
> > > > > > > previous threads, there are concerns about increasing the complexity
> > > > > > > of the Arrow specification and the cost/burden of updating all of the
> > > > > > > Arrow specifications to support them.
> > > > > > 

> > > > > > > Where these discussions, both about several proposed new types and
> > > > > > > this layout proposal, get to the core of Arrow is well expressed in
> > > > > > > the comments on the previous thread by Raphael [3] and Pedro [4].
> > > > > > > Raphael asks: "what matters to people more, interoperability or
> > > > > > > best-in-class performance?" And Pedro notes that the overhead of
> > > > > > > converting these not-yet-Arrow types to the Arrow C ABI is high
> > > > > > > enough that they've considered abandoning Arrow as their interchange
> > > > > > > format. So: on the one hand, we're kinda choosing which quality
> > > > > > > we're optimizing for, but on the other, interoperability and
> > > > > > > performance are dependent on each other.
> > > > > > 

> > > > > > > What I see that we're trying to do here is find a way to expand the
> > > > > > > Arrow specification just enough so that Arrow becomes or remains the
> > > > > > > in-memory standard everywhere, but not so much that it creates too
> > > > > > > much complexity or burden to implement. Expand too much and you get a
> > > > > > > fragmented ecosystem where everyone is writing subsets of the Arrow
> > > > > > > standard and so nothing is fully compatible and the whole premise is
> > > > > > > undermined. But expand too little and projects will abandon the
> > > > > > > standard and we've also failed.
> > > > > > 

> > > > > > > I don't have a tidy answer, but I wanted to acknowledge the bigger
> > > > > > > issues, and see if this helps us reason about the various proposals
> > > > > > > on the table. I wonder if the alternative layout proposal is the
> > > > > > > happy medium that adds some complexity to the specification, but less
> > > > > > > than there would be if three new types were added, and still meets
> > > > > > > the needs of projects like DuckDB, Velox, and Gluten and gets them
> > > > > > > fully Arrow native.
> > > > > > 

> > > > > > Neal
