Re: [DISCUSS] Canonical alternative layout proposal

2023-08-05 Thread Felipe Oliveira Carvalho
> I think this is similar to the proposal with the exception that your
> suggestion would require amending existing types that happen to be
> alternatives to each other.

I want to avoid electing one canonical layout for a kind (AKA "logical
type"). And the existence of "alternative layouts" implies the existence of
a canonical layout.

In my suggestion, a layout being canonical is not a property of the
specification, but a choice of the system implementing the specification.

One concrete example of this is how Polars elected LargeList as the
canonical type for the List logical type [1] while Velox settled on a list
array representation based on 32-bit offsets and sizes.
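
To make the difference concrete, here is a minimal Python sketch of the
conversion one system would need before handing list data to the other. It
assumes a simplified model: plain Python sequences for the child values, no
validity bitmap, and a made-up function name (list_view_to_large_list) that
is not an Arrow, Polars, or Velox API.

```python
from array import array


def list_view_to_large_list(offsets, sizes, values):
    """Convert an (offsets, sizes) list representation into a single
    monotonically increasing 64-bit offsets buffer plus reordered values.

    Hypothetical helper for illustration only; nulls are ignored.
    """
    new_offsets = array("q", [0])  # "q" = signed 64-bit integers
    new_values = []
    for start, size in zip(offsets, sizes):
        # Each element may point anywhere in `values`, so the child data
        # has to be copied into contiguous, in-order storage.
        new_values.extend(values[start:start + size])
        new_offsets.append(len(new_values))
    return new_offsets, new_values
```

Note that the conversion copies the child values, which is the runtime cost
being discussed in this thread.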



The specification can define the rules of communication upfront, achieving
two goals:

1) implementations can add new layouts and immediately inter-operate better
with other implementations
2) implementations can add new behaviors without concerning themselves with
new layouts other implementations are adding

This is not a full solution to the "expression problem" because we are
still left with some conversions at runtime, but as each implementation
gets closer to understanding all layouts, the conversions disappear.

If we settle on canonical layouts, communication is forced to always
convert to the canonical layout when passing data around, penalizing
layouts that are better* for computation.
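
A rough Python sketch of the rule being argued for: send data as-is when the
receiver already understands the sender's layout, and convert only otherwise.
The function name plan_transfer, the layout names, and the shape of the
conversions table are all assumptions made for this example; nothing here is
specified by Arrow.

```python
def plan_transfer(sender_layout, receiver_layouts, conversions):
    """Pick the layout to send and report whether a conversion is needed.

    conversions maps a source layout to the set of layouts it can be
    converted to (hypothetical table, for illustration).
    """
    if sender_layout in receiver_layouts:
        # Receiver understands the layout: no conversion at all.
        return sender_layout, False
    for target in sorted(conversions.get(sender_layout, ())):
        if target in receiver_layouts:
            return target, True
    raise ValueError(f"no common layout for {sender_layout!r}")
```

With a single mandated canonical layout, the first branch would only ever be
taken when the sender already uses that canonical layout.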

[1]
https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L247

* in terms of speed, and memory consumption, but not binary size

On Thu, Aug 3, 2023 at 12:28 AM Weston Pace  wrote:
>
> > I would welcome a draft PR showcasing the changes necessary in the IPC
> > format definition, and in the C Data Interface specification (no need to
> > actually implement them for now :-)).
>
> I've proposed something at [1].
>
> > One sketch of an idea: define sets of types that we can call “kinds”**
> > (e.g. “string kind” = {string, string view, large string, ree…},
> > “list kind” = {list, large_list, list_view, large_list_view…}).
>
> I think this is similar to the proposal with the exception that your
> suggestion would require amending existing types that happen to be
> alternatives to each other.  I'm not opposed to it but I think it's
> compatible and we don't necessarily need all of the complexity just yet
> (feel free to correct me if I'm wrong).  I don't think we need to introduce
> the concept of "kind".  We already have a concept of "logical type" in the
> spec.  I think what you are stating is that a single logical type may have
> multiple physical layouts.  I agree.  E.g. variable size list<32>, variable
> size list<64>, and REE are the physical layouts that, combined with the
> logical type "string", give you "string", "large string", and "ree"
>
> [1] https://github.com/apache/arrow/pull/37000
>
> On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho <felipe...@gmail.com>
> wrote:
>
> > A major difficulty in making the Arrow array types open for extension [1]
> > is that as soon as we define an (a) universal representation* or (b)
> > abstract interface, we close the door for vectorization. (a) prevents
> > having new vectorization friendly formats and (b) limits the implementation
> > of new vectorized operations. This is an instance of the “expression
> > problem” [2].
> >
> > The way Arrow currently “solves” the data abstraction problem is by having
> > no data abstraction: every operation takes a type and should provide
> > specializations for every type. Sometimes it’s possible to re-use the same
> > kernel for different types, but the general approach is that we specialize
> > (in the case of C++, we sometimes can specialize by just instantiating a
> > template, but that’s still a specialization).
> >
> > Given these constraints, what could be done?
> >
> > One sketch of an idea: define sets of types that we can call “kinds”**
> > (e.g. “string kind” = {string, string view, large string, ree…},
> > “list kind” = {list, large_list, list_view, large_list_view…}).
> >
> > Then when different implementations have to communicate or interoperate,
> > they have to only be up to date on the list of Arrow Kinds and before data
> > is moved a conversion step between types within the same kind is performed
> > if required to make that communication possible.
> >
> > Example: a system that has a string_view Array and needs to send that array
> > to a system that only understands large_string instances of the string kind
> > MUST perform a conversion. This means that as long as all Arrow
> > implementations understand one established type on each of the kinds, they
> > can communicate.
> >
> > This imposes a reasonable requirement on new types: when introduced, they
> > should come with conversions to the previously specified types on that
> > kind.
> >
> > Any thoughts?
> >
> > —
> > Felipe
> > Voltron Data
> >
> >
> > [1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
> > [2] 

Re: [DISCUSS] Canonical alternative layout proposal

2023-08-02 Thread Weston Pace
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).

I've proposed something at [1].

> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).

I think this is similar to the proposal with the exception that your
suggestion would require amending existing types that happen to be
alternatives to each other.  I'm not opposed to it but I think it's
compatible and we don't necessarily need all of the complexity just yet
(feel free to correct me if I'm wrong).  I don't think we need to introduce
the concept of "kind".  We already have a concept of "logical type" in the
spec.  I think what you are stating is that a single logical type may have
multiple physical layouts.  I agree.  E.g. variable size list<32>, variable
size list<64>, and REE are the physical layouts that, combined with the
logical type "string", give you "string", "large string", and "ree"
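
The pairing described above can be written down as a small lookup table. This
is only a sketch in Python: the layout identifiers and the helper names
(layouts_for, type_name) are invented for illustration, and the type names are
copied from the example in the paragraph above.

```python
# (logical type, physical layout) -> resulting type name.
# Layout identifiers are hypothetical spellings of the layouts named above.
TYPE_NAMES = {
    ("string", "variable_size_list_32"): "string",
    ("string", "variable_size_list_64"): "large string",
    ("string", "ree"): "ree",
}


def layouts_for(logical_type):
    """All physical layouts known for one logical type, in sorted order."""
    return sorted(layout for (logical, layout) in TYPE_NAMES
                  if logical == logical_type)


def type_name(logical_type, layout):
    """Look up the concrete type produced by a logical type and a layout."""
    return TYPE_NAMES[(logical_type, layout)]
```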

[1] https://github.com/apache/arrow/pull/37000

On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho 
wrote:

> A major difficulty in making the Arrow array types open for extension [1]
> is that as soon as we define an (a) universal representation* or (b)
> abstract interface, we close the door for vectorization. (a) prevents
> having new vectorization friendly formats and (b) limits the implementation
> of new vectorized operations. This is an instance of the “expression
> problem” [2].
>
> The way Arrow currently “solves” the data abstraction problem is by having
> no data abstraction — every operation takes a type and should provide
> specializations for every type. Sometimes it’s possible to re-use the same
> kernel for different types, but the general approach is that we specialize
> (in the case of C++, we sometimes can specialize by just instantiating a
> template, but that’s still a specialization).
>
> Given these constraints, what could be done?
>
> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).
>
> Then when different implementations have to communicate or interoperate,
> they have to only be up to date on the list of Arrow Kinds and before data
> is moved a conversion step between types within the same kind is performed
> if required to make that communication possible.
>
> Example: a system that has a string_view Array and needs to send that array
> to a system that only understands large_string instances of the string kind
> MUST perform a conversion. This means that as long as all Arrow
> implementations understand one established type on each of the kinds, they
> can communicate.
>
> This imposes a reasonable requirement on new types: when introduced, they
> should come with conversions to the previously specified types on that
> kind.
>
> Any thoughts?
>
> —
> Felipe
> Voltron Data
>
>
> [1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
> [2] https://en.wikipedia.org/wiki/Expression_problem
>
> * “an array is a list of buffers and child arrays” doesn’t qualify as
> “universal representation” because it doesn’t make a commitment on what all
> the buffers and child arrays mean universally
>
> ** if kind is already taken to mean scalar/array, we can use the term
> “sort”
>
> On Mon, 31 Jul 2023 at 04:39 Gang Wu  wrote:
>
> > I am also in favor of the idea of an alternative layout. IIRC, a new
> > alternative layout still goes into a process of standardization though it
> > is the choice of each implementation to decide support now or later. I'd
> > like to ask if we can provide the flexibility for implementations or
> > downstream projects to actually implement a new alternative layout by means
> > of a pluggable interface before starting the standardization process. This
> > is similar to promoting a popular extension type implemented by many users
> > to a canonical extension type.
> > I know this is more complicated as extension type simply reuses existing
> > layout but alternative layout usually means a brand new one. For example,
> > if two projects speak Arrow and now they want to share a new layout, they
> > can simply implement a pluggable alternative layout before Arrow adopts it.
> > This can unblock projects to evolve and help Arrow not to be fragmented.
> >
> > Best,
> > Gang
> >
> > On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Hello,
> > >
> > > I'm trying to reason about the advantages and drawbacks of this
> > > proposal, but it seems to me that it lacks definition.
> > >
> > > I would welcome a draft PR showcasing the changes necessary in the IPC
> > > format definition, and in the C Data Interface specification (no need
> 

Re: [DISCUSS] Canonical alternative layout proposal

2023-08-01 Thread Felipe Oliveira Carvalho
A major difficulty in making the Arrow array types open for extension [1]
is that as soon as we define an (a) universal representation* or (b)
abstract interface, we close the door for vectorization. (a) prevents
having new vectorization friendly formats and (b) limits the implementation
of new vectorized operations. This is an instance of the “expression
problem” [2].

The way Arrow currently “solves” the data abstraction problem is by having
no data abstraction — every operation takes a type and should provide
specializations for every type. Sometimes it’s possible to re-use the same
kernel for different types, but the general approach is that we specialize
(in the case of C++, we sometimes can specialize by just instantiating a
template, but that’s still a specialization).

Given these constraints, what could be done?

One sketch of an idea: define sets of types that we can call “kinds”**
(e.g. “string kind” = {string, string view, large string, ree…},
“list kind” = {list, large_list, list_view, large_list_view…}).

Then when different implementations have to communicate or interoperate,
they have to only be up to date on the list of Arrow Kinds and before data
is moved a conversion step between types within the same kind is performed
if required to make that communication possible.

Example: a system that has a string_view Array and needs to send that array
to a system that only understands large_string instances of the string kind
MUST perform a conversion. This means that as long as all Arrow
implementations understand one established type on each of the kinds, they
can communicate.

This imposes a reasonable requirement on new types: when introduced, they
should come with conversions to the previously specified types on that kind.
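
As a sketch of what such a mandatory conversion could look like for the
string_view to large_string example, here is a short Python version. It uses a
deliberately simplified model that is an assumption of this example, not the
actual string view memory layout: each view is a (buffer_index, start, length)
tuple, there is no validity bitmap, and the function name is hypothetical.

```python
from array import array


def string_view_to_large_string(views, buffers):
    """Gather view-addressed strings into one contiguous data buffer
    with 64-bit offsets (the large_string shape).

    views:   iterable of (buffer_index, start, length) tuples (simplified)
    buffers: list of bytes objects holding the character data
    """
    offsets = array("q", [0])  # signed 64-bit offsets
    data = bytearray()
    for buf_idx, start, length in views:
        data += buffers[buf_idx][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)
```

The receiver only needs to understand the offsets-plus-data shape; the sender
pays for the copy.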

Any thoughts?

—
Felipe
Voltron Data


[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
[2] https://en.wikipedia.org/wiki/Expression_problem

* “an array is a list of buffers and child arrays” doesn’t qualify as
“universal representation” because it doesn’t make a commitment on what all
the buffers and child arrays mean universally

** if kind is already taken to mean scalar/array, we can use the term “sort”

On Mon, 31 Jul 2023 at 04:39 Gang Wu  wrote:

> I am also in favor of the idea of an alternative layout. IIRC, a new
> alternative layout still goes into a process of standardization though it
> is the choice of each implementation to decide support now or later. I'd
> like to ask if we can provide the flexibility for implementations or
> downstream projects to actually implement a new alternative layout by means
> of a pluggable interface before starting the standardization process. This
> is similar to promoting a popular extension type implemented by many users
> to a canonical extension type.
> I know this is more complicated as extension type simply reuses existing
> layout but alternative layout usually means a brand new one. For example,
> if two projects speak Arrow and now they want to share a new layout, they
> can simply implement a pluggable alternative layout before Arrow adopts it.
> This can unblock projects to evolve and help Arrow not to be fragmented.
>
> Best,
> Gang
>
> On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I'm trying to reason about the advantages and drawbacks of this
> > proposal, but it seems to me that it lacks definition.
> >
> > I would welcome a draft PR showcasing the changes necessary in the IPC
> > format definition, and in the C Data Interface specification (no need to
> > actually implement them for now :-)).
> >
> >
> > As it is, it seems that this proposal would allow us to switch from:
> >
> > """We'd like to add a more efficient physical data representation, so
> > we'll introduce a new Arrow data type. Implementations may or may not
> > support it, but we will progressively try to bring reference
> > implementations to parity.""" (1)
> >
> > to:
> >
> > """We'd like to add a more efficient physical data representation, so
> > we'll introduce a new alternative layout for an existing Arrow data
> > type. Implementations may or may not support it, but we will
> > progressively try to bring reference implementations to parity.""" (2)
> >
> > The expected advantage of (2) over (1) seems to be mainly a difference
> > in how new format features are communicated. There are mainline
> > features, and there are experimental / provisional features.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > Le 13/07/2023 à 00:01, Neal Richardson a écrit :
> > > Hi all,
> > > As was previously raised in [1] and surfaced again in [2], there is a
> > > proposal for representing alternative layouts. The intent, as I understand
> > > it, is to be able to support memory layouts that some (but perhaps not all)
> > > applications of Arrow find valuable, so that these nearly Arrow systems can
> > > be fully Arrow-native.
> > >
> > > I wanted to start a more focused discussion on 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-30 Thread Gang Wu
I am also in favor of the idea of an alternative layout. IIRC, a new
alternative layout still goes through a process of standardization, though
each implementation chooses whether to support it now or later. I'd like to
ask if we can provide the flexibility for implementations or downstream
projects to actually implement a new alternative layout by means of a
pluggable interface before starting the standardization process. This is
similar to promoting a popular extension type implemented by many users to a
canonical extension type. I know this is more complicated, as an extension
type simply reuses an existing layout whereas an alternative layout usually
means a brand new one. For example, if two projects speak Arrow and now want
to share a new layout, they can simply implement a pluggable alternative
layout before Arrow adopts it. This can unblock projects to evolve and help
keep Arrow from fragmenting.

Best,
Gang

On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou  wrote:

>
> Hello,
>
> I'm trying to reason about the advantages and drawbacks of this
> proposal, but it seems to me that it lacks definition.
>
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).
>
>
> As it is, it seems that this proposal would allow us to switch from:
>
> """We'd like to add a more efficient physical data representation, so
> we'll introduce a new Arrow data type. Implementations may or may not
> support it, but we will progressively try to bring reference
> implementations to parity.""" (1)
>
> to:
>
> """We'd like to add a more efficient physical data representation, so
> we'll introduce a new alternative layout for an existing Arrow data
> type. Implementations may or may not support it, but we will
> progressively try to bring reference implementations to parity.""" (2)
>
> The expected advantage of (2) over (1) seems to be mainly a difference
> in how new format features are communicated. There are mainline
> features, and there are experimental / provisional features.
>
> Regards
>
> Antoine.
>
>
>
> Le 13/07/2023 à 00:01, Neal Richardson a écrit :
> > Hi all,
> > As was previously raised in [1] and surfaced again in [2], there is a
> > proposal for representing alternative layouts. The intent, as I understand
> > it, is to be able to support memory layouts that some (but perhaps not all)
> > applications of Arrow find valuable, so that these nearly Arrow systems can
> > be fully Arrow-native.
> >
> > I wanted to start a more focused discussion on it because I think it's
> > worth being considered on its own merits, but I also think this gets to the
> > core of what the Arrow project is and should be, and I don't want us to
> > lose sight of that.
> >
> > To restate the proposal from [1]:
> >
> >   * There are one or more primary layouts
> >     * Existing layouts are automatically considered primary layouts, even if
> >       they wouldn't have been primary layouts initially (e.g. large list)
> >   * A new layout, if it is semantically equivalent to another, is considered
> >     an alternative layout
> >   * An alternative layout still has the same requirements for adoption (two
> >     implementations and a vote)
> >     * An implementation should not feel pressured to rush and implement the
> >       new layout. It would be good if they contribute in the discussion and
> >       consider the layout and vote if they feel it would be an acceptable
> >       design.
> >   * We can define and vote and approve as many canonical alternative layouts
> >     as we want:
> >     * A canonical alternative layout should, at a minimum, have some
> >       reasonable justification, such as improved performance for algorithm X
> >   * Arrow implementations MUST support the primary layouts
> >   * An Arrow implementation MAY support a canonical alternative, however:
> >     * An Arrow implementation MUST first support the primary layout
> >     * An Arrow implementation MUST support conversion to/from the primary and
> >       canonical layout
> >     * An Arrow implementation's APIs MUST only provide data in the
> >       alternative layout if it is explicitly asked for (e.g. schema inference
> >       should prefer the primary layout).
> >   * We can still vote for new primary layouts (e.g. promoting a canonical
> >     alternative) but, in these votes we don't only consider the value (e.g.
> >     performance) of the layout but also the interoperability. In other words,
> >     a layout can only become a primary layout if there is significant
> >     evidence that most implementations plan to adopt it.
> >
> >
> > To summarize some of the arguments against the proposal from the previous
> > threads, there are concerns about increasing the complexity of the Arrow
> > specification and the cost/burden of updating all of the Arrow
> > specifications to support them.
> >
> > Where these discussions, both about several proposed new types and 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-18 Thread Antoine Pitrou



Hello,

I'm trying to reason about the advantages and drawbacks of this 
proposal, but it seems to me that it lacks definition.


I would welcome a draft PR showcasing the changes necessary in the IPC 
format definition, and in the C Data Interface specification (no need to 
actually implement them for now :-)).



As it is, it seems that this proposal would allow us to switch from:

"""We'd like to add a more efficient physical data representation, so 
we'll introduce a new Arrow data type. Implementations may or may not 
support it, but we will progressively try to bring reference 
implementations to parity.""" (1)


to:

"""We'd like to add a more efficient physical data representation, so 
we'll introduce a new alternative layout for an existing Arrow data 
type. Implementations may or may not support it, but we will 
progressively try to bring reference implementations to parity.""" (2)


The expected advantage of (2) over (1) seems to be mainly a difference 
in how new format features are communicated. There are mainline 
features, and there are experimental / provisional features.


Regards

Antoine.



Le 13/07/2023 à 00:01, Neal Richardson a écrit :

Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts, even if
      they wouldn't have been primary layouts initially (e.g. large list)
  * A new layout, if it is semantically equivalent to another, is considered
    an alternative layout
  * An alternative layout still has the same requirements for adoption (two
    implementations and a vote)
    * An implementation should not feel pressured to rush and implement the
      new layout. It would be good if they contribute in the discussion and
      consider the layout and vote if they feel it would be an acceptable
      design.
  * We can define and vote and approve as many canonical alternative layouts
    as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for algorithm X
  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative, however:
    * An Arrow implementation MUST first support the primary layout
    * An Arrow implementation MUST support conversion to/from the primary and
      canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
      alternative layout if it is explicitly asked for (e.g. schema inference
      should prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a canonical
    alternative) but, in these votes we don't only consider the value (e.g.
    performance) of the layout but also the interoperability. In other words,
    a layout can only become a primary layout if there is significant
    evidence that most implementations plan to adopt it.
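
The MUST/MAY rules in the list above can be read as a checklist. A minimal
Python sketch of such a check follows; the Implementation record, its field
names, and the layout names used with it are all hypothetical and exist only
to illustrate the rules as restated here.

```python
from dataclasses import dataclass, field


@dataclass
class Implementation:
    """What one (hypothetical) Arrow implementation claims to support."""
    primary: set = field(default_factory=set)         # primary layouts
    alternatives: dict = field(default_factory=dict)  # alternative -> its primary
    conversions: set = field(default_factory=set)     # (source, target) pairs


def violations(impl, required_primary):
    """Return a list of rule violations; an empty list means conformant."""
    problems = []
    # Rule: implementations MUST support the primary layouts.
    for layout in sorted(required_primary - impl.primary):
        problems.append(f"missing primary layout: {layout}")
    for alt, prim in sorted(impl.alternatives.items()):
        # Rule: MUST first support the primary layout.
        if prim not in impl.primary:
            problems.append(f"{alt} supported without primary {prim}")
        # Rule: MUST support conversion to/from the primary layout.
        for src, dst in ((alt, prim), (prim, alt)):
            if (src, dst) not in impl.conversions:
                problems.append(f"missing conversion: {src} -> {dst}")
    return problems
```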


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that the overhead of converting these not-yet-Arrow
types to the Arrow C ABI is high enough that they've considered abandoning
Arrow as their interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-14 Thread Andrew Lamb
Thank you Neal for writing this summary and everyone whose thoughts went
into the discussions -- I think the proposal, as summarized, offers a great
path forward by allowing the various Arrow communities to specialize when
advantageous but remain compatible.


On Thu, Jul 13, 2023 at 11:59 AM Ian Cook  wrote:

> Thank you Weston for proposing this solution and Neal for describing
> its context and implications. I agree with the other replies here—this
> seems like an elegant solution to a growing need that could, if left
> unaddressed, increase the fragmentation of the ecosystem and reduce
> the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside
> of Arrow or outside of Arrow is up to us. I think we all would like to
> see it happen inside of Arrow. This proposal allows for that, while
> striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an
> Arrow implementation that is consuming/producing data would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported. This was discussed briefly in [5] but I am interested to
> see how this negotiation would be implemented in practice in the C
> data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
>  wrote:
> >
> > I like this proposal, I think it strikes a pragmatic balance between
> > preserving interoperability whilst still allowing new ideas to be
> > incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > I don't have much to add but I do want to second Jacob's comments. I agree
> > > that this is a good way to avoid the fragmentation while keeping Arrow
> > > relevant, and likely something we need to do so that we can ensure Arrow
> > > remains the way to do this data integration and interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > >  wrote:
> > >
> > >> Hello Everyone,
> > >>
> > >> Thanks for this comprehensive but concise write up Neal! I think this
> > >> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> > >> as well as its obsolescence. In my opinion of these two problems the
> > >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> > >> already (close to) being relegated to the sidelines in eco-system defining
> > >> projects.
> > >>
> > >> Jacob
> > >>
> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > >> neal.p.richard...@gmail.com> wrote:
> > >>
> > >>> Hi all,
> > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > >>> proposal for representing alternative layouts. The intent, as I understand
> > >>> it, is to be able to support memory layouts that some (but perhaps not all)
> > >>> applications of Arrow find valuable, so that these nearly Arrow systems can
> > >>> be fully Arrow-native.
> > >>>
> > >>> I wanted to start a more focused discussion on it because I think it's
> > >>> worth being considered on its own merits, but I also think this gets to the
> > >>> core of what the Arrow project is and should be, and I don't want us to
> > >>> lose sight of that.
> > >>>
> > >>> To restate the proposal from [1]:
> > >>>
> > >>>   * There are one or more primary layouts
> > >>>     * Existing layouts are automatically considered primary layouts, even if
> > >>>       they wouldn't have been primary layouts initially (e.g. large list)
> > >>>   * A new layout, if it is semantically equivalent to another, is considered
> > >>>     an alternative layout
> > >>>   * An alternative layout still has the same requirements for adoption (two
> > >>>     implementations and a vote)
> > >>>     * An implementation should not feel pressured to rush and implement the
> > >>>       new layout. It would be good if they contribute in the discussion and
> > >>>       consider the layout and vote if they feel it would be an acceptable
> > >>>       design.
> > >>>   * We can define and vote and approve as many canonical alternative layouts
> > >>>     as we want:
> > >>>     * A canonical alternative layout should, at a minimum, have some
> > >>>       reasonable justification, such as improved performance for algorithm X
> > >>>   * Arrow implementations MUST support the primary layouts
> > >>>   * An Arrow implementation MAY support a canonical alternative, however:
> > >>>     * An Arrow implementation MUST first support the primary layout
> > >>>     * An Arrow implementation MUST support conversion to/from the primary and
> > >>>       canonical layout
> > >>>     * An Arrow implementation's APIs MUST only provide data in the
> > >>>       alternative layout if it is explicitly asked for (e.g. 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Raphael Taylor-Davies

> clarify what constitutes support for a canonical alternative
> layout


I had envisaged, perhaps naively, that we would just add a new DataType 
containing a string layout name, perhaps DataType::Raw(String). This 
would have no restrictions on the number of buffers, children, etc... 
and would effectively just be an opaque ArrayData. As interpreting such 
an array would require the layout name, I think it warrants inclusion at 
a lower level than just Field metadata. This is in contrast to extension 
types, where this metadata is not strictly necessary to operate on the 
arrays.
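
A rough sketch of that idea, written in Python rather than as a real
DataType variant: the class names RawType and ArrayData mirror the names
mentioned above, but their fields and the can_interpret helper are
assumptions made for this example and do not describe any existing Arrow
implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class RawType:
    """Opaque data type identified only by its layout name."""
    layout_name: str


@dataclass
class ArrayData:
    """Buffers and children with no restriction on their number."""
    data_type: object
    length: int
    buffers: List[bytes] = field(default_factory=list)
    children: List["ArrayData"] = field(default_factory=list)


def can_interpret(array, known_layouts):
    """A consumer can only read a raw array if it knows the layout name."""
    dt = array.data_type
    return not isinstance(dt, RawType) or dt.layout_name in known_layouts
```

Because the layout name lives in the type itself, a consumer that does not
recognise it can still pass the array along untouched.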


I haven't given this a huge amount of thought though, and so it is 
entirely possible this has been discounted for some reason, or has some 
peculiar edge cases.



> would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported

My interpretation of the proposal was that no such negotiation would 
take place, instead the primary layout would always be chosen except in 
the presence of an explicit contrary signal. This is not hugely 
dissimilar from how dictionaries or run-encoded arrays are currently 
handled, where they will only be returned if either present in the input 
or explicitly requested.


Perhaps we might clarify this with something along the lines of:

- An Arrow implementation's APIs MAY produce data in an alternative 
layout to match one or more of its inputs, or an embedded arrow schema


This covers both the cases of writing data to files and sending data 
over FFI. It does, however, carry the implication that alternative 
layouts are viral. I therefore wonder if we should generalize the wording to:


- Arrow-native APIs MUST only produce data in an alternative layout if 
it is already present in one of its inputs, or explicitly requested


This is to help ensure that users only end up with alternative layouts 
if they explicitly opt-in to this behaviour. What I think we want to 
avoid is systems producing alternative layouts by default, and this then 
leading to user confusion when data produced by one system is not 
interoperable with those of another.
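
The generalized wording can be read as a simple predicate. The sketch below is one possible reading, not proposed specification text; the function name may_produce and the layout labels "primary" and "view" are assumptions made for illustration.

```python
from typing import Iterable, Optional

PRIMARY = "primary"  # hypothetical label for the primary layout


def may_produce(output_layout: str,
                input_layouts: Iterable[str],
                requested: Optional[str] = None) -> bool:
    """Return True if an API is allowed to emit `output_layout`."""
    if output_layout == PRIMARY:
        return True  # the primary layout is always allowed
    # An alternative layout needs an explicit signal: either the caller
    # asked for it, or it was already present in one of the inputs.
    return output_layout == requested or output_layout in set(input_layouts)
```

Under this reading, a kernel fed only primary-layout inputs never emits an alternative layout unless the caller asks for it, which is the opt-in behaviour described above.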


On 13/07/2023 20:28, Benjamin Kietzman wrote:

Canonical alternative layouts sounds like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate if and which alternative layout is being
used
   - or do we add AltSchema to wrap a schema and indicate which of its
fields have alternate layouts
   - ...
- do we extend RecordBatch to support canonical alternative layouts
   - or do we add AltRecordBatch for that purpose (which iiuc would
complicate dictionary batches containing any column of an alternate layout)
   - ...

To add context, one of the reasons we could not just use extension types
for Utf8View is that these are required to be backed by a known layout, and
no primary layout in the format has a variable number of buffers. In order
to accommodate Utf8View as an alternative layout, the minimal change which
I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin  wrote:


Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context,
which can be found at [1].

I think alternative layouts sounds like a nice first approach to allowing
new layouts that can be supported lazily (implemented when it is
beneficial) by various implementations of the Arrow Columnar Format. But, I
do think that it's just a (practical) formalization of saying what layouts
are required and which ones are optional.

From the making of the diagram, I also decided that the discussion isn't
limited to performance, since there are several reasons new physical
layouts may be proposed (or, at least, there are many aspects of
performance). Even if it's not "canonical alternative layouts," I think it
is important that there be some process for developers that use Arrow to
propose extensions to the columnar format without having to prove out the
benefits for libraries that use a different tech 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Benjamin Kietzman
Canonical alternative layouts sounds like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate if and which alternative layout is being
used
  - or do we add AltSchema to wrap a schema and indicate which of its
fields have alternate layouts
  - ...
- do we extend RecordBatch to support canonical alternative layouts
  - or do we add AltRecordBatch for that purpose (which iiuc would
complicate dictionary batches containing any column of an alternate layout)
  - ...

To add context, one of the reasons we could not just use extension types
for Utf8View is that these are required to be backed by a known layout, and
no primary layout in the format has a variable number of buffers. In order
to accommodate Utf8View as an alternative layout, the minimal change which
I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example
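
To make the "convert Utf8View to Utf8" default concrete, here is a rough Python sketch. It is not the actual IPC or in-memory encoding: the ViewColumn class models each view as a plain (buffer_index, offset, length) tuple, and every name here (Field, ViewColumn, to_primary_utf8) is a hypothetical stand-in. It only shows a reader walking several character buffers and producing the primary offsets-plus-data shape.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Field:
    name: str
    type: str
    alternative_layout: str = ""  # empty string means the primary layout


@dataclass
class ViewColumn:
    # Simplified stand-in for a "view" layout: each value points into one of
    # several character buffers as (buffer_index, offset, length).
    views: List[Tuple[int, int, int]]
    char_buffers: List[bytes]


def to_primary_utf8(col: ViewColumn) -> Tuple[List[int], bytes]:
    """Convert to the primary Utf8 shape: offsets plus one data buffer."""
    offsets = [0]
    data = bytearray()
    for buf_idx, start, length in col.views:
        data += col.char_buffers[buf_idx][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)


field = Field("s", "utf8", alternative_layout="view")
col = ViewColumn(views=[(0, 0, 5), (1, 0, 5), (0, 5, 1)],
                 char_buffers=[b"hello!", b"world"])
offsets, data = to_primary_utf8(col)
```

An implementation that only does this conversion on read never needs an in-memory structure with a variable number of buffers.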

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin  wrote:

> Thanks Neal and Weston!
>
> I prepared a diagram to solidify my own understanding of the context,
> which can be found at [1].
>
> I think alternative layouts sounds like a nice first approach to allowing
> new layouts that can be supported lazily (implemented when it is
> beneficial) by various implementations of the Arrow Columnar Format. But, I
> do think that it's just a (practical) formalization of saying what layouts
> are required and which ones are optional.
>
> From the making of the diagram, I also decided that the discussion isn't
> limited to performance, since there are several reasons new physical
> layouts may be proposed (or, at least, there are many aspects of
> performance). Even if it's not "canonical alternative layouts," I think it
> is important that there be some process for developers that use Arrow to
> propose extensions to the columnar format without having to prove out the
> benefits for libraries that use a different tech stack (e.g. rust vs C++ vs
> go).
>
>
> [1]:
> https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing
>
>
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> --- Original Message ---
> On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
>  wrote:
>
>
> > I am in favor of this proposal. IMO the Arrow project is the right place to
> > standardize both the interoperability and operability of columnar data
> > layouts. Data engines are a core component of the Arrow ecosystem and the
> > project should be able to grow with these data engines as they converge on
> > new layouts. Since columnar data is ubiquitous in analytical workloads, we
> > are seeing a natural progression into optimizing those workloads. This
> > includes new lossless compression schemes for columnar data that allows
> > engines to operate directly on the compressed data (e.g. RLE). If we can't
> > reliably support the growing needs of the broader data engine ecosystem in
> > a timely manner, then I also fear Arrow might lose relevancy over time.
> >
>
> > On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:
> >
>
> > > Thank you Weston for proposing this solution and Neal for describing
> > > its context and implications. I agree with the other replies here—this
> > > seems like an elegant solution to a growing need that could, if left
> > > unaddressed, increase the fragmentation of the ecosystem and reduce
> > > the centrality of the Arrow format.
> > >
>
> > > Greater diversity of layouts is happening. Whether it happens inside
> > > of Arrow or outside of Arrow is up to us. I think we all would like to
> > > see it happen inside of Arrow. This proposal allows for that, while
> > > striking a balance as Raphael describes.
> > >
>
> > > However I think there is still some ambiguity about exactly how an
> > > Arrow implementation that is consuming/producing data would negotiate
> > > with an Arrow implementation or other component that is
> > > producing/consuming data to determine whether an alternative 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Aldrin
Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context, which can 
be found at [1].

I think alternative layouts sounds like a nice first approach to allowing new 
layouts that can be supported lazily (implemented when it is beneficial) by 
various implementations of the Arrow Columnar Format. But, I do think that it's 
just a (practical) formalization of saying what layouts are required and which 
ones are optional.

From the making of the diagram, I also decided that the discussion isn't
limited to performance, since there are several reasons new physical layouts
may be proposed (or, at least, there are many aspects of performance). Even if
it's not "canonical alternative layouts," I think it is important that there
be some process for developers that use Arrow to propose extensions to the
columnar format without having to prove out the benefits for libraries that
use a different tech stack (e.g. rust vs C++ vs go).


[1]: 
https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing




# --

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


--- Original Message ---
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin 
 wrote:


> I am in favor of this proposal. IMO the Arrow project is the right place to
> standardize both the interoperability and operability of columnar data
> layouts. Data engines are a core component of the Arrow ecosystem and the
> project should be able to grow with these data engines as they converge on
> new layouts. Since columnar data is ubiquitous in analytical workloads, we
> are seeing a natural progression into optimizing those workloads. This
> includes new lossless compression schemes for columnar data that allows
> engines to operate directly on the compressed data (e.g. RLE). If we can't
> reliably support the growing needs of the broader data engine ecosystem in
> a timely manner, then I also fear Arrow might lose relevancy over time.
> 

> On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:
> 

> > Thank you Weston for proposing this solution and Neal for describing
> > its context and implications. I agree with the other replies here—this
> > seems like an elegant solution to a growing need that could, if left
> > unaddressed, increase the fragmentation of the ecosystem and reduce
> > the centrality of the Arrow format.
> > 

> > Greater diversity of layouts is happening. Whether it happens inside
> > of Arrow or outside of Arrow is up to us. I think we all would like to
> > see it happen inside of Arrow. This proposal allows for that, while
> > striking a balance as Raphael describes.
> > 

> > However I think there is still some ambiguity about exactly how an
> > Arrow implementation that is consuming/producing data would negotiate
> > with an Arrow implementation or other component that is
> > producing/consuming data to determine whether an alternative layout is
> > supported. This was discussed briefly in [5] but I am interested to
> > see how this negotiation would be implemented in practice in the C
> > data interface, IPC, Flight, etc.
> > 

> > Ian
> > 

> > [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
> > 

> > On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> > r.taylordav...@googlemail.com.invalid wrote:
> > 

> > > I like this proposal, I think it strikes a pragmatic balance between
> > > preserving interoperability whilst still allowing new ideas to be
> > > incorporated into the standard. Thank you for writing this up.
> > > 

> > > On 13/07/2023 10:22, Matt Topol wrote:
> > > 

> > > > I don't have much to add but I do want to second Jacob's comments. I agree
> > > > that this is a good way to avoid the fragmentation while keeping Arrow
> > > > relevant, and likely something we need to do so that we can ensure Arrow
> > > > remains the way to do this data integration and interoperability.
> > > > 

> > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > > ja...@voltrondata.com.invalid wrote:
> > > > 

> > > > > Hello Everyone,
> > > > > 

> > > > > Thanks for this comprehensive but concise write up Neal! I think this
> > > > > proposal is a good way to avoid both fragmentation of the arrow ecosystem
> > > > > as well as its obsolescence. In my opinion of these two problems the
> > > > > obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> > > > > already (close to) being relegated to the sidelines in eco-system defining
> > > > > projects.
> > > > > 

> > > > > Jacob
> > > > > 

> > > > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > > > > neal.p.richard...@gmail.com> wrote:
> > > > > 

> > > > > > Hi all,
> > > > > > > As was previously raised in [1] and surfaced again in [2], there is a
> > > > > > proposal for representing alternative layouts. The 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Dane Pitkin
I am in favor of this proposal. IMO the Arrow project is the right place to
standardize both the interoperability *and operability* of columnar data
layouts. Data engines are a core component of the Arrow ecosystem and the
project should be able to grow with these data engines as they converge on
new layouts. Since columnar data is ubiquitous in analytical workloads, we
are seeing a natural progression into optimizing those workloads. This
includes new lossless compression schemes for columnar data that allows
engines to operate directly on the compressed data (e.g. RLE). If we can't
reliably support the growing needs of the broader data engine ecosystem in
a timely manner, then I also fear Arrow might lose relevancy over time.

On Thu, Jul 13, 2023 at 11:59 AM Ian Cook  wrote:

> Thank you Weston for proposing this solution and Neal for describing
> its context and implications. I agree with the other replies here—this
> seems like an elegant solution to a growing need that could, if left
> unaddressed, increase the fragmentation of the ecosystem and reduce
> the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside
> of Arrow or outside of Arrow is up to us. I think we all would like to
> see it happen inside of Arrow. This proposal allows for that, while
> striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an
> Arrow implementation that is consuming/producing data would negotiate
> with an Arrow implementation or other component that is
> producing/consuming data to determine whether an alternative layout is
> supported. This was discussed briefly in [5] but I am interested to
> see how this negotiation would be implemented in practice in the C
> data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
>  wrote:
> >
> > I like this proposal, I think it strikes a pragmatic balance between
> > preserving interoperability whilst still allowing new ideas to be
> > incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > I don't have much to add but I do want to second Jacob's comments. I agree
> > > that this is a good way to avoid the fragmentation while keeping Arrow
> > > relevant, and likely something we need to do so that we can ensure Arrow
> > > remains the way to do this data integration and interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > >  wrote:
> > >
> > >> Hello Everyone,
> > >>
> > >> Thanks for this comprehensive but concise write up Neal! I think this
> > >> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> > >> as well as its obsolescence. In my opinion of these two problems the
> > >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> > >> already (close to) being relegated to the sidelines in eco-system defining
> > >> projects.
> > >>
> > >> Jacob
> > >>
> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > >> neal.p.richard...@gmail.com> wrote:
> > >>
> > >>> Hi all,
> > >>> As was previously raised in [1] and surfaced again in [2], there is a
> > >>> proposal for representing alternative layouts. The intent, as I understand
> > >>> it, is to be able to support memory layouts that some (but perhaps not all)
> > >>> applications of Arrow find valuable, so that these nearly Arrow systems can
> > >>> be fully Arrow-native.
> > >>>
> > >>> I wanted to start a more focused discussion on it because I think it's
> > >>> worth being considered on its own merits, but I also think this gets to the
> > >>> core of what the Arrow project is and should be, and I don't want us to
> > >>> lose sight of that.
> > >>> To restate the proposal from [1]:
> > >>>
> > >>>   * There are one or more primary layouts
> > >>> * Existing layouts are automatically considered primary layouts,
> > >>> even if they
> > >>> wouldn't have been primary layouts initially (e.g. large list)
> > >>>   * A new layout, if it is semantically equivalent to another, is
> > >>> considered an
> > >>> alternative layout
> > >>>   * An alternative layout still has the same requirements for
> adoption
> > >>> (two implementations
> > >>> and a vote)
> > >>> * An implementation should not feel pressured to rush and
> implement
> > >> the
> > >>> new
> > >>> layout. It would be good if they contribute in the discussion and
> > >> consider
> > >>> the layout and vote if they feel it would be an acceptable design.
> > >>>   * We can define and vote and approve as many canonical alternative
> > >>> layouts as
> > >>> we want:
> > >>> * A canonical alternative layout should, at a minimum, have some
> > >>> reasonable
> > >>> justification, such as improved performance for algorithm X
> > >>>   * 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Ian Cook
Thank you Weston for proposing this solution and Neal for describing
its context and implications. I agree with the other replies here—this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.

Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.

However I think there is still some ambiguity about exactly how an
Arrow implementation that is consuming/producing data would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported. This was discussed briefly in [5] but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.

Ian

[5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2


On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
 wrote:
>
> I like this proposal, I think it strikes a pragmatic balance between
> preserving interoperability whilst still allowing new ideas to be
> incorporated into the standard. Thank you for writing this up.
>
> On 13/07/2023 10:22, Matt Topol wrote:
> > I don't have much to add but I do want to second Jacob's comments. I agree
> > that this is a good way to avoid the fragmentation while keeping Arrow
> > relevant, and likely something we need to do so that we can ensure Arrow
> > remains the way to do this data integration and interoperability.
> >
> > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> >  wrote:
> >
> >> Hello Everyone,
> >>
> >> Thanks for this comprehensive but concise write up Neal! I think this
> >> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> >> as well as its obsolescence. In my opinion of these two problems the
> >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> >> already (close to) being relegated to the sidelines in eco-system defining
> >> projects.
> >>
> >> Jacob
> >>
> >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> >> neal.p.richard...@gmail.com> wrote:
> >>
> >>> Hi all,
> >>> As was previously raised in [1] and surfaced again in [2], there is a
> >>> proposal for representing alternative layouts. The intent, as I understand
> >>> it, is to be able to support memory layouts that some (but perhaps not all)
> >>> applications of Arrow find valuable, so that these nearly Arrow systems can
> >>> be fully Arrow-native.
> >>>
> >>> I wanted to start a more focused discussion on it because I think it's
> >>> worth being considered on its own merits, but I also think this gets to the
> >>> core of what the Arrow project is and should be, and I don't want us to
> >>> lose sight of that.
> >>>
> >>> To restate the proposal from [1]:
> >>>
> >>>   * There are one or more primary layouts
> >>> * Existing layouts are automatically considered primary layouts,
> >>> even if they
> >>> wouldn't have been primary layouts initially (e.g. large list)
> >>>   * A new layout, if it is semantically equivalent to another, is
> >>> considered an
> >>> alternative layout
> >>>   * An alternative layout still has the same requirements for adoption
> >>> (two implementations
> >>> and a vote)
> >>> * An implementation should not feel pressured to rush and implement
> >> the
> >>> new
> >>> layout. It would be good if they contribute in the discussion and
> >> consider
> >>> the layout and vote if they feel it would be an acceptable design.
> >>>   * We can define and vote and approve as many canonical alternative
> >>> layouts as
> >>> we want:
> >>> * A canonical alternative layout should, at a minimum, have some
> >>> reasonable
> >>> justification, such as improved performance for algorithm X
> >>>   * Arrow implementations MUST support the primary layouts
> >>>   * An Arrow implementation MAY support a canonical alternative, however:
> >>> * An Arrow implementation MUST first support the primary layout
> >>> * An Arrow implementation MUST support conversion to/from the primary
> >>> and
> >>> canonical layout
> >>> * An Arrow implementation's APIs MUST only provide data in the
> >>> alternative layout if it is explicitly asked for (e.g. schema inference
> >>> should prefer the primary layout).
> >>>   * We can still vote for new primary layouts (e.g. promoting a
> >>> canonical alternative)
> >>> but, in these votes we don't only consider the value (e.g. performance)
> >> of
> >>> the layout but also the interoperability. In other words, a layout can
> >> only
> >>> become a primary layout if there is significant evidence that most
> >>> implementations
> >>> plan to adopt it.
> >>>
> >>>
> >>> To summarize some of the arguments against the proposal from the previous

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Raphael Taylor-Davies
I like this proposal, I think it strikes a pragmatic balance between 
preserving interoperability whilst still allowing new ideas to be 
incorporated into the standard. Thank you for writing this up.


On 13/07/2023 10:22, Matt Topol wrote:

I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
 wrote:


Hello Everyone,

Thanks for this comprehensive but concise write up Neal! I think this
proposal is a good way to avoid both fragmentation of the arrow ecosystem
as well as its obsolescence. In my opinion of these two problems the
obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
already (close to) being relegated to the sidelines in eco-system defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:


Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts,
      even if they wouldn't have been primary layouts initially (e.g.
      large list)
  * A new layout, if it is semantically equivalent to another, is
    considered an alternative layout
  * An alternative layout still has the same requirements for adoption
    (two implementations and a vote)
    * An implementation should not feel pressured to rush and implement
      the new layout. It would be good if they contribute in the
      discussion and consider the layout and vote if they feel it would
      be an acceptable design.
  * We can define and vote and approve as many canonical alternative
    layouts as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for
      algorithm X
  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative, however:
    * An Arrow implementation MUST first support the primary layout
    * An Arrow implementation MUST support conversion to/from the primary
      and canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
      alternative layout if it is explicitly asked for (e.g. schema
      inference should prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a
    canonical alternative) but, in these votes we don't only consider the
    value (e.g. performance) of the layout but also the interoperability.
    In other words, a layout can only become a primary layout if there is
    significant evidence that most implementations plan to adopt it.


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that because of the overhead of converting these not-yet-Arrow
types to the Arrow C ABI is high enough that they've considered abandoning
Arrow as their interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Matt Topol
I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
 wrote:

> Hello Everyone,
>
> Thanks for this comprehensive but concise write up Neal! I think this
> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> as well as its obsolescence. In my opinion of these two problems the
> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> already (close to) being relegated to the sidelines in eco-system defining
> projects.
>
> Jacob
>
> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
>
> > Hi all,
> > As was previously raised in [1] and surfaced again in [2], there is a
> > proposal for representing alternative layouts. The intent, as I understand
> > it, is to be able to support memory layouts that some (but perhaps not all)
> > applications of Arrow find valuable, so that these nearly Arrow systems can
> > be fully Arrow-native.
> >
> > I wanted to start a more focused discussion on it because I think it's
> > worth being considered on its own merits, but I also think this gets to the
> > core of what the Arrow project is and should be, and I don't want us to
> > lose sight of that.
> >
> > To restate the proposal from [1]:
> >
> >  * There are one or more primary layouts
> >* Existing layouts are automatically considered primary layouts,
> > even if they
> > wouldn't have been primary layouts initially (e.g. large list)
> >  * A new layout, if it is semantically equivalent to another, is
> > considered an
> > alternative layout
> >  * An alternative layout still has the same requirements for adoption
> > (two implementations
> > and a vote)
> >* An implementation should not feel pressured to rush and implement
> the
> > new
> > layout. It would be good if they contribute in the discussion and
> consider
> > the layout and vote if they feel it would be an acceptable design.
> >  * We can define and vote and approve as many canonical alternative
> > layouts as
> > we want:
> >* A canonical alternative layout should, at a minimum, have some
> > reasonable
> > justification, such as improved performance for algorithm X
> >  * Arrow implementations MUST support the primary layouts
> >  * An Arrow implementation MAY support a canonical alternative, however:
> >* An Arrow implementation MUST first support the primary layout
> >* An Arrow implementation MUST support conversion to/from the primary
> > and
> > canonical layout
> >* An Arrow implementation's APIs MUST only provide data in the
> > alternative layout if it is explicitly asked for (e.g. schema inference
> > should prefer the primary layout).
> >  * We can still vote for new primary layouts (e.g. promoting a
> > canonical alternative)
> > but, in these votes we don't only consider the value (e.g. performance)
> of
> > the layout but also the interoperability. In other words, a layout can
> only
> > become a primary layout if there is significant evidence that most
> > implementations
> > plan to adopt it.
> >
> >
> > To summarize some of the arguments against the proposal from the previous
> > threads, there are concerns about increasing the complexity of the Arrow
> > specification and the cost/burden of updating all of the Arrow
> > specifications to support them.
> >
> > Where these discussions, both about several proposed new types and this
> > layout proposal, get to the core of Arrow is well expressed in the
> comments
> > on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
> > matters to people more, interoperability or best-in-class performance?"
> And
> > Pedro notes that because of the overhead of converting these
> not-yet-Arrow
> > types to the Arrow C ABI is high enough that they've considered
> abandoning
> > Arrow as their interchange format. So: on the one hand, we're kinda
> > choosing which quality we're optimizing for, but on the other,
> > interoperability and performance are dependent on each other.
> >
> > What I see that we're trying to do here is find a way to expand the Arrow
> > specification just enough so that Arrow becomes or remains the in-memory
> > standard everywhere, but not so much that it creates too much complexity
> or
> > burden to implement. Expand too much and you get a fragmented ecosystem
> > where everyone is writing subsets of the Arrow standard and so nothing is
> > fully compatible and the whole premise is undermined. But expand too
> little
> > and projects will abandon the standard and we've also failed.
> >
> > I don't have a tidy answer, but I wanted to acknowledge the bigger
> issues,
> > and see if this helps us reason about the various proposals on the
> table. I
> > wonder if the 

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-12 Thread Jacob Wujciak-Jens
Hello Everyone,

Thanks for this comprehensive but concise write-up, Neal! I think this
proposal is a good way to avoid both fragmentation of the Arrow ecosystem
and its obsolescence. In my opinion, obsolescence is the bigger of these
two problems because, as mentioned in the proposal, Arrow is already
(close to) being relegated to the sidelines in ecosystem-defining
projects.

Jacob


[DISCUSS] Canonical alternative layout proposal

2023-07-12 Thread Neal Richardson
Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

 * There are one or more primary layouts
   * Existing layouts are automatically considered primary layouts, even if
     they wouldn't have been primary layouts initially (e.g. large list)
 * A new layout, if it is semantically equivalent to another, is considered
   an alternative layout
 * An alternative layout still has the same requirements for adoption (two
   implementations and a vote)
   * An implementation should not feel pressured to rush and implement the
     new layout. It would be good if they contribute to the discussion,
     consider the layout, and vote if they feel it would be an acceptable
     design.
 * We can define, vote on, and approve as many canonical alternative
   layouts as we want:
   * A canonical alternative layout should, at a minimum, have some
     reasonable justification, such as improved performance for algorithm X
 * Arrow implementations MUST support the primary layouts
 * An Arrow implementation MAY support a canonical alternative, however:
   * An Arrow implementation MUST first support the primary layout
   * An Arrow implementation MUST support conversion between the primary
     layout and the canonical alternative layout
   * An Arrow implementation's APIs MUST only provide data in the
     alternative layout if it is explicitly asked for (e.g. schema inference
     should prefer the primary layout).
 * We can still vote for new primary layouts (e.g. promoting a canonical
   alternative), but in these votes we consider not only the value (e.g.
   performance) of the layout but also the interoperability. In other
   words, a layout can only become a primary layout if there is significant
   evidence that most implementations plan to adopt it.
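
As a rough illustration of the MUST/MAY rules above, here is a minimal
Python sketch. It is not taken from [1] or from any Arrow implementation.
The class and function names are hypothetical, and the two layouts (an
offsets-only list as the "primary" and an offsets-plus-sizes list as the
"alternative") are assumptions chosen only to make the rules concrete.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OffsetsList:
    """Assumed primary layout: n + 1 offsets into a flat values buffer."""
    offsets: List[int]
    values: List[int]


@dataclass
class OffsetsSizesList:
    """Assumed alternative layout: one (offset, size) pair per element."""
    offsets: List[int]
    sizes: List[int]
    values: List[int]


def to_alternative(arr: OffsetsList) -> OffsetsSizesList:
    # Each element starts at offsets[i] and spans offsets[i + 1] - offsets[i].
    starts = arr.offsets[:-1]
    sizes = [end - start for start, end in zip(arr.offsets, arr.offsets[1:])]
    return OffsetsSizesList(starts, sizes, list(arr.values))


def to_primary(arr: OffsetsSizesList) -> OffsetsList:
    # Elements may be stored out of order or overlap, so rebuild the
    # values buffer contiguously in logical order.
    offsets = [0]
    values: List[int] = []
    for start, size in zip(arr.offsets, arr.sizes):
        values.extend(arr.values[start:start + size])
        offsets.append(len(values))
    return OffsetsList(offsets, values)


def export(
    arr: Union[OffsetsList, OffsetsSizesList],
    accept_alternative: bool = False,
) -> Union[OffsetsList, OffsetsSizesList]:
    # Hypothetical API boundary: hand out the alternative layout only
    # when the consumer explicitly asks for it; otherwise convert.
    if isinstance(arr, OffsetsSizesList) and not accept_alternative:
        return to_primary(arr)
    return arr
```

A consumer that only knows the primary layout calls export(arr) and never
sees the alternative. A consumer that understands both passes
accept_alternative=True and skips the conversion.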


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that the overhead of converting these not-yet-Arrow
types to the Arrow C ABI is high enough that they've considered abandoning
Arrow as their interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see us trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to the specification, but less than there would be if three
new types were added, and still meets the needs of projects like DuckDB,
Velox, and Gluten and gets them fully Arrow-native.

Neal


[1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
[2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
[3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
[4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h