Re: [DISCUSS] Canonical alternative layout proposal

Felipe Oliveira Carvalho Sat, 05 Aug 2023 08:23:07 -0700

> I think this is similar to the proposal with the exception that your
> suggestion would require amending existing types that happen to be
> alternatives to each other.


I want to avoid electing one canonical layout for a kind (AKA "logical
type"), and the existence of "alternative layouts" implies the existence of
a canonical layout.

In my suggestion, a layout being canonical is not a property of the
specification, but a choice of the system implementing the specification.

One concrete example of this is how Polars elected LargeList as the
canonical type for the List logical type [1] while Velox settled on a list
array representation based on 32-bit offsets and sizes.
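
A minimal sketch of how one logical list value can live in different
physical layouts. This is plain Python, not the Polars or Velox code; the
helper names (decode_offsets, decode_offsets_and_sizes) are hypothetical and
the buffers are toy data.

```python
from array import array

# Toy child values buffer shared by every layout below.
values = [1, 2, 3, 4, 5]

# Offsets-only layout: n + 1 offsets describe n lists.
# "i" is a C int (32-bit on common platforms), "q" is a 64-bit integer.
list_offsets = array("i", [0, 2, 2, 5])        # List-style offsets
large_list_offsets = array("q", [0, 2, 2, 5])  # LargeList-style offsets

# Offsets-plus-sizes layout: one (offset, size) pair per list.
view_offsets = array("i", [0, 2, 2])
view_sizes = array("i", [2, 0, 3])


def decode_offsets(offsets, values):
    """Rebuild the logical lists from consecutive offsets."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_offsets_and_sizes(offsets, sizes, values):
    """Rebuild the logical lists from (offset, size) pairs."""
    return [values[o:o + s] for o, s in zip(offsets, sizes)]


logical = [[1, 2], [], [3, 4, 5]]
assert decode_offsets(list_offsets, values) == logical
assert decode_offsets(large_list_offsets, values) == logical
assert decode_offsets_and_sizes(view_offsets, view_sizes, values) == logical
```

All three buffers decode to the same logical value; which one a system
treats as its default is a choice of that system.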

----

The specification can define the rules of communication upfront, achieving
two goals:

1) implementations can add new layouts and immediately inter-operate better
with other implementations
2) implementations can add new behaviors without concerning themselves with
new layouts other implementations are adding

This is not a full solution to the "expression problem" because we are
still left with some conversions at runtime, but as each implementation
gets closer to understanding all layouts, the conversions disappear.

If we settle on canonical layouts, communication is forced to always
convert to the canonical layout when passing data around, penalizing
layouts that are better for computation.
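
A rough sketch of the communication rule described above, under the
assumption that each implementation advertises the layouts it understands
and the layout it prefers per kind. Every name here (KINDS, Implementation,
kind_of, choose_wire_layout) is hypothetical; none of this is an Arrow API.

```python
from dataclasses import dataclass

# Hypothetical registry of kinds; the layout names are taken from the thread.
KINDS = {
    "string": {"string", "large_string", "string_view", "ree<string>"},
    "list": {"list", "large_list", "list_view", "large_list_view"},
}


@dataclass
class Implementation:
    name: str
    understood: set   # layouts this implementation can consume
    preferred: dict   # kind -> layout this system chose as its own default


def kind_of(layout):
    """Find the kind a layout belongs to."""
    for kind, layouts in KINDS.items():
        if layout in layouts:
            return kind
    raise ValueError(f"unknown layout: {layout}")


def choose_wire_layout(sender_layout, receiver):
    """Return (layout_to_send, conversion_needed) for one array."""
    if sender_layout in receiver.understood:
        # The receiver already understands the sender's layout: no conversion.
        return sender_layout, False
    # Otherwise convert within the same kind to what the receiver prefers.
    return receiver.preferred[kind_of(sender_layout)], True


old = Implementation("old", {"large_string"}, {"string": "large_string"})
new = Implementation(
    "new", {"large_string", "string_view"}, {"string": "string_view"}
)

assert choose_wire_layout("string_view", old) == ("large_string", True)
assert choose_wire_layout("string_view", new) == ("string_view", False)
```

The conversion only happens for the receiver that has not caught up yet; once
it understands string_view, the same call reports that no conversion is
needed.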

[1] https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L247

* in terms of speed and memory consumption, but not binary size

On Thu, Aug 3, 2023 at 12:28 AM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I would welcome a draft PR showcasing the changes necessary in the IPC
> > format definition, and in the C Data Interface specification (no need to
> > actually implement them for now :-)).
>
> I've proposed something at [1].
>
> > One sketch of an idea: define sets of types that we can call “kinds”**
> > (e.g. “string kind” = {string, string view, large string, ree<string>…},
> > “list kind” = {list, large_list, list_view, large_list_view…}).
>
> I think this is similar to the proposal with the exception that your
> suggestion would require amending existing types that happen to be
> alternatives to each other.  I'm not opposed to it but I think it's
> compatible and we don't necessarily need all of the complexity just yet
> (feel free to correct me if I'm wrong).  I don't think we need to introduce
> the concept of "kind".  We already have a concept of "logical type" in the
> spec.  I think what you are stating is that a single logical type may have
> multiple physical layouts.  I agree.  E.g. variable size list<32>, variable
> size list<64>, and REE are the physical layouts that, combined with the
> logical type "string", give you "string", "large string", and "ree<string>".
>
> [1] https://github.com/apache/arrow/pull/37000
>
> On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho <felipe...@gmail.com>
> wrote:
>
> > A major difficulty in making the Arrow array types open for extension [1]
> > is that as soon as we define (a) a universal representation* or (b) an
> > abstract interface, we close the door on vectorization. (a) prevents
> > having new vectorization-friendly formats and (b) limits the
> > implementation of new vectorized operations. This is an instance of the
> > “expression problem” [2].
> >
> > The way Arrow currently “solves” the data abstraction problem is by having
> > no data abstraction: every operation takes a type and should provide
> > specializations for every type. Sometimes it’s possible to re-use the same
> > kernel for different types, but the general approach is that we specialize
> > (in the case of C++, we sometimes can specialize by just instantiating a
> > template, but that’s still a specialization).
> >
> > Given these constraints, what could be done?
> >
> > One sketch of an idea: define sets of types that we can call “kinds”**
> > (e.g. “string kind” = {string, string view, large string, ree<string>…},
> > “list kind” = {list, large_list, list_view, large_list_view…}).
> >
> > Then when different implementations have to communicate or interoperate,
> > they only have to be up to date on the list of Arrow kinds. Before data
> > is moved, a conversion step between types within the same kind is
> > performed if required to make that communication possible.
> >
> > Example: a system that has a string_view Array and needs to send that array
> > to a system that only understands large_string instances of the string kind
> > MUST perform a conversion. This means that as long as all Arrow
> > implementations understand one established type in each of the kinds, they
> > can communicate.
> >
> > This imposes a reasonable requirement on new types: when introduced, they
> > should come with conversions to the previously specified types on that
> > kind.
> >
> > Any thoughts?
> >
> > —
> > Felipe
> > Voltron Data
> >
> >
> > [1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
> > [2] https://en.wikipedia.org/wiki/Expression_problem
> >
> > * “an array is a list of buffers and child arrays” doesn’t qualify as
> > “universal representation” because it doesn’t make a commitment on what all
> > the buffers and child arrays mean universally
> >
> > ** if kind is already taken to mean scalar/array, we can use the term
> > “sort”
> >
> > On Mon, 31 Jul 2023 at 04:39 Gang Wu <ust...@gmail.com> wrote:
> >
> > > I am also in favor of the idea of an alternative layout. IIRC, a new
> > > alternative layout still goes through a process of standardization,
> > > though it is the choice of each implementation to decide whether to
> > > support it now or later. I'd like to ask if we can provide the
> > > flexibility for implementations or downstream projects to actually
> > > implement a new alternative layout by means of a pluggable interface
> > > before starting the standardization process. This is similar to
> > > promoting a popular extension type implemented by many users to a
> > > canonical extension type. I know this is more complicated, as an
> > > extension type simply reuses an existing layout but an alternative
> > > layout usually means a brand new one. For example, if two projects
> > > speak Arrow and now they want to share a new layout, they can simply
> > > implement a pluggable alternative layout before Arrow adopts it. This
> > > can unblock projects to evolve and help Arrow not to be fragmented.
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to reason about the advantages and drawbacks of this
> > > > proposal, but it seems to me that it lacks definition.
> > > >
> > > > I would welcome a draft PR showcasing the changes necessary in the IPC
> > > > format definition, and in the C Data Interface specification (no need
> > > > to actually implement them for now :-)).
> > > >
> > > >
> > > > As it is, it seems that this proposal would allow us to switch from:
> > > >
> > > > """We'd like to add a more efficient physical data representation, so
> > > > we'll introduce a new Arrow data type. Implementations may or may not
> > > > support it, but we will progressively try to bring reference
> > > > implementations to parity.""" (1)
> > > >
> > > > to:
> > > >
> > > > """We'd like to add a more efficient physical data representation, so
> > > > we'll introduce a new alternative layout for an existing Arrow data
> > > > type. Implementations may or may not support it, but we will
> > > > progressively try to bring reference implementations to parity.""" (2)
> > > >
> > > > The expected advantage of (2) over (1) seems to be mainly a difference
> > > > in how new format features are communicated. There are mainline
> > > > features, and there are experimental / provisional features.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > > On 13/07/2023 at 00:01, Neal Richardson wrote:
> > > > > Hi all,
> > > > > > As was previously raised in [1] and surfaced again in [2], there is
> > > > > > a proposal for representing alternative layouts. The intent, as I
> > > > > > understand it, is to be able to support memory layouts that some
> > > > > > (but perhaps not all) applications of Arrow find valuable, so that
> > > > > > these nearly Arrow systems can be fully Arrow-native.
> > > > >
> > > > > > I wanted to start a more focused discussion on it because I think
> > > > > > it's worth being considered on its own merits, but I also think
> > > > > > this gets to the core of what the Arrow project is and should be,
> > > > > > and I don't want us to lose sight of that.
> > > > >
> > > > > To restate the proposal from [1]:
> > > > >
> > > > > >   * There are one or more primary layouts
> > > > > >     * Existing layouts are automatically considered primary
> > > > > >       layouts, even if they wouldn't have been primary layouts
> > > > > >       initially (e.g. large list)
> > > > > >   * A new layout, if it is semantically equivalent to another, is
> > > > > >     considered an alternative layout
> > > > > >   * An alternative layout still has the same requirements for
> > > > > >     adoption (two implementations and a vote)
> > > > > >     * An implementation should not feel pressured to rush and
> > > > > >       implement the new layout. It would be good if they contribute
> > > > > >       in the discussion and consider the layout and vote if they
> > > > > >       feel it would be an acceptable design.
> > > > > >   * We can define and vote and approve as many canonical
> > > > > >     alternative layouts as we want:
> > > > > >     * A canonical alternative layout should, at a minimum, have
> > > > > >       some reasonable justification, such as improved performance
> > > > > >       for algorithm X
> > > > > >   * Arrow implementations MUST support the primary layouts
> > > > > >   * An Arrow implementation MAY support a canonical alternative,
> > > > > >     however:
> > > > > >     * An Arrow implementation MUST first support the primary layout
> > > > > >     * An Arrow implementation MUST support conversion to/from the
> > > > > >       primary and canonical layout
> > > > > >     * An Arrow implementation's APIs MUST only provide data in the
> > > > > >       alternative layout if it is explicitly asked for (e.g. schema
> > > > > >       inference should prefer the primary layout).
> > > > > >   * We can still vote for new primary layouts (e.g. promoting a
> > > > > >     canonical alternative) but, in these votes we don't only
> > > > > >     consider the value (e.g. performance) of the layout but also
> > > > > >     the interoperability. In other words, a layout can only become
> > > > > >     a primary layout if there is significant evidence that most
> > > > > >     implementations plan to adopt it.
> > > > >
> > > > >
> > > > > > To summarize some of the arguments against the proposal from the
> > > > > > previous threads, there are concerns about increasing the
> > > > > > complexity of the Arrow specification and the cost/burden of
> > > > > > updating all of the Arrow specifications to support them.
> > > > >
> > > > > > Where these discussions, both about several proposed new types and
> > > > > > this layout proposal, get to the core of Arrow is well expressed in
> > > > > > the comments on the previous thread by Raphael [3] and Pedro [4].
> > > > > > Raphael asks: "what matters to people more, interoperability or
> > > > > > best-in-class performance?" And Pedro notes that the overhead of
> > > > > > converting these not-yet-Arrow types to the Arrow C ABI is high
> > > > > > enough that they've considered abandoning Arrow as their
> > > > > > interchange format. So: on the one hand, we're kinda choosing which
> > > > > > quality we're optimizing for, but on the other, interoperability
> > > > > > and performance are dependent on each other.
> > > > >
> > > > > > What I see that we're trying to do here is find a way to expand the
> > > > > > Arrow specification just enough so that Arrow becomes or remains
> > > > > > the in-memory standard everywhere, but not so much that it creates
> > > > > > too much complexity or burden to implement. Expand too much and you
> > > > > > get a fragmented ecosystem where everyone is writing subsets of the
> > > > > > Arrow standard and so nothing is fully compatible and the whole
> > > > > > premise is undermined. But expand too little and projects will
> > > > > > abandon the standard and we've also failed.
> > > > >
> > > > > > I don't have a tidy answer, but I wanted to acknowledge the bigger
> > > > > > issues, and see if this helps us reason about the various proposals
> > > > > > on the table. I wonder if the alternative layout proposal is the
> > > > > > happy medium that adds some complexity to the specification, but
> > > > > > less than there would be if three new types were added, and still
> > > > > > meets the needs of projects like DuckDB, Velox, and Gluten and gets
> > > > > > them fully Arrow native.
> > > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > [1]:
> > https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> > > > > [2]:
> > https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> > > > > [3]:
> > https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> > > > > [4]:
> > https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
> > > > >
> > > >
> > >
> >
