Re: [DISCUSS] Canonical alternative layout proposal
> I think this is similar to the proposal with the exception that your
> suggestion would require amending existing types that happen to be
> alternatives to each other.

I want to avoid electing one canonical layout for a kind (AKA "logical type"). And the existence of "alternative layouts" implies the existence of a canonical layout. In my suggestion, a layout being canonical is not a property of the specification, but a choice of the system implementing the specification. One concrete example of this is how Polars elected LargeList as the canonical type for the List logical type [1] while Velox settled on a list array representation based on 32-bit offsets and sizes.

The specification can define the rules of communication upfront, achieving two goals:

1) implementations can add new layouts and immediately inter-operate better with other implementations
2) implementations can add new behaviors without concerning themselves with new layouts other implementations are adding

This is not a full solution to the "expression problem" because we are still left with some conversions at runtime, but as each implementation gets closer to understanding all layouts, the conversions disappear.

If we settle on canonical layouts, communication is forced to always convert to the canonical layout when passing data around, penalizing layouts that are better for computation.

[1] https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/dtype.rs#L247

* in terms of speed, and memory consumption, but not binary size

On Thu, Aug 3, 2023 at 12:28 AM Weston Pace wrote:
> [...]
Re: [DISCUSS] Canonical alternative layout proposal
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).

I've proposed something at [1].

> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).

I think this is similar to the proposal with the exception that your suggestion would require amending existing types that happen to be alternatives to each other. I'm not opposed to it but I think it's compatible and we don't necessarily need all of the complexity just yet (feel free to correct me if I'm wrong). I don't think we need to introduce the concept of "kind". We already have a concept of "logical type" in the spec. I think what you are stating is that a single logical type may have multiple physical layouts. I agree. E.g. variable size list<32>, variable size list<64>, and REE are the physical layouts that, combined with the logical type "string", give you "string", "large string", and "ree".

[1] https://github.com/apache/arrow/pull/37000

On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho wrote:
> [...]
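To make the "one logical type, several physical layouts" reading above concrete, here is a minimal Python sketch. It only encodes the three examples given in the message; the table `CONCRETE_TYPES` and the helpers `concrete_type` and `layouts_for` are hypothetical names for illustration and are not part of any Arrow API.

```python
# Hypothetical table: (logical type, physical layout) -> concrete type name.
# Only the three examples from the message are encoded here.
CONCRETE_TYPES = {
    ("string", "variable size list<32>"): "string",
    ("string", "variable size list<64>"): "large string",
    ("string", "REE"): "ree",
}


def concrete_type(logical_type: str, layout: str) -> str:
    """Look up the concrete type for a logical type stored in a given layout."""
    try:
        return CONCRETE_TYPES[(logical_type, layout)]
    except KeyError:
        raise ValueError(
            f"no known type for {logical_type!r} in layout {layout!r}"
        ) from None


def layouts_for(logical_type: str) -> list:
    """List the physical layouts known for one logical type, in table order."""
    return [layout for (lt, layout) in CONCRETE_TYPES if lt == logical_type]
```

Under this reading, adding a new layout means adding a row for an existing logical type rather than introducing a new logical type.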
Re: [DISCUSS] Canonical alternative layout proposal
A major difficulty in making the Arrow array types open for extension [1] is that as soon as we define (a) a universal representation* or (b) an abstract interface, we close the door to vectorization. (a) prevents having new vectorization-friendly formats and (b) limits the implementation of new vectorized operations. This is an instance of the “expression problem” [2].

The way Arrow currently “solves” the data abstraction problem is by having no data abstraction: every operation takes a type and should provide specializations for every type. Sometimes it’s possible to re-use the same kernel for different types, but the general approach is that we specialize (in the case of C++, we can sometimes specialize by just instantiating a template, but that’s still a specialization).

Given these constraints, what could be done?

One sketch of an idea: define sets of types that we can call “kinds”** (e.g. “string kind” = {string, string view, large string, ree…}, “list kind” = {list, large_list, list_view, large_list_view…}).

Then when different implementations have to communicate or interoperate, they only have to be up to date on the list of Arrow Kinds, and before data is moved, a conversion step between types within the same kind is performed if required to make that communication possible.

Example: a system that has a string_view Array and needs to send that array to a system that only understands large_string instances of the string kind MUST perform a conversion. This means that as long as all Arrow implementations understand one established type on each of the kinds, they can communicate.

This imposes a reasonable requirement on new types: when introduced, they should come with conversions to the previously specified types on that kind.

Any thoughts?

--
Felipe
Voltron Data

[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
[2] https://en.wikipedia.org/wiki/Expression_problem

* “an array is a list of buffers and child arrays” doesn’t qualify as “universal representation” because it doesn’t make a commitment on what all the buffers and child arrays mean universally

** if kind is already taken to mean scalar/array, we can use the term “sort”

On Mon, 31 Jul 2023 at 04:39 Gang Wu wrote:
> [...]
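The conversion rule in the "kinds" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `KINDS` table copies the two example sets from the message (spelled with underscores), while `kind_of`, `type_to_send`, and the alphabetical tie-break are hypothetical and not part of any Arrow implementation.

```python
# Hypothetical "kinds": sets of types that are alternatives to each other.
KINDS = {
    "string": {"string", "string_view", "large_string", "ree"},
    "list": {"list", "large_list", "list_view", "large_list_view"},
}


def kind_of(type_name):
    """Return the kind that a type belongs to."""
    for kind, members in KINDS.items():
        if type_name in members:
            return kind
    raise ValueError(f"{type_name!r} does not belong to any known kind")


def type_to_send(sender_type, receiver_understands):
    """Pick the type the sender must emit for a given receiver.

    If the receiver already understands the sender's type, no conversion is
    needed. Otherwise the sender converts to a type of the same kind that the
    receiver understands (chosen alphabetically here, an arbitrary choice).
    """
    if sender_type in receiver_understands:
        return sender_type
    candidates = KINDS[kind_of(sender_type)] & set(receiver_understands)
    if not candidates:
        raise ValueError("receiver understands no type of this kind")
    return sorted(candidates)[0]
```

With these definitions, `type_to_send("string_view", {"large_string"})` returns `"large_string"`, which is the conversion the example in the message requires.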
Re: [DISCUSS] Canonical alternative layout proposal
I am also in favor of the idea of an alternative layout. IIRC, a new alternative layout still goes through a process of standardization, though it is the choice of each implementation to decide whether to support it now or later. I'd like to ask if we can provide the flexibility for implementations or downstream projects to actually implement a new alternative layout by means of a pluggable interface before starting the standardization process. This is similar to promoting a popular extension type implemented by many users to a canonical extension type. I know this is more complicated, as an extension type simply reuses an existing layout whereas an alternative layout usually means a brand new one. For example, if two projects speak Arrow and now they want to share a new layout, they can simply implement a pluggable alternative layout before Arrow adopts it. This can unblock projects so that they can evolve, and help keep Arrow from becoming fragmented.

Best,
Gang

On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou wrote:
> [...]
Re: [DISCUSS] Canonical alternative layout proposal
Hello,

I'm trying to reason about the advantages and drawbacks of this proposal, but it seems to me that it lacks definition.

I would welcome a draft PR showcasing the changes necessary in the IPC format definition, and in the C Data Interface specification (no need to actually implement them for now :-)).

As it is, it seems that this proposal would allow us to switch from:

"""We'd like to add a more efficient physical data representation, so we'll introduce a new Arrow data type. Implementations may or may not support it, but we will progressively try to bring reference implementations to parity.""" (1)

to:

"""We'd like to add a more efficient physical data representation, so we'll introduce a new alternative layout for an existing Arrow data type. Implementations may or may not support it, but we will progressively try to bring reference implementations to parity.""" (2)

The expected advantage of (2) over (1) seems to be mainly a difference in how new format features are communicated. There are mainline features, and there are experimental / provisional features.

Regards

Antoine.

On 13/07/2023 at 00:01, Neal Richardson wrote:
> Hi all,
> As was previously raised in [1] and surfaced again in [2], there is a proposal for representing alternative layouts. The intent, as I understand it, is to be able to support memory layouts that some (but perhaps not all) applications of Arrow find valuable, so that these nearly Arrow systems can be fully Arrow-native.
>
> I wanted to start a more focused discussion on it because I think it's worth being considered on its own merits, but I also think this gets to the core of what the Arrow project is and should be, and I don't want us to lose sight of that.
>
> To restate the proposal from [1]:
>
> * There are one or more primary layouts
>   * Existing layouts are automatically considered primary layouts, even if they wouldn't have been primary layouts initially (e.g. large list)
> * A new layout, if it is semantically equivalent to another, is considered an alternative layout
>   * An alternative layout still has the same requirements for adoption (two implementations and a vote)
>   * An implementation should not feel pressured to rush and implement the new layout. It would be good if they contribute in the discussion and consider the layout and vote if they feel it would be an acceptable design.
> * We can define and vote and approve as many canonical alternative layouts as we want:
>   * A canonical alternative layout should, at a minimum, have some reasonable justification, such as improved performance for algorithm X
> * Arrow implementations MUST support the primary layouts
> * An Arrow implementation MAY support a canonical alternative, however:
>   * An Arrow implementation MUST first support the primary layout
>   * An Arrow implementation MUST support conversion to/from the primary and canonical layout
>   * An Arrow implementation's APIs MUST only provide data in the alternative layout if it is explicitly asked for (e.g. schema inference should prefer the primary layout).
> * We can still vote for new primary layouts (e.g. promoting a canonical alternative) but, in these votes we don't only consider the value (e.g. performance) of the layout but also the interoperability. In other words, a layout can only become a primary layout if there is significant evidence that most implementations plan to adopt it.
>
> To summarize some of the arguments against the proposal from the previous threads, there are concerns about increasing the complexity of the Arrow specification and the cost/burden of updating all of the Arrow specifications to support them.
>
> Where these discussions, both about several proposed new types and this layout proposal, get to the core of Arrow is well expressed in the comments on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what matters to people more, interoperability or best-in-class performance?" And Pedro notes that the overhead of converting these not-yet-Arrow types to the Arrow C ABI is high enough that they've considered abandoning Arrow as their interchange format. So: on the one hand, we're kinda choosing which quality we're optimizing for, but on the other, interoperability and performance are dependent on each other.
>
> What I see that we're trying to do here is find a way to expand the Arrow specification just enough so that Arrow becomes or remains the in-memory standard everywhere, but not so much that it creates too much complexity or burden to implement. Expand too much and you get a fragmented ecosystem where everyone is writing subsets of the Arrow standard and so nothing is fully compatible and the whole premise is undermined. But expand too little and projects will abandon the standard and we've also failed.
>
> I don't have a tidy answer, but I wanted to acknowledge the bigger issues, and see if this helps us reason about the various proposals on the table. I wonder if the alter
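The MUST rules in the restated proposal can be read as a small conformance check. The Python sketch below is one possible reading, not anything defined by the Arrow specification: the `Implementation` record, its field names, and the `violations` helper are all hypothetical, and layouts are identified by plain strings for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Implementation:
    """Hypothetical summary of what one Arrow implementation supports."""
    # Logical types whose primary layout is supported.
    primary: set = field(default_factory=set)
    # (logical type, alternative layout) pairs that are supported.
    alternatives: set = field(default_factory=set)
    # Pairs that can be converted to and from the primary layout.
    converts: set = field(default_factory=set)


def violations(impl, logical_types):
    """Return the MUST rules from the restated proposal that `impl` breaks."""
    problems = []
    # Rule: implementations MUST support the primary layouts.
    for lt in sorted(logical_types):
        if lt not in impl.primary:
            problems.append(f"{lt}: primary layout not supported")
    for lt, alt in sorted(impl.alternatives):
        # Rule: the primary layout MUST be supported first.
        if lt not in impl.primary:
            problems.append(f"{lt}/{alt}: alternative supported without the primary layout")
        # Rule: conversion to/from the primary layout MUST be supported.
        if (lt, alt) not in impl.converts:
            problems.append(f"{lt}/{alt}: no conversion to/from the primary layout")
    return problems
```

An implementation that supports only primary layouts passes trivially, which matches the intent that nobody is pressured to adopt an alternative layout.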
Re: [DISCUSS] Canonical alternative layout proposal
Thank you Neal for writing this summary and everyone whose thoughts went into the discussions -- I think the proposal, as summarized, offers a great path forward by allowing the various Arrow communities to specialize when advantageous but remain compatible.

On Thu, Jul 13, 2023 at 11:59 AM Ian Cook wrote:
> Thank you Weston for proposing this solution and Neal for describing its context and implications. I agree with the other replies here: this seems like an elegant solution to a growing need that could, if left unaddressed, increase the fragmentation of the ecosystem and reduce the centrality of the Arrow format.
>
> Greater diversity of layouts is happening. Whether it happens inside of Arrow or outside of Arrow is up to us. I think we all would like to see it happen inside of Arrow. This proposal allows for that, while striking a balance as Raphael describes.
>
> However I think there is still some ambiguity about exactly how an Arrow implementation that is consuming/producing data would negotiate with an Arrow implementation or other component that is producing/consuming data to determine whether an alternative layout is supported. This was discussed briefly in [5] but I am interested to see how this negotiation would be implemented in practice in the C data interface, IPC, Flight, etc.
>
> Ian
>
> [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
>
> On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies wrote:
> > I like this proposal, I think it strikes a pragmatic balance between preserving interoperability whilst still allowing new ideas to be incorporated into the standard. Thank you for writing this up.
> >
> > On 13/07/2023 10:22, Matt Topol wrote:
> > > I don't have much to add but I do want to second Jacob's comments. I agree that this is a good way to avoid the fragmentation while keeping Arrow relevant, and likely something we need to do so that we can ensure Arrow remains the way to do this data integration and interoperability.
> > >
> > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens wrote:
> > > > Hello Everyone,
> > > >
> > > > Thanks for this comprehensive but concise write up Neal! I think this proposal is a good way to avoid both fragmentation of the arrow ecosystem as well as its obsolescence. In my opinion, of these two problems the obsolescence is the bigger issue, as (as mentioned in the proposal) arrow is already (close to) being relegated to the sidelines in eco-system defining projects.
> > > >
> > > > Jacob
> > > >
> > > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
> > > > > [...]
Re: [DISCUSS] Canonical alternative layout proposal
clarify what constitutes support for a canonical alternative layout I had envisaged, perhaps naively, that we would just add a new DataType containing a string layout name, perhaps DataType::Raw(String). This would have no restrictions on the number of buffers, children, etc... and would effectively just be an opaque ArrayData. As interpreting such an array would require the layout name, I think it warrants inclusion at a lower level than just Field metadata. This is in contrast to extension types, where this metadata is not strictly necessary to operate on the arrays. I haven't given this a huge amount of thought though, and so it is entirely possible this has been discounted for some reason, or has some peculiar edge cases. would negotiate with an Arrow implementation or other component that is producing/consuming data to determine whether an alternative layout is supported My interpretation of the proposal was that no such negotiation would take place, instead the primary layout would always be chosen except in the presence of an explicit contrary signal. This is not hugely dissimilar from how dictionaries or run-encoded arrays are currently handled, where they will only be returned if either present in the input or explicitly requested. Perhaps we might clarify this with something along the lines of? - An Arrow implementation's APIs MAY produce data in an alternative layout to match one or more of its inputs, or an embedded arrow schema This covers both the cases of writing data to files and sending data over FFI. It does, however, carry the implication that alternative layouts are viral. I therefore wonder if we generalize the wording to: - Arrow-native APIs MUST only produce data in an alternative layout if it is already present in one of its inputs, or explicitly requested This is to help ensure that users only end up with alternative layouts if they explicitly opt-in to this behaviour. 
What I think we want to avoid is systems producing alternative layouts by default, and this then leading to user confusion when data produced by one system is not interoperable with those of another.

On 13/07/2023 20:28, Benjamin Kietzman wrote:

Canonical alternative layouts sounds like a workable path forward. Perhaps understandably, my immediate thought is how I could rephrase Utf8View as a canonical alternative layout for Utf8. In light of that, I have a few questions to clarify what constitutes support for a canonical alternative layout. Specifically:

- do we extend Field to indicate if and which alternative layout is being used
- or do we add AltSchema to wrap a schema and indicate which of its fields have alternate layouts
- ...
- do we extend RecordBatch to support canonical alternative layouts
- or do we add AltRecordBatch for that purpose (which iiuc would complicate dictionary batches containing any column of an alternate layout)
- ...

To add context, one of the reasons we could not just use extension types for Utf8View is that these are required to be backed by a known layout, and no primary layout in the format has a variable number of buffers. In order to accommodate Utf8View as an alternative layout, the minimal change which I can think of right now is

- to add `string Field::alternative_layout` to identify alternative layouts in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple character buffers when reading serialized arrow batches. However it will not require that any implementations' data structures support multiple buffers since the explicit default for any implementation is to always convert Utf8View to Utf8.
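To make the "always convert Utf8View to Utf8" default concrete, here is a minimal sketch of such a conversion. It assumes a simplified, hypothetical model of the view layout: each entry is either the inline bytes of a short string or a (buffer index, offset, length) reference into one of several character buffers. The real layout packs this into fixed-width structs and also carries validity information, both of which are omitted here.

```python
from typing import List, Tuple, Union

# Simplified, hypothetical model of one view entry:
#   bytes           -> a short string stored inline in the view itself
#   (idx, off, len) -> a slice of character buffer number idx
View = Union[bytes, Tuple[int, int, int]]


def utf8view_to_utf8(views: List[View], buffers: List[bytes]) -> Tuple[List[int], bytes]:
    """Gather view entries into a single offsets array plus one data buffer."""
    offsets = [0]
    data = bytearray()
    for view in views:
        if isinstance(view, bytes):
            # Inline value: copy it directly.
            data += view
        else:
            # Out-of-line value: look it up in the referenced buffer.
            buf_idx, start, length = view
            data += buffers[buf_idx][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)
```

The point of the sketch is that a reader only has to walk the variable number of character buffers once; after that it holds an ordinary offsets-plus-data Utf8 array and none of its own data structures need to know about multiple buffers.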
If this sounds acceptable, I'll prepare a draft PR which

- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin wrote:

Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context, which can be found at [1].

I think alternative layouts sounds like a nice first approach to allowing new layouts that can be supported lazily (implemented when it is beneficial) by various implementations of the Arrow Columnar Format. But, I do think that it's just a (practical) formalization of saying what layouts are required and which ones are optional.

From the making of the diagram, I also decided that the discussion isn't limited to performance, since there are several reasons new physical layouts may be proposed (or, at least, there are many aspects of performance). Even if it's not "canonical alternative layouts," I think it is important that there be some process for developers that use Arrow to propose extensions to the columnar format without having to prove out the benefits for libraries that use a different tech sta
Re: [DISCUSS] Canonical alternative layout proposal
Canonical alternative layouts sounds like a workable path forward. Perhaps understandably, my immediate thought is how I could rephrase Utf8View as a canonical alternative layout for Utf8. In light of that, I have a few questions to clarify what constitutes support for a canonical alternative layout. Specifically:

- do we extend Field to indicate if and which alternative layout is being used
- or do we add AltSchema to wrap a schema and indicate which of its fields have alternate layouts
- ...
- do we extend RecordBatch to support canonical alternative layouts
- or do we add AltRecordBatch for that purpose (which iiuc would complicate dictionary batches containing any column of an alternate layout)
- ...

To add context, one of the reasons we could not just use extension types for Utf8View is that these are required to be backed by a known layout, and no primary layout in the format has a variable number of buffers. In order to accommodate Utf8View as an alternative layout, the minimal change which I can think of right now is

- to add `string Field::alternative_layout` to identify alternative layouts in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple character buffers when reading serialized arrow batches. However it will not require that any implementations' data structures support multiple buffers since the explicit default for any implementation is to always convert Utf8View to Utf8.

If this sounds acceptable, I'll prepare a draft PR which

- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin wrote:

> Thanks Neal and Weston! > > I prepared a diagram to solidify my own understanding of the context, > which can be found at [1].
> > I think alternative layouts sounds like a nice first approach to allowing > new layouts that can be supported lazily (implemented when it is > beneficial) by various implementations of the Arrow Columnar Format. But, I > do think that it's just a (practical) formalization of saying what layouts > are required and which ones are optional. > > From the making of the diagram, I also decided that the discussion isn't > limited to performance, since there are several reasons new physical > layouts may be proposed (or, at least, there are many aspects of > performance). Even if it's not "canonical alternative layouts," I think it > is important that there be some process for developers that use Arrow to > propose extensions to the columnar format without having to prove out the > benefits for libraries that use a different tech stack (e.g. rust vs C++ vs > go). > > > [1]: > https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing > > > > > # -- > > # Aldrin > > > https://github.com/drin/ > > https://gitlab.com/octalene > > https://keybase.io/octalene > > > --- Original Message --- > On Thursday, July 13th, 2023 at 10:49, Dane Pitkin > wrote: > > > > I am in favor of this proposal. IMO the Arrow project is the right place > to > > standardize both the interoperability and operability of columnar data > > layouts. Data engines are a core component of the Arrow ecosystem and the > > project should be able to grow with these data engines as they converge > on > > new layouts. Since columnar data is ubiquitous in analytical workloads, > we > > are seeing a natural progression into optimizing those workloads. This > > includes new lossless compression schemes for columnar data that allows > > engines to operate directly on the compressed data (e.g. RLE). If we > can't > > reliably support the growing needs of the broader data engine ecosystem > in > > a timely manner, then I also fear Arrow might lose relevancy over time. 
> > > > > On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote: > > > > > > Thank you Weston for proposing this solution and Neal for describing > > > its context and implications. I agree with the other replies here—this > > > seems like an elegant solution to a growing need that could, if left > > > unaddressed, increase the fragmentation of the ecosystem and reduce > > > the centrality of the Arrow format. > > > > > > > Greater diversity of layouts is happening. Whether it happens inside > > > of Arrow or outside of Arrow is up to us. I think we all would like to > > > see it happen inside of Arrow. This proposal allows for that, while > > > striking a balance as Raphael describes. > > > > > > > However I think there is still some ambiguity about exactly how an > > > Arrow implementation that is consuming/producing data would negotiate > > > with an Arrow implementation or other component that is > > > producing/consuming data to determine whether an alternative lay
Re: [DISCUSS] Canonical alternative layout proposal
Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context, which can be found at [1].

I think alternative layouts sounds like a nice first approach to allowing new layouts that can be supported lazily (implemented when it is beneficial) by various implementations of the Arrow Columnar Format. But, I do think that it's just a (practical) formalization of saying what layouts are required and which ones are optional.

From the making of the diagram, I also decided that the discussion isn't limited to performance, since there are several reasons new physical layouts may be proposed (or, at least, there are many aspects of performance). Even if it's not "canonical alternative layouts," I think it is important that there be some process for developers that use Arrow to propose extensions to the columnar format without having to prove out the benefits for libraries that use a different tech stack (e.g. rust vs C++ vs go).

[1]: https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing

# --
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

--- Original Message ---
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin wrote:

> I am in favor of this proposal. IMO the Arrow project is the right place to > standardize both the interoperability and operability of columnar data > layouts. Data engines are a core component of the Arrow ecosystem and the > project should be able to grow with these data engines as they converge on > new layouts. Since columnar data is ubiquitous in analytical workloads, we > are seeing a natural progression into optimizing those workloads. This > includes new lossless compression schemes for columnar data that allows > engines to operate directly on the compressed data (e.g. RLE).
If we can't > reliably support the growing needs of the broader data engine ecosystem in > a timely manner, then I also fear Arrow might lose relevancy over time. > > On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote: > > > Thank you Weston for proposing this solution and Neal for describing > > its context and implications. I agree with the other replies here—this > > seems like an elegant solution to a growing need that could, if left > > unaddressed, increase the fragmentation of the ecosystem and reduce > > the centrality of the Arrow format. > > > > Greater diversity of layouts is happening. Whether it happens inside > > of Arrow or outside of Arrow is up to us. I think we all would like to > > see it happen inside of Arrow. This proposal allows for that, while > > striking a balance as Raphael describes. > > > > However I think there is still some ambiguity about exactly how an > > Arrow implementation that is consuming/producing data would negotiate > > with an Arrow implementation or other component that is > > producing/consuming data to determine whether an alternative layout is > > supported. This was discussed briefly in [5] but I am interested to > > see how this negotiation would be implemented in practice in the C > > data interface, IPC, Flight, etc. > > > > Ian > > > > [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2 > > > > On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies > > r.taylordav...@googlemail.com.invalid wrote: > > > > > I like this proposal, I think it strikes a pragmatic balance between > > > preserving interoperability whilst still allowing new ideas to be > > > incorporated into the standard. Thank you for writing this up. > > > > > > On 13/07/2023 10:22, Matt Topol wrote: > > > > > > > I don't have much to add but I do want to second Jacob's comments. 
I > > > > agree > > > > that this is a good way to avoid the fragmentation while keeping Arrow > > > > relevant, and likely something we need to do so that we can ensure > > > > Arrow > > > > remains the way to do this data integration and interoperability. > > > > > > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens > > > > ja...@voltrondata.com.invalid wrote: > > > > > > > > > Hello Everyone, > > > > > > > > > > Thanks for this comprehensive but concise write up Neal! I think this > > > > > proposal is a good way to avoid both fragmentation of the arrow > > > > > ecosystem > > > > > as well as its obsolescence. In my opinion of these two problems the > > > > > obsolescence is the bigger issue as (as mentioned in the proposal) > > > > > arrow is > > > > > already (close to) being relegated to the sidelines in eco-system > > > > > defining > > > > > projects. > > > > > > > > > > Jacob > > > > > > > > > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson < > > > > > neal.p.richard...@gmail.com> wrote: > > > > > > > > > > > Hi all, > > > > > > As was previously raised in [1] and surfaced again in [2], there is a > > > > > > proposal for representing alternative layouts. The in
Re: [DISCUSS] Canonical alternative layout proposal
I am in favor of this proposal. IMO the Arrow project is the right place to standardize both the interoperability *and operability* of columnar data layouts. Data engines are a core component of the Arrow ecosystem and the project should be able to grow with these data engines as they converge on new layouts. Since columnar data is ubiquitous in analytical workloads, we are seeing a natural progression into optimizing those workloads. This includes new lossless compression schemes for columnar data that allows engines to operate directly on the compressed data (e.g. RLE). If we can't reliably support the growing needs of the broader data engine ecosystem in a timely manner, then I also fear Arrow might lose relevancy over time. On Thu, Jul 13, 2023 at 11:59 AM Ian Cook wrote: > Thank you Weston for proposing this solution and Neal for describing > its context and implications. I agree with the other replies here—this > seems like an elegant solution to a growing need that could, if left > unaddressed, increase the fragmentation of the ecosystem and reduce > the centrality of the Arrow format. > > Greater diversity of layouts is happening. Whether it happens inside > of Arrow or outside of Arrow is up to us. I think we all would like to > see it happen inside of Arrow. This proposal allows for that, while > striking a balance as Raphael describes. > > However I think there is still some ambiguity about exactly how an > Arrow implementation that is consuming/producing data would negotiate > with an Arrow implementation or other component that is > producing/consuming data to determine whether an alternative layout is > supported. This was discussed briefly in [5] but I am interested to > see how this negotiation would be implemented in practice in the C > data interface, IPC, Flight, etc. 
> > Ian > > [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2 > > > On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies > wrote: > > > > I like this proposal, I think it strikes a pragmatic balance between > > preserving interoperability whilst still allowing new ideas to be > > incorporated into the standard. Thank you for writing this up. > > > > On 13/07/2023 10:22, Matt Topol wrote: > > > I don't have much to add but I do want to second Jacob's comments. I > agree > > > that this is a good way to avoid the fragmentation while keeping Arrow > > > relevant, and likely something we need to do so that we can ensure > Arrow > > > remains the way to do this data integration and interoperability. > > > > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens > > > wrote: > > > > > >> Hello Everyone, > > >> > > >> Thanks for this comprehensive but concise write up Neal! I think this > > >> proposal is a good way to avoid both fragmentation of the arrow > ecosystem > > >> as well as its obsolescence. In my opinion of these two problems the > > >> obsolescence is the bigger issue as (as mentioned in the proposal) > arrow is > > >> already (close to) being relegated to the sidelines in eco-system > defining > > >> projects. > > >> > > >> Jacob > > >> > > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson < > > >> neal.p.richard...@gmail.com> wrote: > > >> > > >>> Hi all, > > >>> As was previously raised in [1] and surfaced again in [2], there is a > > >>> proposal for representing alternative layouts. The intent, as I > > >> understand > > >>> it, is to be able to support memory layouts that some (but perhaps > not > > >> all) > > >>> applications of Arrow find valuable, so that these nearly Arrow > systems > > >> can > > >>> be fully Arrow-native. 
> > >>> > > >>> I wanted to start a more focused discussion on it because I think > it's > > >>> worth being considered on its own merits, but I also think this gets > to > > >> the > > >>> core of what the Arrow project is and should be, and I don't want us > to > > >>> lose sight of that. > > >>> > > >>> To restate the proposal from [1]: > > >>> > > >>> * There are one or more primary layouts > > >>> * Existing layouts are automatically considered primary layouts, > > >>> even if they > > >>> wouldn't have been primary layouts initially (e.g. large list) > > >>> * A new layout, if it is semantically equivalent to another, is > > >>> considered an > > >>> alternative layout > > >>> * An alternative layout still has the same requirements for > adoption > > >>> (two implementations > > >>> and a vote) > > >>> * An implementation should not feel pressured to rush and > implement > > >> the > > >>> new > > >>> layout. It would be good if they contribute in the discussion and > > >> consider > > >>> the layout and vote if they feel it would be an acceptable design. > > >>> * We can define and vote and approve as many canonical alternative > > >>> layouts as > > >>> we want: > > >>> * A canonical alternative layout should, at a minimum, have some > > >>> reasonable > > >>> justification, such as improved performance for algorithm X > > >>> *
Re: [DISCUSS] Canonical alternative layout proposal
Thank you Weston for proposing this solution and Neal for describing its context and implications. I agree with the other replies here—this seems like an elegant solution to a growing need that could, if left unaddressed, increase the fragmentation of the ecosystem and reduce the centrality of the Arrow format. Greater diversity of layouts is happening. Whether it happens inside of Arrow or outside of Arrow is up to us. I think we all would like to see it happen inside of Arrow. This proposal allows for that, while striking a balance as Raphael describes. However I think there is still some ambiguity about exactly how an Arrow implementation that is consuming/producing data would negotiate with an Arrow implementation or other component that is producing/consuming data to determine whether an alternative layout is supported. This was discussed briefly in [5] but I am interested to see how this negotiation would be implemented in practice in the C data interface, IPC, Flight, etc. Ian [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2 On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies wrote: > > I like this proposal, I think it strikes a pragmatic balance between > preserving interoperability whilst still allowing new ideas to be > incorporated into the standard. Thank you for writing this up. > > On 13/07/2023 10:22, Matt Topol wrote: > > I don't have much to add but I do want to second Jacob's comments. I agree > > that this is a good way to avoid the fragmentation while keeping Arrow > > relevant, and likely something we need to do so that we can ensure Arrow > > remains the way to do this data integration and interoperability. > > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens > > wrote: > > > >> Hello Everyone, > >> > >> Thanks for this comprehensive but concise write up Neal! I think this > >> proposal is a good way to avoid both fragmentation of the arrow ecosystem > >> as well as its obsolescence. 
In my opinion of these two problems the > >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is > >> already (close to) being relegated to the sidelines in eco-system defining > >> projects. > >> > >> Jacob > >> > >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson < > >> neal.p.richard...@gmail.com> wrote: > >> > >>> Hi all, > >>> As was previously raised in [1] and surfaced again in [2], there is a > >>> proposal for representing alternative layouts. The intent, as I > >> understand > >>> it, is to be able to support memory layouts that some (but perhaps not > >> all) > >>> applications of Arrow find valuable, so that these nearly Arrow systems > >> can > >>> be fully Arrow-native. > >>> > >>> I wanted to start a more focused discussion on it because I think it's > >>> worth being considered on its own merits, but I also think this gets to > >> the > >>> core of what the Arrow project is and should be, and I don't want us to > >>> lose sight of that. > >>> > >>> To restate the proposal from [1]: > >>> > >>> * There are one or more primary layouts > >>> * Existing layouts are automatically considered primary layouts, > >>> even if they > >>> wouldn't have been primary layouts initially (e.g. large list) > >>> * A new layout, if it is semantically equivalent to another, is > >>> considered an > >>> alternative layout > >>> * An alternative layout still has the same requirements for adoption > >>> (two implementations > >>> and a vote) > >>> * An implementation should not feel pressured to rush and implement > >> the > >>> new > >>> layout. It would be good if they contribute in the discussion and > >> consider > >>> the layout and vote if they feel it would be an acceptable design. 
> >>> * We can define and vote and approve as many canonical alternative > >>> layouts as > >>> we want: > >>> * A canonical alternative layout should, at a minimum, have some > >>> reasonable > >>> justification, such as improved performance for algorithm X > >>> * Arrow implementations MUST support the primary layouts > >>> * An Arrow implementation MAY support a canonical alternative, however: > >>> * An Arrow implementation MUST first support the primary layout > >>> * An Arrow implementation MUST support conversion to/from the primary > >>> and > >>> canonical layout > >>> * An Arrow implementation's APIs MUST only provide data in the > >>> alternative layout if it is explicitly asked for (e.g. schema inference > >>> should prefer the primary layout). > >>> * We can still vote for new primary layouts (e.g. promoting a > >>> canonical alternative) > >>> but, in these votes we don't only consider the value (e.g. performance) > >> of > >>> the layout but also the interoperability. In other words, a layout can > >> only > >>> become a primary layout if there is significant evidence that most > >>> implementations > >>> plan to adopt it. > >>> > >>> > >>> To summarize some of the arguments against the proposal from the previous >
Re: [DISCUSS] Canonical alternative layout proposal
I like this proposal, I think it strikes a pragmatic balance between preserving interoperability whilst still allowing new ideas to be incorporated into the standard. Thank you for writing this up. On 13/07/2023 10:22, Matt Topol wrote: I don't have much to add but I do want to second Jacob's comments. I agree that this is a good way to avoid the fragmentation while keeping Arrow relevant, and likely something we need to do so that we can ensure Arrow remains the way to do this data integration and interoperability. On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens wrote: Hello Everyone, Thanks for this comprehensive but concise write up Neal! I think this proposal is a good way to avoid both fragmentation of the arrow ecosystem as well as its obsolescence. In my opinion of these two problems the obsolescence is the bigger issue as (as mentioned in the proposal) arrow is already (close to) being relegated to the sidelines in eco-system defining projects. Jacob On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson < neal.p.richard...@gmail.com> wrote: Hi all, As was previously raised in [1] and surfaced again in [2], there is a proposal for representing alternative layouts. The intent, as I understand it, is to be able to support memory layouts that some (but perhaps not all) applications of Arrow find valuable, so that these nearly Arrow systems can be fully Arrow-native. I wanted to start a more focused discussion on it because I think it's worth being considered on its own merits, but I also think this gets to the core of what the Arrow project is and should be, and I don't want us to lose sight of that. To restate the proposal from [1]: * There are one or more primary layouts * Existing layouts are automatically considered primary layouts, even if they wouldn't have been primary layouts initially (e.g. 
large list) * A new layout, if it is semantically equivalent to another, is considered an alternative layout * An alternative layout still has the same requirements for adoption (two implementations and a vote) * An implementation should not feel pressured to rush and implement the new layout. It would be good if they contribute in the discussion and consider the layout and vote if they feel it would be an acceptable design. * We can define and vote and approve as many canonical alternative layouts as we want: * A canonical alternative layout should, at a minimum, have some reasonable justification, such as improved performance for algorithm X * Arrow implementations MUST support the primary layouts * An Arrow implementation MAY support a canonical alternative, however: * An Arrow implementation MUST first support the primary layout * An Arrow implementation MUST support conversion to/from the primary and canonical layout * An Arrow implementation's APIs MUST only provide data in the alternative layout if it is explicitly asked for (e.g. schema inference should prefer the primary layout). * We can still vote for new primary layouts (e.g. promoting a canonical alternative) but, in these votes we don't only consider the value (e.g. performance) of the layout but also the interoperability. In other words, a layout can only become a primary layout if there is significant evidence that most implementations plan to adopt it. To summarize some of the arguments against the proposal from the previous threads, there are concerns about increasing the complexity of the Arrow specification and the cost/burden of updating all of the Arrow specifications to support them. Where these discussions, both about several proposed new types and this layout proposal, get to the core of Arrow is well expressed in the comments on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what matters to people more, interoperability or best-in-class performance?" 
And Pedro notes that because of the overhead of converting these not-yet-Arrow types to the Arrow C ABI is high enough that they've considered abandoning Arrow as their interchange format. So: on the one hand, we're kinda choosing which quality we're optimizing for, but on the other, interoperability and performance are dependent on each other. What I see that we're trying to do here is find a way to expand the Arrow specification just enough so that Arrow becomes or remains the in-memory standard everywhere, but not so much that it creates too much complexity or burden to implement. Expand too much and you get a fragmented ecosystem where everyone is writing subsets of the Arrow standard and so nothing is fully compatible and the whole premise is undermined. But expand too little and projects will abandon the standard and we've also failed. I don't have a tidy answer, but I wanted to acknowledge the bigger issues, and see if this helps us reason about the various proposals on the table. I wonder if the alternative layout proposal is the happy medium that adds some complexity to
Re: [DISCUSS] Canonical alternative layout proposal
I don't have much to add but I do want to second Jacob's comments. I agree that this is a good way to avoid the fragmentation while keeping Arrow relevant, and likely something we need to do so that we can ensure Arrow remains the way to do this data integration and interoperability. On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens wrote: > Hello Everyone, > > Thanks for this comprehensive but concise write up Neal! I think this > proposal is a good way to avoid both fragmentation of the arrow ecosystem > as well as its obsolescence. In my opinion of these two problems the > obsolescence is the bigger issue as (as mentioned in the proposal) arrow is > already (close to) being relegated to the sidelines in eco-system defining > projects. > > Jacob > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson < > neal.p.richard...@gmail.com> wrote: > > > Hi all, > > As was previously raised in [1] and surfaced again in [2], there is a > > proposal for representing alternative layouts. The intent, as I > understand > > it, is to be able to support memory layouts that some (but perhaps not > all) > > applications of Arrow find valuable, so that these nearly Arrow systems > can > > be fully Arrow-native. > > > > I wanted to start a more focused discussion on it because I think it's > > worth being considered on its own merits, but I also think this gets to > the > > core of what the Arrow project is and should be, and I don't want us to > > lose sight of that. > > > > To restate the proposal from [1]: > > > > * There are one or more primary layouts > >* Existing layouts are automatically considered primary layouts, > > even if they > > wouldn't have been primary layouts initially (e.g. 
large list) > > * A new layout, if it is semantically equivalent to another, is > > considered an > > alternative layout > > * An alternative layout still has the same requirements for adoption > > (two implementations > > and a vote) > >* An implementation should not feel pressured to rush and implement > the > > new > > layout. It would be good if they contribute in the discussion and > consider > > the layout and vote if they feel it would be an acceptable design. > > * We can define and vote and approve as many canonical alternative > > layouts as > > we want: > >* A canonical alternative layout should, at a minimum, have some > > reasonable > > justification, such as improved performance for algorithm X > > * Arrow implementations MUST support the primary layouts > > * An Arrow implementation MAY support a canonical alternative, however: > >* An Arrow implementation MUST first support the primary layout > >* An Arrow implementation MUST support conversion to/from the primary > > and > > canonical layout > >* An Arrow implementation's APIs MUST only provide data in the > > alternative layout if it is explicitly asked for (e.g. schema inference > > should prefer the primary layout). > > * We can still vote for new primary layouts (e.g. promoting a > > canonical alternative) > > but, in these votes we don't only consider the value (e.g. performance) > of > > the layout but also the interoperability. In other words, a layout can > only > > become a primary layout if there is significant evidence that most > > implementations > > plan to adopt it. > > > > > > To summarize some of the arguments against the proposal from the previous > > threads, there are concerns about increasing the complexity of the Arrow > > specification and the cost/burden of updating all of the Arrow > > specifications to support them. 
> > > > Where these discussions, both about several proposed new types and this > > layout proposal, get to the core of Arrow is well expressed in the > comments > > on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what > > matters to people more, interoperability or best-in-class performance?" > And > > Pedro notes that because of the overhead of converting these > not-yet-Arrow > > types to the Arrow C ABI is high enough that they've considered > abandoning > > Arrow as their interchange format. So: on the one hand, we're kinda > > choosing which quality we're optimizing for, but on the other, > > interoperability and performance are dependent on each other. > > > > What I see that we're trying to do here is find a way to expand the Arrow > > specification just enough so that Arrow becomes or remains the in-memory > > standard everywhere, but not so much that it creates too much complexity > or > > burden to implement. Expand too much and you get a fragmented ecosystem > > where everyone is writing subsets of the Arrow standard and so nothing is > > fully compatible and the whole premise is undermined. But expand too > little > > and projects will abandon the standard and we've also failed. > > > > I don't have a tidy answer, but I wanted to acknowledge the bigger > issues, > > and see if this helps us reason about the various proposals on the > table. I > > wonder if the altern
Re: [DISCUSS] Canonical alternative layout proposal
Hello everyone,

Thanks for this comprehensive but concise write-up, Neal!

I think this proposal is a good way to avoid both the fragmentation of the
Arrow ecosystem and its obsolescence. In my opinion, obsolescence is the
bigger of the two problems because, as mentioned in the proposal, Arrow is
already (close to) being relegated to the sidelines in ecosystem-defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <neal.p.richard...@gmail.com>
wrote:

> Hi all,
> As was previously raised in [1] and surfaced again in [2], there is a
> proposal for representing alternative layouts. The intent, as I understand
> it, is to be able to support memory layouts that some (but perhaps not all)
> applications of Arrow find valuable, so that these nearly Arrow systems can
> be fully Arrow-native.
>
> I wanted to start a more focused discussion on it because I think it's
> worth being considered on its own merits, but I also think this gets to the
> core of what the Arrow project is and should be, and I don't want us to
> lose sight of that.
>
> To restate the proposal from [1]:
>
> * There are one or more primary layouts
>   * Existing layouts are automatically considered primary layouts, even
>     if they wouldn't have been primary layouts initially (e.g. large list)
> * A new layout, if it is semantically equivalent to another, is
>   considered an alternative layout
> * An alternative layout still has the same requirements for adoption
>   (two implementations and a vote)
>   * An implementation should not feel pressured to rush and implement the
>     new layout. It would be good if they contribute in the discussion and
>     consider the layout and vote if they feel it would be an acceptable
>     design.
> * We can define and vote and approve as many canonical alternative
>   layouts as we want:
>   * A canonical alternative layout should, at a minimum, have some
>     reasonable justification, such as improved performance for algorithm X
> * Arrow implementations MUST support the primary layouts
> * An Arrow implementation MAY support a canonical alternative, however:
>   * An Arrow implementation MUST first support the primary layout
>   * An Arrow implementation MUST support conversion to/from the primary
>     and canonical layout
>   * An Arrow implementation's APIs MUST only provide data in the
>     alternative layout if it is explicitly asked for (e.g. schema
>     inference should prefer the primary layout).
> * We can still vote for new primary layouts (e.g. promoting a
>   canonical alternative) but, in these votes we don't only consider the
>   value (e.g. performance) of the layout but also the interoperability.
>   In other words, a layout can only become a primary layout if there is
>   significant evidence that most implementations plan to adopt it.
>
>
> To summarize some of the arguments against the proposal from the previous
> threads, there are concerns about increasing the complexity of the Arrow
> specification and the cost/burden of updating all of the Arrow
> specifications to support them.
>
> Where these discussions, both about several proposed new types and this
> layout proposal, get to the core of Arrow is well expressed in the comments
> on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
> matters to people more, interoperability or best-in-class performance?" And
> Pedro notes that the overhead of converting these not-yet-Arrow types to
> the Arrow C ABI is high enough that they've considered abandoning Arrow as
> their interchange format. So: on the one hand, we're kinda choosing which
> quality we're optimizing for, but on the other, interoperability and
> performance are dependent on each other.
>
> What I see that we're trying to do here is find a way to expand the Arrow
> specification just enough so that Arrow becomes or remains the in-memory
> standard everywhere, but not so much that it creates too much complexity or
> burden to implement. Expand too much and you get a fragmented ecosystem
> where everyone is writing subsets of the Arrow standard and so nothing is
> fully compatible and the whole premise is undermined. But expand too little
> and projects will abandon the standard and we've also failed.
>
> I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
> and see if this helps us reason about the various proposals on the table. I
> wonder if the alternative layout proposal is the happy medium that adds
> some complexity to the specification, but less than there would be if three
> new types were added, and still meets the needs of projects like DuckDB,
> Velox, and Gluten and gets them fully Arrow native.
>
> Neal
>
>
> [1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> [2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> [3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> [4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38
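
To make the MUST/MAY rules restated above a bit more concrete, here is a
minimal sketch in plain Python, with no Arrow library involved. Everything
named here is hypothetical and invented for illustration: the classes
ListArray and ListViewArray, the functions to_alternative, to_primary and
export, and the accepted_layouts parameter. A list-view-style layout stands
in for "a canonical alternative" purely as an assumption, and validity
bitmaps are ignored.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class ListArray:
    """Stand-in for a primary layout: n + 1 non-decreasing offsets."""
    offsets: List[int]
    values: List[int]


@dataclass
class ListViewArray:
    """Stand-in for an alternative layout: one (offset, size) per element.

    Elements may appear in any order in `values` (assumption for this sketch).
    """
    offsets: List[int]
    sizes: List[int]
    values: List[int]


def to_alternative(arr: ListArray) -> ListViewArray:
    """Primary -> alternative: derive sizes from adjacent offsets."""
    sizes = [end - start for start, end in zip(arr.offsets, arr.offsets[1:])]
    return ListViewArray(arr.offsets[:-1], sizes, list(arr.values))


def to_primary(arr: ListViewArray) -> ListArray:
    """Alternative -> primary: copy each element out in logical order."""
    offsets = [0]
    values: List[int] = []
    for off, size in zip(arr.offsets, arr.sizes):
        values.extend(arr.values[off:off + size])
        offsets.append(len(values))
    return ListArray(offsets, values)


def export(
    arr: Union[ListArray, ListViewArray],
    accepted_layouts: Sequence[str] = ("list",),
) -> Union[ListArray, ListViewArray]:
    """Hypothetical producer API.

    The alternative layout is handed out only if the consumer explicitly
    lists it; otherwise the data is converted to the primary layout.
    """
    if isinstance(arr, ListViewArray) and "list_view" not in accepted_layouts:
        return to_primary(arr)
    return arr


# Two logical elements, [4, 5] and [1, 2, 3], stored out of order.
view = ListViewArray(offsets=[3, 0], sizes=[2, 3], values=[1, 2, 3, 4, 5])
default_result = export(view)  # consumer did not ask: gets a ListArray
explicit_result = export(view, accepted_layouts=("list", "list_view"))
```

In this sketch the conversion cost is paid only when the consumer has not
opted in, which is the trade-off the thread is weighing: a consumer that
understands the alternative layout receives it untouched, while everyone
else still sees the primary layout.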