Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-04-14 Thread Micah Kornfield
> 1. Is there any reason to expect these will need to be batched into one new version of the Arrow format? Or would we have no problem adding (as a theoretical example) RLE arrays in format version 2, and then later string views in version 3?
These shouldn't require a major version bump

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-04-14 Thread Will Jones
Hi all, I have a few questions to understand expectations for how work on these could proceed: 1. Is there any reason to expect these will need to be batched into one new version of the Arrow format? Or would we have no problem adding (as a theoretical example) RLE arrays in format version 2,

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-19 Thread Jorge Cardoso Leitão
I have prototyped the sequence views in Rust [1], and it seems a pretty straightforward addition with a trivial representation in both IPC and FFI. I did observe a performance difference between using signed (int64) and unsigned (uint64) offsets/lengths: take/sequence/20 time:

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-12 Thread Andrew Lamb
I also agree that splitting the StringView proposal into its own thing would be beneficial for discussion clarity.
On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou wrote:
> On 12/01/2022 at 01:49, Wes McKinney wrote:
> > hi all,
> > Thank you for all the comments on this mailing list

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-12 Thread Antoine Pitrou
On 12/01/2022 at 01:49, Wes McKinney wrote:
> hi all, Thank you for all the comments on this mailing list thread and in the Google document. There is definitely a lot of work to take some next steps from here, so I think it would make sense to fork off each of the proposed additions into

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-11 Thread Wes McKinney
hi all, Thank you for all the comments on this mailing list thread and in the Google document. There is definitely a lot of work to take some next steps from here, so I think it would make sense to fork off each of the proposed additions into dedicated discussions. The most contentious issue, it

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2022-01-08 Thread Jorge Cardoso Leitão
Fair enough (with respect to deprecation). I think that the sequence view is a replacement for our existing (that allows O(N) selections), but I agree with the sentiment that preserving compatibility is more important than a single way of doing it. Thanks for that angle! Imo the Arrow format is already

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-26 Thread Antoine Pitrou
On 23/12/2021 at 17:59, Neal Richardson wrote: I think in this particular case, we should consider the C ABI / in-memory representation and IPC format as separate beasts. If an implementation of Arrow does not want to use this string-view array type at all (for example, if it created memory

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-23 Thread Andrew Lamb
> If we go forward with these changes, it would be a good opportunity for us to clarify in our docs/website that the "Arrow format" is not a single thing. The idea of using Arrow as a common memory format for interchange between C/C++ implementations makes lots of sense to me. What if we took a

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-23 Thread Neal Richardson
> I think in this particular case, we should consider the C ABI / in-memory representation and IPC format as separate beasts. If an implementation of Arrow does not want to use this string-view array type at all (for example, if it created memory safety issues in Rust), then it can choose

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-22 Thread Wes McKinney
hi Andrew,
On Thu, Dec 16, 2021 at 2:40 PM Andrew Lamb wrote:
> > DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am

RE: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-16 Thread Yang, Binwei
> DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-16 Thread Andrew Lamb
> DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow with a

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Wes McKinney
On Wed, Dec 15, 2021 at 6:22 PM Micah Kornfield wrote:
> > In any case, having memory layouts that support O(# records) selections on strings and nested data will greatly benefit some data processing systems built on Arrow.
> Wes, something that still isn't clear to me, are we

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Micah Kornfield
> In any case, having memory layouts that support O(# records) selections on strings and nested data will greatly benefit some data processing systems built on Arrow.
Wes, something that still isn't clear to me: are we proposing these new encodings for ONLY the C-ABI or do we want to plumb

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Wes McKinney
On Wed, Dec 15, 2021 at 3:56 PM Micah Kornfield wrote:
> > Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Micah Kornfield
> Big +1 in replacing our current representation of variable-sized arrays by the "sequence view". atm I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently large to break compatibility and deprecate the previous

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Weston Pace
> I am -0.5 in adding it without removing the [Large]Utf8Array / Binary / List
I'm not sure about dropping List. Is SequenceView semantically equivalent to List / FixedSizeList? In other words, is SequenceView a nested type? The document seems to suggest it is but the use case you described

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-15 Thread Jorge Cardoso Leitão
Hi, Thanks a lot for this initiative and the write-up. I did a small bench for the sequence view and added a graph to the document as evidence of what Wes is writing with respect to the performance of "selection / take / filter". Big +1 in replacing our current representation of variable-sized arrays by

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-14 Thread Wes McKinney
Ultimately, the problem comes down to providing a means of O(# records) selection (take, filter) performance and memory use for non-numeric data (strings, arrays, maps, etc.). DuckDB and Velox are two projects which have designed themselves to be very nearly Arrow-compatible but have implemented
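A minimal Python sketch (not from the thread) of why selection cost differs between layouts. It assumes a simplified "view" layout of (offset, length) pairs over a shared data buffer; the function names and sample data are hypothetical, and the proposal document may define the actual layout differently. With an offsets-based layout, take has to copy the selected bytes into a new contiguous buffer; with views, only the fixed-size entries are copied, so the work is proportional to the number of records.

```python
from typing import List, Tuple


def take_offsets(offsets: List[int], data: bytes,
                 indices: List[int]) -> Tuple[List[int], bytes]:
    """Take on an offsets-based layout (hypothetical helper).

    Value i occupies data[offsets[i]:offsets[i + 1]], so the selected
    bytes must be copied: cost grows with the total selected byte length.
    """
    new_offsets = [0]
    out = bytearray()
    for i in indices:
        out += data[offsets[i]:offsets[i + 1]]
        new_offsets.append(len(out))
    return new_offsets, bytes(out)


def take_views(views: List[Tuple[int, int]],
               indices: List[int]) -> List[Tuple[int, int]]:
    """Take on a view layout (hypothetical helper).

    Each view is an (offset, length) pair into a shared data buffer, so
    only the pairs are copied: cost grows with the number of records.
    """
    return [views[i] for i in indices]


def decode_views(views: List[Tuple[int, int]], data: bytes) -> List[str]:
    """Materialize the strings a list of views refers to."""
    return [data[o:o + n].decode() for o, n in views]


# Sample column holding "apple", "banana", "cherry" (assumed data).
data = b"applebananacherry"
offsets = [0, 5, 11, 17]
views = [(0, 5), (5, 6), (11, 6)]
indices = [2, 0, 2]

taken_offsets, taken_data = take_offsets(offsets, data, indices)
taken_views = take_views(views, indices)
```

Note that the view result still points into the original `data` buffer, and that entries may repeat or appear out of order, which an offsets-based layout cannot express without copying.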

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-14 Thread Weston Pace
Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent more clear. This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created parquet

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-14 Thread Wes McKinney
hi folks, A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions. It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-14 Thread Antoine Pitrou
Hello, I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet where the efficient "delta" encodings have

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-13 Thread Micah Kornfield
Hi Wes, I'm also in favor of most of this, I need to think more about the new list layout, and I think the RLE encoding as proposed contains redundancies with dictionary encoding data we might not want. A further question on this, do you expect all of this to be packaged up as a RecordBatch for

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-13 Thread Andrew Lamb
Thank you for writing this down, Wes. I think my project is very interested in the RLE encoding and constant view. The StringView, as written, seems fairly tightly tied to C/C++, though I may be mistaken. I think allowing Rust to consume such StringViews would be possible but it seems very
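For readers unfamiliar with the RLE idea mentioned in this message, here is a rough Python sketch (not from the thread). It assumes one possible representation, a list of run values plus cumulative run ends; the proposal document may specify a different layout, and all names here are hypothetical. Random access uses a binary search over the run ends.

```python
from bisect import bisect_right
from typing import List, Tuple


def rle_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Collapse consecutive equal values into (run_values, run_ends).

    run_ends[k] is the exclusive end index of run k in the logical array.
    This representation is an assumption made for the sketch.
    """
    run_values: List[str] = []
    run_ends: List[int] = []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(v)  # start a new run
            run_ends.append(i + 1)
    return run_values, run_ends


def rle_get(run_values: List[str], run_ends: List[int], i: int) -> str:
    """Return logical element i via binary search over the run ends."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    return run_values[bisect_right(run_ends, i)]


def rle_decode(run_values: List[str], run_ends: List[int]) -> List[str]:
    """Expand the runs back into the full logical array."""
    out: List[str] = []
    start = 0
    for v, end in zip(run_values, run_ends):
        out.extend([v] * (end - start))
        start = end
    return out


column = ["a", "a", "a", "b", "c", "c"]
run_values, run_ends = rle_encode(column)
```

In this sketch a constant column collapses to a single run, which is one way to see how a "constant view" relates to RLE.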

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-10 Thread Jacques Nadeau
I'm strongly in support of much of this. Thanks for bringing this up. It is long overdue. On initial read, my thoughts would be:
Strongly inclined:
- String view
- constant view
Weakly inclined:
- All null
- rle
Somewhat disinclined:
- Sequence change
With dictionary and string view, I feel

[DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

2021-12-10 Thread Wes McKinney
hello all, This topic may provoke , but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. In recent times, I have been motivated by