>
> 1. Is there any reason to expect these will need to be batched into one new
> version of the Arrow format? Or would we have no problem adding (as an
> theoretical example) RLE arrays in format version 2, and then later string
> views in version 3?
These shouldn't require a major version bump i
Hi all,
I have a few questions to understand expectations for how work on these
could proceed:
1. Is there any reason to expect these will need to be batched into one new
version of the Arrow format? Or would we have no problem adding (as an
theoretical example) RLE arrays in format version 2, an
I have prototyped the sequence views in Rust [1], and it seems a pretty
straightforward addition with a trivial representation in both IPC and FFI.
I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:
take/sequence/20 time: [20.49
I also agree that splitting the StringView proposal into its own thing
would be beneficial for discussion clarity
On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou wrote:
>
> On 12/01/2022 at 01:49, Wes McKinney wrote:
> > hi all,
> >
> > Thank you for all the comments on this mailing list thread
On 12/01/2022 at 01:49, Wes McKinney wrote:
hi all,
Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into dedic
hi all,
Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into dedicated discussions. The most
contentious issue, it s
Fair enough (wrt deprecation). I think that the sequence view is a
replacement for our existing representation (that allows O(N) selections), but I agree
with the sentiment that preserving compatibility is more important than having a
single way of doing it. Thanks for that angle!
Imo the Arrow format is already compo
On 23/12/2021 at 17:59, Neal Richardson wrote:
I think in this particular case, we should consider the C ABI /
in-memory representation and IPC format as separate beasts. If an
implementation of Arrow does not want to use this string-view array
type at all (for example, if it created memory
> If we go forward with these changes, it would be a good
opportunity for us to clarify in our docs/website that the "Arrow format"
is not a single thing.
The idea of using Arrow as a common memory format for interchange between
C/C++ implementations makes lots of sense to me.
What if we took a m
> I think in this particular case, we should consider the C ABI /
> in-memory representation and IPC format as separate beasts. If an
> implementation of Arrow does not want to use this string-view array
> type at all (for example, if it created memory safety issues in Rust),
> then it can choose t
hi Andrew,
On Thu, Dec 16, 2021 at 2:40 PM Andrew Lamb wrote:
>
> > DuckDB and Velox are two projects which have designed themselves to be
> > very nearly Arrow-compatible but have implemented alternative memory
> > layouts to achieve O(# records) selections on all data types. I am
> > proposing
Subject: Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory,
IPC, C ABI)
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on al
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on all data types. I am
> proposing to adopt these innovations as additional memory layouts in
> Arrow with a
On Wed, Dec 15, 2021 at 6:22 PM Micah Kornfield wrote:
>>
>> In any case, having memory layouts that support O(# records)
>> selections on strings and nested data will greatly benefit some data
>> processing systems built on Arrow.
>
>
> Wes, something that still isn't clear to me, are we proposin
>
> In any case, having memory layouts that support O(# records)
> selections on strings and nested data will greatly benefit some data
> processing systems built on Arrow.
Wes, something that still isn't clear to me: are we proposing these new
encodings for ONLY the C-ABI or do we want to plumb t
On Wed, Dec 15, 2021 at 3:56 PM Micah Kornfield wrote:
>
> >
> > Big +1 in replacing our current representation of variable-sized arrays by
> > the "sequence view". atm I am -0.5 in adding it without removing the
> > [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently
> > lar
>
> Big +1 in replacing our current representation of variable-sized arrays by
> the "sequence view". atm I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently
> large to break compatibility and deprecate the previous representations
> I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List
I'm not sure about dropping List.
Is SequenceView semantically equivalent to List / FixedSizeList? In
other words, is SequenceView a nested type? The document seems to
suggest it is but the use case you described d
Hi,
Thanks a lot for this initiative and the write-up.
I did a small bench for the sequence view and added a graph to the document
as evidence of what Wes is writing wrt the performance of "selection / take
/ filter".
Big +1 in replacing our current representation of variable-sized arrays by
the
Ultimately, the problem comes down to providing a means of O(#
records) selection (take, filter) performance and memory use for
non-numeric data (strings, arrays, maps, etc.).
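
To make the O(# records) point concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the layout from the proposal document: the function names are hypothetical, and the view layout is assumed to be one (start, length) pair per row pointing into a shared values buffer. With the classic offsets layout, a take has to copy the selected bytes; with the view-style layout, it only gathers one pair per selected row.

```python
def take_offsets(offsets, values, indices):
    """Classic variable-size layout: n + 1 increasing offsets into `values`.

    Selecting rows copies the selected bytes into a new contiguous buffer,
    so the cost grows with the total size of the selected values.
    """
    new_offsets = [0]
    new_values = bytearray()
    for i in indices:
        new_values += values[offsets[i]:offsets[i + 1]]
        new_offsets.append(len(new_values))
    return new_offsets, bytes(new_values)


def take_views(starts, lengths, values, indices):
    """View-style layout (assumed: one (start, length) pair per row).

    Selecting rows gathers the pairs only; `values` is shared untouched,
    so the cost grows with the number of selected rows.
    """
    return [starts[i] for i in indices], [lengths[i] for i in indices], values


def decode(starts, lengths, values):
    """Materialize the strings that a list of views points at."""
    return [values[s:s + n].decode() for s, n in zip(starts, lengths)]


# Three strings stored back to back: "apple", "kiwi", "banana".
values = b"applekiwibanana"
offsets = [0, 5, 9, 15]
starts = offsets[:-1]
lengths = [offsets[i + 1] - offsets[i] for i in range(len(starts))]

picked_offsets, picked_values = take_offsets(offsets, values, [2, 0])
picked_starts, picked_lengths, shared = take_views(starts, lengths, values, [2, 0])
```

In the view case the result may point at the values out of order, which is what lets the selection skip the copy; the trade-off is that the values buffer is no longer compact for the selected rows.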
DuckDB and Velox are two projects which have designed themselves to be
very nearly Arrow-compatible but have implemented a
Would it be simpler to change the spec so that child arrays can be
chunked? This might reduce the data type growth and make the intent
more clear.
This will add another dimension to performance analysis. We pretty
regularly get issues/tickets from users that have unknowingly created
parquet file
hi folks,
A few things in the general discussion, before certain things will
have to be split off into their own dedicated discussions.
It seems that I didn't do a very good job of motivating the "sequence
view" type. Let me take a step back and discuss one of the problems
these new memory layout
Hello,
I think my main concern is how we can prevent the community from
fragmenting too much over supported encodings. The more complex the
encodings, the less likely they are to be supported by all main
implementations. We see this in Parquet where the efficient "delta"
encodings have ju
Hi Wes,
I'm also in favor of most of this. I need to think more about the new list
layout, and I think the RLE encoding as proposed contains redundancies with
dictionary encoding data we might not want.
A further question on this, do you expect all of this to be packaged up as
a RecordBatch for IP
Thank you for writing this down, Wes.
I think my project is very interested in the RLE encoding and constant
view.
The StringView, as written, seems fairly tightly tied to C/C++, though I
may be mistaken. I think allowing Rust to consume such StringViews would be
possible but it seems very unlikely
I'm strongly in support of much of this. Thanks for bringing this up. It is
long overdue.
On initial read, my thoughts would be:
Strongly inclined:
- String view
- constant view
Weakly inclined
- All null
- RLE
Somewhat disinclined
- Sequence change
With dictionary and string view, I feel like