>
> 1. Is there any reason to expect these will need to be batched into one new
> version of the Arrow format? Or would we have no problem adding (as a
> theoretical example) RLE arrays in format version 2, and then later string
> views in version 3?
These shouldn't require a major version bump
Hi all,
I have a few questions to understand expectations for how work on these
could proceed:
1. Is there any reason to expect these will need to be batched into one new
version of the Arrow format? Or would we have no problem adding (as a
theoretical example) RLE arrays in format version 2,
I have prototyped the sequence views in Rust [1], and it seems a pretty
straightforward addition with a trivial representation in both IPC and FFI.
I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:
take/sequence/20 time:
I also agree that splitting the StringView proposal into its own thing
would be beneficial for discussion clarity
On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou wrote:
>
> On 12/01/2022 at 01:49, Wes McKinney wrote:
> > hi all,
> >
> > Thank you for all the comments on this mailing list
On 12/01/2022 at 01:49, Wes McKinney wrote:
hi all,
Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into
hi all,
Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into dedicated discussions. The most
contentious issue, it
Fair enough (wrt deprecation). I think that the sequence view is a
replacement for our existing representation (that allows O(N) selections), but I agree
with the sentiment that preserving compatibility is more important than a
single way of doing it. Thanks for that angle!
IMO the Arrow format is already
On 23/12/2021 at 17:59, Neal Richardson wrote:
I think in this particular case, we should consider the C ABI /
in-memory representation and IPC format as separate beasts. If an
implementation of Arrow does not want to use this string-view array
type at all (for example, if it created memory
> If we go forward with these changes, it would be a good
opportunity for us to clarify in our docs/website that the "Arrow format"
is not a single thing.
The idea of using Arrow as a common memory format for interchange between
C/C++ implementations makes lots of sense to me.
What if we took a
> I think in this particular case, we should consider the C ABI /
> in-memory representation and IPC format as separate beasts. If an
> implementation of Arrow does not want to use this string-view array
> type at all (for example, if it created memory safety issues in Rust),
> then it can choose
hi Andrew,
On Thu, Dec 16, 2021 at 2:40 PM Andrew Lamb wrote:
>
> > DuckDB and Velox are two projects which have designed themselves to be
> > very nearly Arrow-compatible but have implemented alternative memory
> > layouts to achieve O(# records) selections on all data types. I am
> >
[DISCUSS] Adding new columnar memory layouts to Arrow (in-memory,
IPC, C ABI)
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on all data types
> DuckDB and Velox are two projects which have designed themselves to be
> very nearly Arrow-compatible but have implemented alternative memory
> layouts to achieve O(# records) selections on all data types. I am
> proposing to adopt these innovations as additional memory layouts in
> Arrow with a
On Wed, Dec 15, 2021 at 6:22 PM Micah Kornfield wrote:
>>
>> In any case, having memory layouts that support O(# records)
>> selections on strings and nested data will greatly benefit some data
>> processing systems built on Arrow.
>
>
> Wes, something that still isn't clear to me, are we
>
> In any case, having memory layouts that support O(# records)
> selections on strings and nested data will greatly benefit some data
> processing systems built on Arrow.
Wes, something that still isn't clear to me: are we proposing these new
encodings for ONLY the C-ABI or do we want to plumb
On Wed, Dec 15, 2021 at 3:56 PM Micah Kornfield wrote:
>
> >
> > Big +1 in replacing our current representation of variable-sized arrays by
> > the "sequence view". atm I am -0.5 in adding it without removing the
> > [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently
> >
>
> Big +1 in replacing our current representation of variable-sized arrays by
> the "sequence view". atm I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List, as I see the advantages as sufficiently
> large to break compatibility and deprecate the previous
> I am -0.5 in adding it without removing the
> [Large]Utf8Array / Binary / List
I'm not sure about dropping List.
Is SequenceView semantically equivalent to List / FixedSizeList? In
other words, is SequenceView a nested type? The document seems to
suggest it is but the use case you described
Hi,
Thanks a lot for this initiative and the write-up.
I did a small bench for the sequence view and added a graph to the document
as evidence of what Wes is writing wrt the performance of "selection / take
/ filter".
Big +1 in replacing our current representation of variable-sized arrays by
Ultimately, the problem comes down to providing a means of O(#
records) selection (take, filter) performance and memory use for
non-numeric data (strings, arrays, maps, etc.).
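
To make the cost difference concrete, here is a minimal Python sketch, not taken from any Arrow implementation. `OffsetArray`, `SequenceView`, `take_offsets`, and `take_view` are hypothetical names, and the view layout shown (one offset and one length per element over a shared values buffer) is an assumption about the proposal rather than its final specification.

```python
from dataclasses import dataclass


@dataclass
class OffsetArray:
    """Classic layout: n + 1 increasing offsets into a contiguous values buffer."""
    offsets: list
    values: bytes

    def get(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]


@dataclass
class SequenceView:
    """Assumed view layout: an independent (offset, length) pair per element."""
    offsets: list
    lengths: list
    values: bytes

    def get(self, i):
        start = self.offsets[i]
        return self.values[start:start + self.lengths[i]]


def take_offsets(arr, indices):
    # The output needs its own contiguous values buffer, so every selected
    # element's bytes are copied: cost grows with the bytes selected.
    out_offsets = [0]
    out = bytearray()
    for i in indices:
        out += arr.get(i)
        out_offsets.append(len(out))
    return OffsetArray(out_offsets, bytes(out))


def take_view(arr, indices):
    # Only the per-element pairs are gathered; the values buffer is shared,
    # so the cost is proportional to the number of selected records.
    return SequenceView(
        [arr.offsets[i] for i in indices],
        [arr.lengths[i] for i in indices],
        arr.values,
    )


values = b"arrowduckdbvelox"
classic = OffsetArray([0, 5, 11, 16], values)
view = SequenceView([0, 5, 11], [5, 6, 5], values)

taken_classic = take_offsets(classic, [2, 0, 2])
taken_view = take_view(view, [2, 0, 2])
```

With the offsets layout, the take copies the selected bytes into a new buffer. With the view layout, it gathers one (offset, length) pair per selected record and reuses the existing values buffer.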
DuckDB and Velox are two projects which have designed themselves to be
very nearly Arrow-compatible but have implemented
Would it be simpler to change the spec so that child arrays can be
chunked? This might reduce the data type growth and make the intent
more clear.
This will add another dimension to performance analysis. We pretty
regularly get issues/tickets from users that have unknowingly created
parquet
hi folks,
A few things in the general discussion, before certain things will
have to be split off into their own dedicated discussions.
It seems that I didn't do a very good job of motivating the "sequence
view" type. Let me take a step back and discuss one of the problems
these new memory
Hello,
I think my main concern is how we can prevent the community from
fragmenting too much over supported encodings. The more complex the
encodings, the less likely they are to be supported by all main
implementations. We see this in Parquet where the efficient "delta"
encodings have
Hi Wes,
I'm also in favor of most of this. I need to think more about the new list
layout, and I think the RLE encoding as proposed contains redundancies with
dictionary encoding data we might not want.
A further question on this, do you expect all of this to be packaged up as
a RecordBatch for
Thank you for writing this down, Wes.
I think my project is very interested in the RLE encoding and constant
view.
The StringView, as written, seems fairly tightly tied to C/C++, though I
may be mistaken. I think allowing Rust to consume such StringViews would be
possible but it seems very
I'm strongly in support of much of this. Thanks for bringing this up. It is
long overdue.
On initial read, my thoughts would be:
Strongly inclined:
- String view
- Constant view
Weakly inclined:
- All null
- RLE
Somewhat disinclined:
- Sequence change
With dictionary and string view, I feel
hello all,
This topic may provoke , but, given that Arrow is approaching its
6-year anniversary, I think this is an important discussion about how
we can thoughtfully expand the Arrow specifications to support
next-generation columnar data processing. In recent times, I have been
motivated by