I have prototyped the sequence views in Rust [1], and it seems like a pretty
straightforward addition with a trivial representation in both IPC and FFI.

I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:

take/sequence/20            time:   [20.491 ms 20.800 ms 21.125 ms]

take/sequence_signed/20     time:   [22.719 ms 23.142 ms 23.593 ms]

take/array/20               time:   [44.454 ms 45.056 ms 45.712 ms]

where 20 means 2^20 entries,
* array is our current array
* sequence is a sequence view of utf8 with uint64 indices, and
* sequence_signed is the same sequence view layout but with int64 indices

I.e., I observe a ~10% performance loss from supporting signed
offsets/lengths. Details in [2].
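For readers who have not looked at the prototype, the core idea can be
sketched in a few lines. This is an illustrative sketch only, not the actual
layout or API from [1]: each slot stores an independent (offset, length) pair
(here uint64, matching the faster benchmark variant) into a shared byte
buffer, which is what makes `take` copy only fixed-size entries instead of
string bytes.

```rust
use std::sync::Arc;

// Hypothetical sketch of a sequence view over utf8 data. Unlike the current
// offsets-based array (monotonically increasing offsets), each slot holds an
// independent (offset, length) pair, so values may overlap or be reordered
// without touching the data buffer.
struct SequenceView {
    views: Vec<(u64, u64)>, // one (offset, length) entry per slot
    data: Arc<Vec<u8>>,     // shared, immutable utf8 byte buffer
}

impl SequenceView {
    fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.data[offset as usize..(offset + len) as usize])
            .expect("offsets must point at valid utf8")
    }

    // `take` copies only the fixed-size view entries and shares the data
    // buffer, so its cost is proportional to the number of indices, not to
    // the total length of the selected strings.
    fn take(&self, indices: &[usize]) -> SequenceView {
        SequenceView {
            views: indices.iter().map(|&i| self.views[i]).collect(),
            data: Arc::clone(&self.data),
        }
    }
}

fn main() {
    let v = SequenceView {
        views: vec![(0, 5), (5, 5), (0, 10)],
        data: Arc::new(b"helloworld".to_vec()),
    };
    assert_eq!(v.value(0), "hello");
    assert_eq!(v.value(1), "world");
    let taken = v.take(&[2, 0]);
    assert_eq!(taken.value(0), "helloworld");
    assert_eq!(taken.value(1), "hello");
}
```

A signed (int64) variant would be identical except for the index type plus a
validity check on negative values, which is one plausible source of the ~10%
difference measured above.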

Best,
Jorge

[1] https://github.com/jorgecarleitao/arrow2/pull/784
[2] https://github.com/DataEngineeringLabs/arrow-string-view

On Wed, Jan 12, 2022 at 2:34 PM Andrew Lamb <al...@influxdata.com> wrote:

> I also agree that splitting the StringView proposal into its own thing
> would be beneficial for discussion clarity
>
> On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > > On 12/01/2022 at 01:49, Wes McKinney wrote:
> > > hi all,
> > >
> > > Thank you for all the comments on this mailing list thread and in the
> > > Google document. There is definitely a lot of work to take some next
> > > steps from here, so I think it would make sense to fork off each of
> > > the proposed additions into dedicated discussions. The most
> > > contentious issue, it seems, is whether to maintain a 1-to-1
> > > relationship between the IPC format and the C ABI, which would make it
> > > rather difficult to implement the "string view" data type in a way
> > > that is flexible and useful to applications (for example, giving them
> > > control over their own memory management as opposed to forcing data to
> > > be "pre-serialized" into buffers that are referenced by offsets).
> > >
> > > I tend to be of the "practicality beats purity" mindset, where
> > > sufficiently beneficial changes to the in-memory format (and C ABI)
> > > may be worth breaking the implicit contract where the IPC format and
> > > the in-memory data structures have a strict 1-to-1 relationship. To
> > > help reach some consensus around this, I suggest that I create a new
> > > document focused only on the "string/binary view" type and the
> > > different implementation considerations (like what happens when you
> > > write it to the IPC format), as well as the different variants of the
> > > data structure itself that have been discussed, with the associated
> > > trade-offs. Does this sound like a good approach?
> >
> > Indeed, this sounds like it will help in making a decision.
> >
> > Personally, I am still very concerned by the idea of adding pointers to
> > the in-memory representation. Besides the loss of equivalence with the
> > IPC format, a representation using embedded pointers cannot be fully
> > validated for safety or correctness (how do you decide whether a pointer
> > is correct and doesn't reveal unrelated data?).
> >
> > I think we should discuss this with the DuckDB folks (and possibly the
> > Velox folks, but I guess that it might be socio-politically more
> > difficult) so as to measure how much of an inconvenience it would be
> > for them to switch to a purely offsets-based approach.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > >
> > > Thanks,
> > > Wes
> > >
> > >
> > > On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > >>
> > >> Fair enough (wrt deprecation). I think that the sequence view is a
> > >> replacement for our existing one (that allows O(N) selections), but
> > >> I agree with the sentiment that preserving compatibility is more
> > >> important than a single way of doing it. Thanks for that angle!
> > >>
> > >> Imo the Arrow format is already composed of 3 specifications:
> > >>
> > >> * C data interface (intra-process communication)
> > >> * IPC format (inter-process communication)
> > >> * Flight (RPC protocol)
> > >>
> > >> E.g.
> > >> * IPC requires a `dict_id` in the fields declaration, but the C data
> > >> interface has no such requirement (because, pointers)
> > >> * IPC accepts endian and compression, the C data interface does not
> > >> * DataFusion does not support IPC (yet ^_^), but its Python bindings
> > >> leverage the C data interface to pass data to pyarrow
> > >>
> > >> This is to say that imo, as long as we document the different
> > >> specifications that compose Arrow and their intended purposes, it is
> > >> ok. Because the C data interface is the one with the highest
> > >> constraints (zero-copy, higher chance of out-of-bound reads, etc.),
> > >> it makes sense for proposals (and implementations) to first be
> > >> written against it.
> > >>
> > >>
> > >> I agree with Neal's point wrt the IPC. For extra context, many
> > >> `async` implementations use cooperative scheduling, which makes them
> > >> vulnerable to DoS if they need to perform heavy CPU-bound tasks (as
> > >> the p-thread is blocked and can't switch). QP Hou and I have
> > >> summarized a broader version of this statement here [1].
> > >>
> > >> In async contexts, if deserializing from IPC requires a significant
> > >> amount of compute, that task should be sent to a separate thread
> > >> pool to avoid blocking the p-threads assigned to the runtime's
> > >> thread pool. If the format is O(1) in CPU-bound work, its execution
> > >> can be done in an async context without a separate thread pool.
> > >> Arrow's IPC format is quite unique there in that it almost always
> > >> requires O(1) CPU work to be loaded to memory (at the expense of
> > >> more disk usage).
> > >>
> > >> I believe that atm we have two O(N) blocking tasks in reading the
> > >> IPC format (decompression and byte swapping (big <-> little
> > >> endian)), and three O(N) blocking tasks in writing (compression,
> > >> de-offsetting bitmaps, byte swapping). The more prevalent O(N)
> > >> CPU-bound tasks are at the IPC interface, the less compelling it
> > >> becomes vs e.g. Parquet (file) or Avro (stream), which have an
> > >> expectation of CPU-bound work. In this context, keeping the IPC
> > >> format compatible with the ABI spec is imo an important
> > >> characteristic of Apache Arrow that we should strive to preserve.
> > >> Alternatively, we could also just abandon this idea and say that the
> > >> format expects CPU-bound tasks to deserialize (even if considerably
> > >> smaller than Avro or Parquet), so that people can design the APIs
> > >> accordingly.
> > >>
> > >> Best,
> > >> Jorge
> > >>
> > >> [1] https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c
> > >>
> > >>
> > >> On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> > >>
> > >>>
> > >>>
> > >>> On 23/12/2021 at 17:59, Neal Richardson wrote:
> > >>>>> I think in this particular case, we should consider the C ABI /
> > >>>>> in-memory representation and IPC format as separate beasts. If an
> > >>>>> implementation of Arrow does not want to use this string-view array
> > >>>>> type at all (for example, if it created memory safety issues in
> > >>>>> Rust), then it can choose to convert to the existing string array
> > >>>>> representation when receiving a C ABI payload. Whether or not there
> > >>>>> is an alternate IPC format for this data type seems like a separate
> > >>>>> question -- my preference actually would be to support this for
> > >>>>> in-memory / C ABI use but not to alter the IPC format.
> > >>>>>
> > >>>>
> > >>>> I think this idea deserves some clarification or at least more
> > >>>> exposition. On first reading, it was not clear to me that we might
> > >>>> add things to the in-memory Arrow format but not IPC, that that
> > >>>> was even an option. I'm guessing I'm not the only one who missed
> > >>>> that.
> > >>>>
> > >>>> If these new types are only part of the Arrow in-memory format,
> > >>>> then it's not the case that reading/writing IPC files involves no
> > >>>> serialization overhead. I recognize that that's technically
> > >>>> already the case since IPC supports compression now, but it's not
> > >>>> generally how we talk about the relationship between the IPC and
> > >>>> in-memory formats (see our own FAQ [1], for example). If we go
> > >>>> forward with these changes, it would be a good opportunity for us
> > >>>> to clarify in our docs/website that the "Arrow format" is not a
> > >>>> single thing.
> > >>>
> > >>> I'm worried that making the "Arrow format" polysemic/context-dependent
> > >>> would spread a lot of confusion among potential users of Arrow.
> > >>>
> > >>> Regards
> > >>>
> > >>> Antoine.
> > >>>
> >
>
