I have prototyped the sequence views in Rust [1], and it seems a pretty straightforward addition with a trivial representation in both IPC and FFI.
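To make the discussion concrete, here is a minimal sketch of the idea, assuming a layout where each entry is a fixed-size (offset, length) pair into a shared buffer. The names (`View`, `SequenceView`) and the `Arc<Vec<u8>>` buffer are illustrative only, not arrow2's actual API:

```rust
// Hypothetical sketch of a sequence view over a shared utf8 buffer:
// each entry is a fixed-size (offset, length) pair, so `take` copies
// only entries and never re-packs the variable-length values.
// Names are illustrative, not arrow2's actual API.

#[derive(Clone, Copy)]
struct View {
    offset: u64, // unsigned variant; the int64 variant is benchmarked below
    length: u64,
}

struct SequenceView {
    views: Vec<View>,
    data: std::sync::Arc<Vec<u8>>, // shared; never copied by `take`
}

impl SequenceView {
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        let start = v.offset as usize;
        std::str::from_utf8(&self.data[start..start + v.length as usize]).unwrap()
    }

    /// O(indices.len()): copies only the fixed-size view entries,
    /// which is what makes selections cheap relative to offset arrays.
    fn take(&self, indices: &[usize]) -> SequenceView {
        SequenceView {
            views: indices.iter().map(|&i| self.views[i]).collect(),
            data: self.data.clone(), // clones the Arc, not the bytes
        }
    }
}
```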
I did observe a performance difference between using signed (int64) and
unsigned (uint64) offsets/lengths:

take/sequence/20        time: [20.491 ms 20.800 ms 21.125 ms]
take/sequence_signed/20 time: [22.719 ms 23.142 ms 23.593 ms]
take/array/20           time: [44.454 ms 45.056 ms 45.712 ms]

where 20 means 2^20 entries,

* array is our current array,
* sequence is a sequence view of utf8 with uint64 indices, and
* sequence_signed is the same sequence view layout but with int64 indices.

I.e. I observe a ~10% loss to support signed offsets/lengths. Details in [2].

Best,
Jorge

[1] https://github.com/jorgecarleitao/arrow2/pull/784
[2] https://github.com/DataEngineeringLabs/arrow-string-view

On Wed, Jan 12, 2022 at 2:34 PM Andrew Lamb <al...@influxdata.com> wrote:

> I also agree that splitting the StringView proposal into its own thing
> would be beneficial for discussion clarity
>
> On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Le 12/01/2022 à 01:49, Wes McKinney a écrit :
> > > hi all,
> > >
> > > Thank you for all the comments on this mailing list thread and in the
> > > Google document. There is definitely a lot of work to take some next
> > > steps from here, so I think it would make sense to fork off each of
> > > the proposed additions into dedicated discussions. The most
> > > contentious issue, it seems, is whether to maintain a 1-to-1
> > > relationship between the IPC format and the C ABI, which would make it
> > > rather difficult to implement the "string view" data type in a way
> > > that is flexible and useful to applications (for example, giving them
> > > control over their own memory management as opposed to forcing data to
> > > be "pre-serialized" into buffers that are referenced by offsets).
> > >
> > > I tend to be of the "practicality beats purity" mindset, where
> > > sufficiently beneficial changes to the in-memory format (and C ABI)
> > > may be worth breaking the implicit contract where the IPC format and
> > > the in-memory data structures have a strict 1-to-1 relationship. To
> > > help reach some consensus around this, I suggest that I create a
> > > new document focused only on the "string/binary view" type and the
> > > different implementation considerations (like what happens when you
> > > write it to the IPC format), as well as the different variants of the
> > > data structure itself that have been discussed, with the associated
> > > trade-offs. Does this sound like a good approach?
> >
> > Indeed, this sounds like it will help make a decision.
> >
> > Personally, I am still very concerned by the idea of adding pointers to
> > the in-memory representation. Besides the loss of equivalence with the
> > IPC format, a representation using embedded pointers cannot be fully
> > validated for safety or correctness (how do you decide whether a pointer
> > is correct and doesn't reveal unrelated data?).
> >
> > I think we should discuss this with the DuckDB folks (and possibly the
> > Velox folks, but I guess that might be socio-politically more difficult)
> > so as to measure how much of an inconvenience it would be for them to
> > switch to a purely offsets-based approach.
> >
> > Regards
> >
> > Antoine.
> >
> > >
> > >
> > > Thanks,
> > > Wes
> > >
> > > On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > >>
> > >> Fair enough (wrt to deprecation). I think the sequence view is a
> > >> replacement for our existing one (that allows O(N) selections), but I
> > >> agree with the sentiment that preserving compatibility is more
> > >> important than a single way of doing it. Thanks for that angle!
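Antoine's validation concern, quoted above, has a concrete shape: an offset-based view can be fully bounds-checked against the length of the shared buffer, whereas an embedded raw pointer cannot be verified without knowing every valid allocation. A rough sketch of the offset-side check, with a purely illustrative `(offset, length)` tuple layout:

```rust
// Every (offset, length) view can be validated against the shared
// buffer's length; no equivalent check exists for raw pointers.
// The tuple layout here is illustrative, not any Arrow implementation's.

fn validate_views(views: &[(u64, u64)], data_len: u64) -> bool {
    views.iter().all(|&(offset, length)| {
        // checked_add rejects offset + length overflow before the bound check
        offset
            .checked_add(length)
            .map_or(false, |end| end <= data_len)
    })
}
```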
> > >>
> > >> Imo the Arrow format is already composed of 3 specifications:
> > >>
> > >> * C data interface (intra-process communication)
> > >> * IPC format (inter-process communication)
> > >> * Flight (RPC protocol)
> > >>
> > >> E.g.
> > >> * IPC requires a `dict_id` in the fields declaration, but the C data
> > >> interface has no such requirement (because, pointers)
> > >> * IPC accepts endianness and compression, the C data interface does
> > >> not
> > >> * DataFusion does not support IPC (yet ^_^), but its Python bindings
> > >> leverage the C data interface to pass data to pyarrow
> > >>
> > >> This is to say that imo, as long as we document the different
> > >> specifications that compose Arrow and their intended purposes, it is
> > >> ok. Because the C data interface is the one with the highest
> > >> constraints (zero-copy, higher chance of out-of-bound reads, etc.),
> > >> it makes sense for proposals (and implementations) to first be
> > >> written against it.
> > >>
> > >> I agree with Neal's point wrt the IPC. For extra context, many
> > >> `async` implementations use cooperative scheduling, which is
> > >> vulnerable to DOS if they need to perform heavy CPU-bound tasks (as
> > >> the p-thread is blocked and can't switch). QP Hou and I have
> > >> summarized a broader version of this statement here [1].
> > >>
> > >> In async contexts, if deserializing from IPC requires a significant
> > >> amount of compute, that task should be sent to a separate thread pool
> > >> to avoid blocking the p-threads assigned to the runtime's thread
> > >> pool. If the format is O(1) in CPU-bound work, its execution can be
> > >> done in an async context without a separate thread pool. Arrow's IPC
> > >> format is quite unique there in that it almost always requires O(1)
> > >> CPU work to be loaded to memory (at the expense of more disk usage).
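The scheduling argument in the quoted paragraphs can be sketched as follows. The `Batch` type and `decode` function are hypothetical stand-ins (for decompression or byte swapping), and in a real async runtime the `thread::spawn` would be something like a `spawn_blocking` call rather than a per-batch thread:

```rust
// Sketch: an O(N) CPU-bound decode (compressed batch) is handed to a
// dedicated thread so the async runtime's cooperative threads are not
// blocked; an O(1) (zero-copy) batch is used directly.
// `Batch` and `decode` are illustrative stand-ins, not a real Arrow API.

use std::thread;

struct Batch {
    compressed: bool,
    payload: Vec<u8>,
}

/// Stand-in O(N) transform for decompression or endianness swapping.
fn decode(payload: Vec<u8>) -> Vec<u8> {
    payload.iter().map(|b| b ^ 0xFF).collect()
}

fn load_batch(batch: Batch) -> Vec<u8> {
    if batch.compressed {
        // O(N) CPU-bound work: move it off the cooperative thread pool
        // (in tokio-style runtimes this would be `spawn_blocking`)
        thread::spawn(move || decode(batch.payload))
            .join()
            .expect("decode thread panicked")
    } else {
        // O(1): no CPU-bound work, safe to run directly in an async task
        batch.payload
    }
}
```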
> > >>
> > >> I believe that atm we have two O(N) blocking tasks in reading the IPC
> > >> format (decompression and byte swapping (big <-> little endian)), and
> > >> three O(N) blocking tasks in writing (compression, de-offsetting
> > >> bitmaps, byte swapping). The more prevalent O(N) CPU-bound tasks are
> > >> at the IPC interface, the less compelling it becomes vs e.g. parquet
> > >> (file) or avro (stream), which have an expectation of CPU-bound work.
> > >> In this context, keeping the IPC format compatible with the ABI spec
> > >> is imo an important characteristic of Apache Arrow that we should
> > >> strive to preserve. Alternatively, we could also just abandon this
> > >> idea and say that the format expects CPU-bound tasks to deserialize
> > >> (even if considerably smaller than for avro or parquet), so that
> > >> people can design the APIs accordingly.
> > >>
> > >> Best,
> > >> Jorge
> > >>
> > >> [1] https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c
> > >>
> > >> On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org> wrote:
> > >>
> > >>>
> > >>>
> > >>> Le 23/12/2021 à 17:59, Neal Richardson a écrit :
> > >>>>> I think in this particular case, we should consider the C ABI /
> > >>>>> in-memory representation and IPC format as separate beasts. If an
> > >>>>> implementation of Arrow does not want to use this string-view array
> > >>>>> type at all (for example, if it created memory safety issues in Rust),
> > >>>>> then it can choose to convert to the existing string array
> > >>>>> representation when receiving a C ABI payload. Whether or not there is
> > >>>>> an alternate IPC format for this data type seems like a separate
> > >>>>> question -- my preference actually would be to support this for
> > >>>>> in-memory / C ABI use but not to alter the IPC format.
> > >>>>>
> > >>>>
> > >>>> I think this idea deserves some clarification or at least more
> > >>>> exposition. On first reading, it was not clear to me that we might
> > >>>> add things to the in-memory Arrow format but not IPC, or that that
> > >>>> was even an option. I'm guessing I'm not the only one who missed
> > >>>> that.
> > >>>>
> > >>>> If these new types are only part of the Arrow in-memory format, then
> > >>>> it's not the case that reading/writing IPC files involves no
> > >>>> serialization overhead. I recognize that that's technically already
> > >>>> the case since IPC supports compression now, but it's not generally
> > >>>> how we talk about the relationship between the IPC and in-memory
> > >>>> formats (see our own FAQ [1], for example). If we go forward with
> > >>>> these changes, it would be a good opportunity for us to clarify in
> > >>>> our docs/website that the "Arrow format" is not a single thing.
> > >>>
> > >>> I'm worried that making the "Arrow format" polysemic/context-dependent
> > >>> would spread a lot of confusion among potential users of Arrow.
> > >>>
> > >>> Regards
> > >>>
> > >>> Antoine.
> > >