Fair enough (wrt deprecation). I think that the sequence view is a
replacement for our existing representation (which allows O(N)
selections), but I agree with the sentiment that preserving compatibility
is more important than having a single way of doing it. Thanks for that
angle!

Imo the Arrow format is already composed of 3 specifications:

* C data interface (intra-process communication)
* IPC format (inter-process communication)
* Flight (RPC protocol)

E.g.:
* IPC requires a `dict_id` in the field declaration, but the C data
interface has no such requirement (because, pointers)
* IPC supports endianness and compression options; the C data interface
does not
* DataFusion does not support IPC (yet ^_^), but its Python bindings
leverage the C data interface to pass data to pyarrow
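
The first bullet is visible in the C data interface's own struct layout.
As a stdlib-only sketch (not an implementation, just a ctypes mirror of
the `ArrowSchema` struct as declared in the spec), note that there is no
`dict_id`, endianness, or compression field anywhere — producer and
consumer share an address space, so pointers carry all of that:

```python
import ctypes

# ctypes mirror of struct ArrowSchema from the Arrow C data interface spec.
class ArrowSchema(ctypes.Structure):
    pass

ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),       # e.g. b"u" for a utf8 string array
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),         # e.g. ARROW_FLAG_NULLABLE == 2
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),      # release callback (function pointer)
    ("private_data", ctypes.c_void_p),
]

# A nullable utf8 field: format string + flags, no serialization metadata.
s = ArrowSchema(format=b"u", name=b"col", flags=2, n_children=0)
```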

This is to say that, imo, as long as we document the different
specifications that compose Arrow and their intended purposes, it is ok.
Because the C data interface is the one with the strongest constraints
(zero-copy, higher chance of out-of-bounds reads, etc.), it makes sense
for proposals (and implementations) to first be written against it.


I agree with Neal's point wrt the IPC format. For extra context, many
`async` implementations use cooperative scheduling, which is vulnerable
to DoS when heavy CPU-bound tasks are run on it (the p-thread is blocked
and cannot switch to another task). QP Hou and I have summarized a
broader version of this statement here [1].
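
To make the hazard concrete, here is a stdlib-only asyncio sketch (the
loop size and timings are arbitrary): a CPU-bound call executed directly
on the event loop's thread stalls a latency-sensitive coroutine that
only wanted to wake up every ~10 ms.

```python
import asyncio
import time

def cpu_bound_work(n: int) -> int:
    # Stand-in for heavy deserialization work (pure Python, so it
    # never yields to the scheduler).
    total = 0
    for i in range(n):
        total += i * i
    return total

async def heartbeat(stamps: list) -> None:
    # A latency-sensitive task that wants to run every ~10 ms.
    for _ in range(3):
        stamps.append(time.monotonic())
        await asyncio.sleep(0.01)

async def main() -> float:
    stamps: list = []
    task = asyncio.create_task(heartbeat(stamps))
    await asyncio.sleep(0)  # let the heartbeat start
    # Running CPU-bound work on the event loop's thread blocks every
    # other coroutine until it completes.
    cpu_bound_work(5_000_000)
    await task
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return max(gaps)

worst_gap = asyncio.run(main())  # far larger than the intended 10 ms
```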

In async contexts, if deserializing from IPC requires a significant
amount of compute, that task should be sent to a separate thread pool, to
avoid blocking the p-threads assigned to the runtime's thread pool. If
the format is O(1) in CPU-bound work, its execution can be done in an
async context without a separate thread pool. Arrow's IPC format is quite
unique there in that loading it to memory almost always requires only
O(1) CPU work (at the expense of more disk usage).
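
A minimal sketch of the "separate thread pool" remedy with asyncio
(`deserialize` here is a hypothetical stand-in for an O(N) decode step,
not a real Arrow API):

```python
import asyncio
import concurrent.futures

def deserialize(blob: bytes) -> list:
    # Hypothetical CPU-bound decode step (e.g. decompression or
    # byte swapping would sit here in a real reader).
    return [b ^ 0xFF for b in blob]

async def handle(blob: bytes, pool) -> list:
    loop = asyncio.get_running_loop()
    # Offload the O(N) work to the pool so the runtime's p-threads
    # stay free to keep scheduling other coroutines.
    return await loop.run_in_executor(pool, deserialize, blob)

async def main() -> list:
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return await handle(bytes(range(256)), pool)

decoded = asyncio.run(main())
```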

I believe that atm we have two O(N) blocking tasks when reading the IPC
format (decompression, and byte swapping (big <-> little endian)), and
three O(N) blocking tasks when writing it (compression, de-offsetting
bitmaps, byte swapping). The more prevalent O(N) CPU-bound tasks become
at the IPC interface, the less compelling it is vs e.g. Parquet (file) or
Avro (stream), which come with an expectation of CPU-bound work. In this
context, keeping the IPC format compatible with the ABI spec is imo an
important characteristic of Apache Arrow that we should strive to
preserve. Alternatively, we could also just abandon this idea and state
that the format expects CPU-bound tasks to deserialize (even if
considerably smaller ones than Avro or Parquet), so that people can
design their APIs accordingly.
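
To make the O(1)-vs-O(N) distinction concrete, a stdlib sketch using
zlib as a stand-in for IPC buffer compression (the real format uses
LZ4/ZSTD, but the asymptotics are the same):

```python
import zlib

payload = b"\x00\x01\x02\x03" * 250_000  # ~1 MB of Arrow-like buffer bytes

# O(1) path: uncompressed, native-endian IPC lets a reader take a
# zero-copy view over an existing (e.g. memory-mapped) buffer.
view = memoryview(payload)[8:]  # no bytes are copied or touched

# O(N) path: once buffer compression is enabled, every byte must be
# processed before the data is usable.
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)
```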

Best,
Jorge

[1]
https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c


On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org> wrote:

>
>
> Le 23/12/2021 à 17:59, Neal Richardson a écrit :
> >> I think in this particular case, we should consider the C ABI /
> >> in-memory representation and IPC format as separate beasts. If an
> >> implementation of Arrow does not want to use this string-view array
> >> type at all (for example, if it created memory safety issues in Rust),
> >> then it can choose to convert to the existing string array
> >> representation when receiving a C ABI payload. Whether or not there is
> >> an alternate IPC format for this data type seems like a separate
> >> question -- my preference actually would be to support this for
> >> in-memory / C ABI use but not to alter the IPC format.
> >>
> >
> > I think this idea deserves some clarification or at least more
> exposition.
> > On first reading, it was not clear to me that we might add things to the
> > in-memory Arrow format but not IPC, that that was even an option. I'm
> > guessing I'm not the only one who missed that.
> >
> > If these new types are only part of the Arrow in-memory format, then it's
> > not the case that reading/writing IPC files involves no serialization
> > overhead. I recognize that that's technically already the case since IPC
> > supports compression now, but it's not generally how we talk about the
> > relationship between the IPC and in-memory formats (see our own FAQ [1],
> > for example). If we go forward with these changes, it would be a good
> > opportunity for us to clarify in our docs/website that the "Arrow format"
> > is not a single thing.
>
> I'm worried that making the "Arrow format" polysemic/context-dependent
> would spread a lot of confusion among potential users of Arrow.
>
> Regards
>
> Antoine.
>