On 12/01/2022 at 01:49, Wes McKinney wrote:
hi all,

Thank you for all the comments on this mailing list thread and in the
Google document. There is definitely a lot of work to take some next
steps from here, so I think it would make sense to fork off each of
the proposed additions into dedicated discussions. The most
contentious issue, it seems, is whether to maintain a 1-to-1
relationship between the IPC format and the C ABI, which would make it
rather difficult to implement the "string view" data type in a way
that is flexible and useful to applications (for example, giving them
control over their own memory management as opposed to forcing data to
be "pre-serialized" into buffers that are referenced by offsets).
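(To make the trade-off concrete, here is a rough sketch of a 16-byte "view" element in the variants that have been floated: short strings inlined, long strings referenced either by a raw pointer or by a buffer index plus offset. The names and exact layout are illustrative only, not any engine's actual definition.)

```c
#include <stdint.h>

/* Hypothetical 16-byte string-view element, for illustration only. */
typedef union {
    struct {                   /* short strings (<= 12 bytes) stored inline */
        uint32_t length;
        uint8_t  data[12];
    } inlined;
    struct {                   /* long string referenced via raw pointer */
        uint32_t length;
        uint8_t  prefix[4];    /* first 4 bytes, for fast comparisons */
        const uint8_t *ptr;    /* flexible, but cannot be validated */
    } pointer_based;
    struct {                   /* long string referenced via offsets */
        uint32_t length;
        uint8_t  prefix[4];
        uint32_t buffer_index; /* index into a known list of buffers, */
        uint32_t offset;       /* so every access can be bounds-checked */
    } offset_based;
} StringView;
```

The pointer variant lets an application keep control of its own memory management; the offset variant keeps every referenced byte inside buffers the consumer already knows about.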

I tend to be of the "practicality beats purity" mindset, where
sufficiently beneficial changes to the in-memory format (and C ABI)
may be worth breaking the implicit contract where the IPC format and
the in-memory data structures have a strict 1-to-1 relationship. To
help reach some consensus around this, I suggest that I create a new
document focused only on the "string/binary view" type and the
different implementation considerations (like what happens when you
write it to the IPC format), as well as the different variants of the
data structure itself that have been discussed, with their associated
trade-offs. Does this sound like a good approach?

Indeed, this sounds like it will help us make a decision.

Personally, I am still very concerned by the idea of adding pointers to the in-memory representation. Besides the loss of equivalence with the IPC format, a representation using embedded pointers cannot be fully validated for safety or correctness (how do you decide whether a pointer is correct and doesn't reveal unrelated data?).
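(To illustrate the asymmetry: an offsets-based view can be validated against the sizes of the array's known buffers, whereas a raw pointer cannot be checked against anything. A minimal sketch, with hypothetical names:)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validation of an offsets-based view: every referenced byte
   must fall inside one of the array's known buffers. No equivalent check
   exists for a raw pointer, which could point anywhere in the process. */
bool validate_view(uint32_t buffer_index, uint32_t offset, uint32_t length,
                   const size_t *buffer_sizes, size_t n_buffers) {
    if (buffer_index >= n_buffers)
        return false;
    /* 64-bit sum avoids overflow of offset + length */
    return (uint64_t)offset + (uint64_t)length <= buffer_sizes[buffer_index];
}
```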

I think we should discuss this with the DuckDB folks (and possibly the Velox folks, but I guess that it might be socio-politically more difficult) so as to measure how much of an inconvenience it would be for them to switch to a purely offsets-based approach.

Regards

Antoine.




Thanks,
Wes


On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:

Fair enough (wrt deprecation). I think that the sequence view is a
replacement for our existing one (that allows O(N) selections), but I
agree with the sentiment that preserving compatibility is more important
than having a single way of doing it. Thanks for that angle!

Imo the Arrow format is already composed of 3 specifications:

* C data interface (intra-process communication)
* IPC format (inter-process communication)
* Flight (RPC protocol)

E.g.
* the IPC format requires a `dict_id` in the field declarations, but the
C data interface has no such requirement (because, pointers)
* the IPC format supports non-native endianness and compression; the C
data interface does not
* DataFusion does not support IPC (yet ^_^), but its Python bindings
leverage the C data interface to pass data to pyarrow

This is to say that, imo, as long as we document the different
specifications that compose Arrow and their intended purposes, it is ok.
Because the C data interface is the one with the highest constraints
(zero-copy, higher chance of out-of-bounds reads, etc.), it makes sense
for proposals (and implementations) to first be written against it.
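(For readers less familiar with the C data interface: its core is a pair of plain C structs handed over by pointer, with ownership returned through a release callback. Below is the ArrowArray struct as specified, plus a toy consumer; the `example_release` producer is made up for illustration.)

```c
#include <stdint.h>
#include <stddef.h>

/* struct ArrowArray as defined by the Arrow C data interface spec
   (the companion ArrowSchema struct is omitted here). */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Toy producer-side release callback, for illustration only. */
static int example_released = 0;
static void example_release(struct ArrowArray* array) {
    example_released = 1;   /* a real producer would free its buffers here */
    array->release = NULL;  /* per the spec: mark the struct as released */
}

/* Consumer side: data is shared zero-copy through raw pointers; the
   consumer signals it is done by invoking the producer's release callback. */
void consume(struct ArrowArray* array) {
    /* ... read array->buffers[i] directly, no copy ... */
    if (array->release != NULL) {
        array->release(array);
    }
}
```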


I agree with Neal's point wrt the IPC. For extra context, many `async`
implementations use cooperative scheduling, which is vulnerable to DoS if
they need to perform heavy CPU-bound tasks (as the p-thread is blocked and
can't switch). QP Hou and I have summarized a broader version of this
statement here [1].

In async contexts, if deserializing from IPC requires a significant amount
of compute, that task should be sent to a separate thread pool, to avoid
blocking the p-threads assigned to the runtime's thread pool. If the
format requires only O(1) CPU-bound work, it can be loaded in an async
context without a separate thread pool. Arrow's IPC format is quite unique
there in that it almost always requires O(1) CPU work to be loaded into
memory (at the expense of more disk usage).

I believe that atm we have two O(N) blocking tasks in reading the IPC
format (decompression and byte swapping (big <-> little endian)), and
three O(N) blocking tasks in writing (compression, de-offsetting bitmaps,
byte swapping). The more O(N) CPU-bound tasks prevail at the IPC
interface, the less compelling it becomes vs e.g. Parquet (file) or Avro
(stream), which come with an expectation of CPU-bound work. In this
context, keeping the IPC format
compatible with the ABI spec is imo an important characteristic of Apache
Arrow that we should strive to preserve. Alternatively, we could also just
abandon this idea and say that the format expects CPU-bound tasks to
deserialize (even if considerably smaller than avro or parquet), so that
people can design the APIs accordingly.

Best,
Jorge

[1]
https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c


On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org> wrote:



On 23/12/2021 at 17:59, Neal Richardson wrote:
I think in this particular case, we should consider the C ABI /
in-memory representation and IPC format as separate beasts. If an
implementation of Arrow does not want to use this string-view array
type at all (for example, if it created memory safety issues in Rust),
then it can choose to convert to the existing string array
representation when receiving a C ABI payload. Whether or not there is
an alternate IPC format for this data type seems like a separate
question -- my preference actually would be to support this for
in-memory / C ABI use but not to alter the IPC format.
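(A sketch of what such a conversion could look like: copying pointer-based views into the existing offsets + data representation. Types and names are hypothetical, simplified to a bare length/pointer view; error handling is omitted.)

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified pointer-based view element. */
typedef struct {
    uint32_t length;
    const uint8_t *ptr;
} View;

/* The existing offsets-based string representation. */
typedef struct {
    int32_t *offsets;   /* n + 1 entries */
    uint8_t *data;      /* concatenated string bytes */
} StringArray;

/* "Pre-serialize" n views into offsets + data by copying each string. */
StringArray views_to_offsets(const View *views, size_t n) {
    StringArray out;
    size_t total = 0;
    for (size_t i = 0; i < n; i++) total += views[i].length;
    out.offsets = malloc((n + 1) * sizeof(int32_t));
    out.data = malloc(total ? total : 1);
    int32_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        out.offsets[i] = pos;
        memcpy(out.data + pos, views[i].ptr, views[i].length);
        pos += (int32_t)views[i].length;
    }
    out.offsets[n] = pos;
    return out;
}
```

This is the O(N) copy that a strict in-memory/IPC equivalence would force on every producer; doing it only at the IPC boundary confines the cost to serialization.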


I think this idea deserves some clarification, or at least more
exposition. On first reading, it was not clear to me that adding things
to the in-memory Arrow format but not to the IPC format was even an
option. I'm guessing I'm not the only one who missed that.

If these new types are only part of the Arrow in-memory format, then it's
not the case that reading/writing IPC files involves no serialization
overhead. I recognize that this is technically already the case since IPC
supports compression now, but it's not generally how we talk about the
relationship between the IPC and in-memory formats (see our own FAQ [1],
for example). If we go forward with these changes, it would be a good
opportunity for us to clarify in our docs/website that the "Arrow format"
is not a single thing.

I'm worried that making the "Arrow format" polysemic/context-dependent
would spread a lot of confusion among potential users of Arrow.

Regards

Antoine.
