Oh I'm with you on it being a precedent we want to be very careful about
setting, but if there isn't a meaningful performance difference, we may
be able to sidestep that discussion entirely.
On 02/10/2023 14:11, Antoine Pitrou wrote:
Even if performance were significantly better, I don't think it's a good
enough reason to add these representations to Arrow. By construction,
a standard cannot continuously chase the performance state of the art;
it has to weigh the benefits of performance improvements against the
increased cost for the ecosystem (for example the cost of adapting to
frequent standard changes and a growing standard size).
We have extension types, which could reasonably be used for
non-standard data types, especially those motivated by leading-edge
performance research and innovation and that come with unusual
constraints (such as requiring trusting and dereferencing raw pointers
embedded in data buffers). There could even be an argument for making
some of them canonical extension types if there is enough prior
adoption in their favor.
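(As a rough sketch of that route: an engine-specific raw-pointer layout
could travel as an extension type over an ordinary storage type, using
Arrow C++'s metadata helpers. The extension name "vendor.raw_pointer_view"
and the 16-byte storage choice below are purely illustrative.)

    // Hypothetical sketch: an engine-specific raw-pointer layout shipped as
    // an extension type over an ordinary storage type. Consumers that don't
    // recognise the extension name still see plain 16-byte fixed-size binary.
    #include <arrow/api.h>

    std::shared_ptr<arrow::Field> MakeRawPointerViewField() {
      auto metadata = arrow::key_value_metadata(
          {"ARROW:extension:name", "ARROW:extension:metadata"},
          {"vendor.raw_pointer_view", ""});
      return arrow::field("strings", arrow::fixed_size_binary(16),
                          /*nullable=*/true, metadata);
    }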
Regards
Antoine.
On 02/10/2023 15:00, Raphael Taylor-Davies wrote:
I think what would really help would be some concrete numbers: do we
have any comparing the performance of the offset-based and
pointer-based representations? If there isn't a significant performance
difference between them, would the systems that currently use a
pointer-based approach be willing to meet us in the middle and switch
to an offset-based encoding? That feels to me like it would be the best
outcome for the ecosystem as a whole.
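(For reference, the two 16-byte view layouts being compared look roughly
like this for strings longer than 12 bytes; shorter strings are inlined
in the view itself in both cases. A sketch only, with illustrative field
names rather than anything taken from the spec text:)

    #include <cstdint>

    // Offset-based view (the proposed Arrow layout): indexes into one of
    // the array's variadic data buffers.
    struct OffsetView {
      int32_t length;
      char    prefix[4];       // first four bytes, for cheap comparisons
      int32_t buffer_index;    // which variadic data buffer
      int32_t offset;          // byte offset into that buffer
    };

    // Pointer-based view (Umbra/Velox/DuckDB style): carries a raw pointer
    // into memory owned elsewhere.
    struct PointerView {
      int32_t     length;
      char        prefix[4];
      const char* data;        // must be trusted and dereferenced as-is
    };

    static_assert(sizeof(OffsetView) == 16, "");
    static_assert(sizeof(PointerView) == 16, "");  // with 64-bit pointers

Both fit in 16 bytes; the difference is whether the indirection is
relocatable (buffer index + offset) or tied to a process address space.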
Kind Regards,
Raphael
On 02/10/2023 13:50, Antoine Pitrou wrote:
On 01/10/2023 16:21, Micah Kornfield wrote:
I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented
in some Arrow libraries.
I've lost a little context, but I'm on board with all the concerns
about adding raw pointers as an official option to the spec. That
said, I can see acknowledging raw-pointer variants in prose along the
lines suggested above as the best path forward.
Things captured from this thread, or that seem obvious at least to me:
1. Divergence of the IPC spec from the in-memory/C-ABI spec?
2. More parts of the spec to cover.
3. Incompatibility with some languages.
4. Validation (different use cases require different levels of
validation, so this is a bit less of a concern to me; see the sketch
below).
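(To make the validation point concrete, a rough sketch assuming the
offset-based layout with its 12-byte inline threshold; the helper name is
hypothetical. An offset view can be bounds-checked against the buffers it
references, whereas a raw-pointer view can only be trusted:)

    #include <cstdint>
    #include <string_view>
    #include <vector>

    // Bounds-check a single offset-based view against the array's variadic
    // data buffers. A raw-pointer view has no equivalent check: the consumer
    // must trust that the pointer refers to `length` readable bytes.
    bool ValidateOffsetView(int32_t length, int32_t buffer_index,
                            int32_t offset,
                            const std::vector<std::string_view>& data_buffers) {
      if (length < 0 || offset < 0 || buffer_index < 0) return false;
      if (length <= 12) return true;  // short strings are inlined in the view
      if (static_cast<size_t>(buffer_index) >= data_buffers.size()) return false;
      return static_cast<uint64_t>(offset) + static_cast<uint64_t>(length)
             <= data_buffers[buffer_index].size();
    }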
I think the broader issue is how we think about compatibility with
other systems. For instance, what happens if Velox and DuckDB start
adding new divergent memory layouts? Are we expecting to add them to
the spec?
This is a slippery slope. The more Arrow has a policy of integrating
existing practices simply because they exist, the more the Arrow
format will become _à la carte_, with different implementations
choosing to implement whatever they want to spend their engineering
effort on (you can see this occurring, in part, in the Parquet format
with its many different encodings, compression algorithms and a 96-bit
timestamp type).
We _have_ to think carefully about the middle- and long-term future of
the format when adopting new features.
In this instance, we are doing a large part of the effort by adopting
a string view format with variadic buffers, inlined prefixes and
offset-based views into those buffers. But some implementations with
historically different internal representations will have to share
part of the effort to align with the newly standardized format.
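(A small sketch of why the inlined prefix carries much of the benefit on
its own, assuming the common {int32 length, 4-byte prefix} header sketched
earlier in the thread; the function name is illustrative:)

    #include <cstdint>
    #include <cstring>

    // Many comparisons can be decided from the 16-byte view alone, without
    // touching any variadic data buffer: if the lengths differ, or the
    // first four bytes differ, the strings cannot be equal.
    bool DefinitelyNotEqual(int32_t len_a, const char prefix_a[4],
                            int32_t len_b, const char prefix_b[4]) {
      return len_a != len_b || std::memcmp(prefix_a, prefix_b, 4) != 0;
    }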
I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.
Regards
Antoine.