Hi,

I am reporting back a conclusion that I recently arrived at when adding
support for reading Avro to Arrow.

Avro is a storage format that does not have an associated in-memory format.
In Rust, the official implementation deserializes an enum, in Python to a
vector of Object, and I suspect in Java to an equivalent vector of object.
The important aspect is that all of them use fragmented memory regions (as
opposed to what we do with e.g. one uint8 buffer for StringArray).

I benchmarked reading to arrow vs reading via the official Avro
implementations. The results are a bit surprising: reading 2^20 rows of 3
byte strings is ~6x faster than the official Avro Rust implementation and
~20x faster vs "fastavro", a C implementation with bindings for Python (pip
install fastavro), all with a difference slope (see graph below or numbers
and used code here [1]).
[image: avro_read.png]

I found this a bit surprising because we need to read row by row and
perform a transpose of the data (from rows to columns) which is usually
expensive. Furthermore, reading strings can't be that much optimized after
all.

To investigate the root cause, I drilled down to the flamegraphs for both
the official avro rust implementation and the arrow2 implementation: the
majority of the time in the Avro implementation is spent allocating
individual strings (to build the [str] - equivalents); the majority of the
time in arrow2 is equally divided between zigzag decoding (to get the
length of the item), reallocs, and utf8 validation.

My hypothesis is that the difference in performance is unrelated to a
particular implementation of arrow or avro, but to a general concept of
reading to [str] vs arrow. Specifically, the item by item allocation
strategy is far worse than what we do in Arrow with a single region which
we reallocate from time to time with exponential growth. In some
architectures we even benefit from the __memmove_avx_unaligned_erms
instruction that makes it even cheaper to reallocate.

Has anyone else performed such benchmarks or played with Avro -> Arrow and
found supporting / opposing findings to this hypothesis?

If this hypothesis holds (e.g. with a similar result against the Java
implementation of Avro), it imo puts arrow as a strong candidate for the
default format of Avro implementations to deserialize into when using it
in-memory, which could benefit both projects?

Best,
Jorge

[1] https://github.com/DataEngineeringLabs/arrow2-benches

Reply via email to