Thank you all for your comments.

Regarding the first comments: thanks a lot for the suggestions. I tried
mimalloc and it indeed cuts avro-rs' runtime by ~25%. =)
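For reference, the swap itself is tiny; a minimal sketch, assuming the
mimalloc crate (the workload is just an illustrative allocation-heavy toy):

    // Cargo.toml: mimalloc = "0.1"
    use mimalloc::MiMalloc;

    // Route every heap allocation of the binary through mimalloc.
    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;

    fn main() {
        // Allocation-heavy toy: 2^20 small strings, one heap box each.
        let rows: Vec<String> = (0..1usize << 20).map(|i| i.to_string()).collect();
        println!("{}", rows.len());
    }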

> This sentence is a little bit hard to parse.  Is it a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>

The graph was for a single column of a constant 3-byte string ("foo"),
divided into (avro) blocks of 4000 rows each (the default block size of
16kb). I also tried random strings of 3 and 7 bytes, an integer column,
and compressed blocks (deflate), all with comparable speedups. These
benchmarks obviously cater to the happy case. I agree that schema
evolution adds extra CPU time and that this is the happy case; I have not
benchmarked that yet.

With respect to it being a single column, I agree. The second bench that
you saw is still a single column (of integers): I wanted to check whether
the cost was in allocating the strings or the row elements themselves (the
speedup is equivalent).

However, I pushed a new bench where we read 6 columns [string, bool, int,
string, string, string|null]; the speedup is 5x for mz-avro and 4x for
avro-rs on my machine @ 2^20 rows (latest code pushed to main [1]).
[image: avro_read_mixed.png]

Wrt row iterations and native rows: my understanding is that even though
most Avro APIs present themselves as iterators of rows, internally they
read a whole compressed serialized block into memory, decompress it, and
then deserialize it item by item into rows ("read block -> decompress
block -> decode item by item into rows -> read next block"). Avro is based
on batches of rows (blocks) that are compressed individually (similar to
parquet pages, but with all column chunks serialized in a single page
within a row group).
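To make that loop concrete, a minimal sketch of its shape (the helpers are
stand-ins so the sketch compiles, not the real Avro framing):

    use std::io::{self, Read};

    // "read block -> decompress -> decode item by item -> next block"
    fn read_rows(mut reader: impl Read) -> io::Result<()> {
        let mut compressed = Vec::new();
        while read_block(&mut reader, &mut compressed)? {
            let block = decompress(&compressed); // e.g. deflate, per block
            let mut offset = 0;
            while offset < block.len() {
                // every row is decoded (and allocated) individually
                let (_row, consumed) = decode_row(&block[offset..]);
                offset += consumed;
            }
            compressed.clear();
        }
        Ok(())
    }

    // Stand-ins: real readers parse the block header (row count, size) here.
    fn read_block(r: &mut impl Read, buf: &mut Vec<u8>) -> io::Result<bool> {
        let mut chunk = [0u8; 16 * 1024]; // ~ the default block size
        let n = r.read(&mut chunk)?;
        buf.extend_from_slice(&chunk[..n]);
        Ok(n > 0)
    }
    fn decompress(block: &[u8]) -> Vec<u8> {
        block.to_vec() // pretend "null" codec
    }
    fn decode_row(block: &[u8]) -> (Vec<u8>, usize) {
        (block[..1].to_vec(), 1) // pretend each row is one byte
    }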

In this context, my thinking on Arrow vs Vec<Native> is that once loaded
in memory, a block behaves like a serialized blob that we can deserialize
to any in-memory format according to some rules.
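For instance, one of those rules per the Avro spec: a string item is a
zigzag-encoded varint length followed by that many utf8 bytes. A minimal
decoder for a single item (no handling of malformed input):

    // Decode an unsigned LEB128-style varint (Avro's wire format for longs).
    fn decode_varint(buf: &[u8]) -> (u64, usize) {
        let (mut value, mut shift, mut read) = (0u64, 0u32, 0usize);
        for &byte in buf {
            value |= ((byte & 0x7f) as u64) << shift;
            read += 1;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        (value, read)
    }

    // A string is a zigzag varint length followed by that many utf8 bytes.
    fn decode_string(buf: &[u8]) -> (&str, usize) {
        let (zigzag, read) = decode_varint(buf);
        let len = (((zigzag >> 1) as i64) ^ -((zigzag & 1) as i64)) as usize;
        (std::str::from_utf8(&buf[read..read + len]).unwrap(), read + len)
    }

    fn main() {
        // 0x06 zigzag-decodes to 3, so these 4 bytes are the string "foo".
        assert_eq!(decode_string(&[0x06, b'f', b'o', b'o']), ("foo", 4));
    }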

My hypothesis (we can bench this) is that if the user wants to perform any
compute over the data, it is advantageous to load the block into arrow
(decompressed block -> RecordBatch) and benefit from arrow's analytics
performance, as opposed to using a native row-based format where we can't
leverage SIMD or cache locality and must allocate and deallocate on every
item. As usual, there are use-cases where this does not hold - I am
thinking in terms of traditional ETL / CPU-intensive workloads.
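As a strawman of the two deserialization targets (an illustrative sketch,
not either library's actual code):

    // Row-based target: one heap allocation + utf8 validation per item.
    fn to_rows(items: &[&[u8]]) -> Vec<String> {
        items
            .iter()
            .map(|bytes| String::from_utf8(bytes.to_vec()).unwrap())
            .collect()
    }

    // Arrow-style target: one contiguous values buffer plus an offsets
    // buffer, both growing exponentially (amortized O(1) per item).
    fn to_columnar(items: &[&[u8]]) -> (Vec<u8>, Vec<i32>) {
        let mut values = Vec::new();
        let mut offsets = Vec::with_capacity(items.len() + 1);
        offsets.push(0i32);
        for bytes in items {
            values.extend_from_slice(bytes);
            offsets.push(values.len() as i32);
        }
        // simplification: arrow must also check that offsets land on char
        // boundaries; still one pass over the whole buffer, not per item.
        std::str::from_utf8(&values).unwrap();
        (values, offsets)
    }

With 2^20 rows, the first version performs 2^20 heap allocations while the
second reallocates a single buffer roughly 20 times.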

My surprise is that, even without that compute in mind, deserializing
blocks to arrow is faster than I anticipated, and I wanted to check whether
someone had gone through this exercise before trying more exotic benches.

Best,
Jorge

[1] https://github.com/dataEngineeringLabs/arrow2-benches


On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Jorge,
>
> > The results are a bit surprising: reading 2^20 rows of 3 byte strings is
> > ~6x faster than the official Avro Rust implementation and ~20x faster vs
> > "fastavro"
>
>
> This sentence is a little bit hard to parse.  Is it a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>
> There is a "Java Avro -> Arrow" implementation checked in, but it is
> somewhat broken today (I filed an issue on this a while ago); it delegates
> parsing to/from the Avro java library.  I also think there might be faster
> implementations than the canonical ones (I seem to recall a JIT version for
> Java, for example, and fastavro is another).  For both Java and Python I'd
> imagine there would be some decent speed improvements simply by avoiding
> the "boxing" task of moving language primitive types to native memory.
>
> I was planning (and still might get to it sometime in 2022) to have a C++
> parser for Avro.  Wes cross-posted this to the Avro mailing list when I
> thought I had time to work on it a couple of years ago and I don't recall
> any response to it.  The Rust avro library I believe was also just recently
> adopted/donated into the Apache Avro project.
>
> Avro seems to be pretty common, so having the ability to convert to and
> from it is, I think, generally valuable.
>
> Cheers,
> Micah
>
>
> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com>
> wrote:
>
> > Rust makes it easy to swap the global allocator to e.g. mimalloc or
> > snmalloc, even without the library explicitly supporting a change of
> > allocator. In my experience this indeed helps with allocation-heavy code
> > (I have seen changes of up to 30%).
> >
> > Best regards,
> >
> > Daniël
> >
> >
> > On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
> >
> > > Hi Jorge,
> > >
> > > Just an idea: Do the Avro libs support different allocators? Maybe
> > > using a different one (e.g. mimalloc) would yield more similar results
> > > by working around the fragmentation you described.
> > >
> > > This wouldn't change the fact that they are relatively slow, however it
> > > could allow you a better apples-to-apples comparison and thus better CPU
> > > profiling and understanding of the nuances.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > >
> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am reporting back a conclusion that I recently arrived at when
> > > > adding support for reading Avro to Arrow.
> > > >
> > > > Avro is a storage format that does not have an associated in-memory
> > > > format. In Rust, the official implementation deserializes to an enum;
> > > > in Python, to a vector of Object; and I suspect in Java to an
> > > > equivalent vector of objects. The important aspect is that all of them
> > > > use fragmented memory regions (as opposed to what we do with e.g. one
> > > > uint8 buffer for StringArray).
> > > >
> > > > I benchmarked reading to arrow vs reading via the official Avro
> > > > implementations. The results are a bit surprising: reading 2^20 rows
> > > > of 3 byte strings is ~6x faster than the official Avro Rust
> > > > implementation and ~20x faster vs "fastavro", a C implementation with
> > > > bindings for Python (pip install fastavro), all with a different slope
> > > > (see graph below or numbers and used code here [1]).
> > > > [image: avro_read.png]
> > > >
> > > > I found this a bit surprising because we need to read row by row and
> > > > perform a transpose of the data (from rows to columns), which is
> > > > usually expensive. Furthermore, reading strings can't be optimized
> > > > that much after all.
> > > >
> > > > To investigate the root cause, I drilled down to the flamegraphs for
> > > > both the official avro rust implementation and the arrow2
> > > > implementation: the majority of the time in the Avro implementation is
> > > > spent allocating individual strings (to build the [str] equivalents);
> > > > the majority of the time in arrow2 is equally divided between zigzag
> > > > decoding (to get the length of the item), reallocs, and utf8
> > > > validation.
> > > >
> > > > My hypothesis is that the difference in performance is unrelated to
> > > > the particular implementation of arrow or avro, and instead stems from
> > > > the general concept of reading to [str] vs arrow. Specifically, the
> > > > item by item allocation strategy is far worse than what we do in
> > > > Arrow: a single region which we reallocate from time to time with
> > > > exponential growth. On some architectures we even benefit from the
> > > > __memmove_avx_unaligned_erms routine, which makes reallocation even
> > > > cheaper.
> > > >
> > > > Has anyone else performed such benchmarks or played with Avro ->
> > > > Arrow and found supporting / opposing findings to this hypothesis?
> > > >
> > > > If this hypothesis holds (e.g. with a similar result against the Java
> > > > implementation of Avro), it imo puts arrow as a strong candidate for
> > > > the default format for Avro implementations to deserialize into when
> > > > using it in-memory, which could benefit both projects.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1] https://github.com/DataEngineeringLabs/arrow2-benches