> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
Sorry for the late reply; it took me a bit to go through the relevant parts
of the Java implementation. I agree that the deserialization of items within
a block is done on a per-item basis, and can even re-use a previously
allocated item [1]. From what I can read, the blocks are still read into
memory as whole chunks via `nextRawBlock` [2]. I.e. even for row-oriented
processing, the stream is still composed of blocks that are first read into
memory and then deserialized row by row (and item by item within a row).

> Do you have a target system in mind? As I said, for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains.

I agree - I was thinking in terms of columnar query engines aiming at
leveraging SIMD and data locality.

> That being said, I'd love to see real world ETL pipeline benchmarks :)

Definitely. This was an educational exercise.

[1] https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java#L160
[2] https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213
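To make the pattern concrete, below is a minimal sketch in Rust of the shape
of such a reader. The types and the block framing are hypothetical (a toy
fixed-width header, no codec), not the actual avro-rs or Java API; the real
format uses zigzag varints and optional compression:

    use std::io::{self, Read};

    // Hypothetical type for illustration: a decompressed block holding
    // `row_count` rows serialized back to back.
    struct Block {
        row_count: usize,
        data: Vec<u8>,
    }

    // "read block -> decompress block": the whole block is materialized in
    // memory before any row is decoded. Toy header: two little-endian u64s
    // (row count, payload length) instead of Avro's zigzag varints.
    fn next_raw_block<R: Read>(reader: &mut R) -> io::Result<Option<Block>> {
        let mut header = [0u8; 16];
        if reader.read_exact(&mut header).is_err() {
            return Ok(None); // end of stream
        }
        let row_count = u64::from_le_bytes(header[..8].try_into().unwrap()) as usize;
        let len = u64::from_le_bytes(header[8..].try_into().unwrap()) as usize;
        let mut data = vec![0u8; len];
        reader.read_exact(&mut data)?;
        // a real reader would decompress `data` here (deflate, snappy, ...)
        Ok(Some(Block { row_count, data }))
    }

    // "decode item by item into rows": lazy, row-at-a-time deserialization
    // against the in-memory block. A row here is one length-prefixed string.
    fn rows(block: &Block) -> impl Iterator<Item = String> + '_ {
        let mut offset = 0;
        (0..block.row_count).map(move |_| {
            let len = block.data[offset] as usize; // toy 1-byte length prefix
            let bytes = block.data[offset + 1..offset + 1 + len].to_vec();
            offset += 1 + len;
            String::from_utf8(bytes).unwrap() // one heap allocation per row
        })
    }

    fn main() -> io::Result<()> {
        // a two-row block ("foo", "ab") preceded by the toy header
        let mut bytes = Vec::new();
        bytes.extend_from_slice(&2u64.to_le_bytes()); // row count
        bytes.extend_from_slice(&7u64.to_le_bytes()); // payload length
        bytes.extend_from_slice(&[3, b'f', b'o', b'o', 2, b'a', b'b']);
        let mut reader = bytes.as_slice();
        while let Some(block) = next_raw_block(&mut reader)? {
            for row in rows(&block) {
                println!("{row}");
            }
        }
        Ok(())
    }

The laziness is per row, but only within a block that has already been
materialized (and decompressed) in memory, which matches the `nextRawBlock`
behaviour linked above.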
On Tue, Nov 2, 2021 at 6:41 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows, internally
>> they read a whole compressed serialized block into memory, decompress it,
>> and then deserialize item by item into a row ("read block -> decompress
>> block -> decode item by item into rows -> read next block"). Avro is
>> based on batches of rows (blocks) that are compressed individually
>> (similar to parquet pages, but all column chunks are serialized in a
>> single page within a row group).
>
> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
>
>> My hypothesis (we can bench this) is that if the user wants to perform
>> any compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where
>> we can't leverage SIMD/cache hits/must allocate and deallocate on every
>> item. As usual, there are use-cases where this does not hold - I am
>> thinking in terms of traditional ETL / CPU intensive stuff.
>
> Do you have a target system in mind? As I said, for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains. There is also non-zero engineering effort to implement the
> necessary filter/selection push down APIs that most of them provide. That
> being said, I'd love to see real world ETL pipeline benchmarks :)
>
> On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
>
>> Thank you all for all your comments.
>>
>> To the first comments: thanks a lot for your suggestions. I tried with
>> mimalloc and there is indeed a -25% improvement for avro-rs. =)
>>
>>> This sentence is a little bit hard to parse. Is it a row of 3 strings
>>> or a row of 1 string consisting of 3 bytes? Was the example hard-coded?
>>> A lot of the complexity of parsing avro is the schema evolution rules;
>>> I haven't looked at whether the canonical implementations do any
>>> optimization for the happy case when reader and writer schema are the
>>> same.
>>
>> The graph was for a single column of a constant string of 3 bytes
>> ("foo"), divided into (avro) blocks of 4000 rows each (default block
>> size of 16kb). I also tried random strings of 3 bytes and 7 bytes, as
>> well as an integer column, and compressed blocks (deflate): with equal
>> speedups. Generic benchmarks are obviously catered for. I agree that
>> schema evolution adds extra CPU time, and that this is the happy case; I
>> have not benchmarked those yet.
>>
>> With respect to being a single column, I agree. The second bench that
>> you saw is still a single column (of integers): I wanted to check
>> whether the cost was the allocation of the strings, or the elements of
>> the rows (the speedup is equivalent).
>>
>> However, I pushed a new bench where we are reading 6 columns [string,
>> bool, int, string, string, string|null]; the speedup is 5x for mz-avro
>> and 4x for avro-rs on my machine @ 2^20 rows (pushed latest code to main
>> [1]).
>> [image: avro_read_mixed.png]
>>
>> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows,
>> internally they read a whole compressed serialized block into memory,
>> decompress it, and then deserialize item by item into a row ("read block
>> -> decompress block -> decode item by item into rows -> read next
>> block"). Avro is based on batches of rows (blocks) that are compressed
>> individually (similar to parquet pages, but all column chunks are
>> serialized in a single page within a row group).
>>
>> In this context, my thinking of Arrow vs Vec<Native> is that once loaded
>> in memory, a block behaves like a serialized blob that we can
>> deserialize to any in-memory format according to some rules.
>>
>> My hypothesis (we can bench this) is that if the user wants to perform
>> any compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where
>> we can't leverage SIMD/cache hits/must allocate and deallocate on every
>> item. As usual, there are use-cases where this does not hold - I am
>> thinking in terms of traditional ETL / CPU intensive stuff.
>>
>> My surprise is that even without the compute in mind, deserializing
>> blocks to arrow is faster than I anticipated, and I wanted to check if
>> someone had gone through this exercise before trying more exotic
>> benches.
>>
>> Best,
>> Jorge
>>
>> [1] https://github.com/dataEngineeringLabs/arrow2-benches
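As an aside on the Arrow vs Vec<Native> point quoted above, the allocation
contrast can be reduced to roughly the following (a simplified sketch, not
the arrow2 API; the second function mimics the offsets + values layout of an
Arrow StringArray):

    fn to_rows(items: &[&str]) -> Vec<String> {
        // row-native decoding: one heap allocation per item
        items.iter().map(|s| s.to_string()).collect()
    }

    fn to_arrow_like(items: &[&str]) -> (Vec<i32>, Vec<u8>) {
        // arrow-like decoding: all values land in one contiguous buffer;
        // `values` grows geometrically, so allocations amortize to O(1)
        // per item
        let mut offsets = Vec::with_capacity(items.len() + 1);
        let mut values = Vec::new();
        offsets.push(0i32);
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        (offsets, values) // the offsets + values buffers of a StringArray
    }

    fn main() {
        let items = ["foo", "bar", "baz"];
        let (offsets, values) = to_arrow_like(&items);
        assert_eq!(offsets, vec![0, 3, 6, 9]);
        assert_eq!(values, b"foobarbaz".to_vec());
        assert_eq!(to_rows(&items).len(), 3);
    }

The row-native path pays one allocator round trip per string; the columnar
path amortizes reallocations of a single region, which is where the
flamegraphs discussed further down diverge.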
>>> >>> There is a "Java Avro -> Arrow" implementation checked but it is somewhat >>> broken today (I filed an issue on this a while ago) that delegates >>> parsing >>> the t/from the Avro java library. I also think there might be faster >>> implementations that aren't the canonical implementations (I seem to >>> recall >>> a JIT version for java for example and fastavro is another). For both >>> Java >>> and Python I'd imagine there would be some decent speed improvements >>> simply >>> by avoiding the "boxing" task of moving language primitive types to >>> native >>> memory. >>> >>> I was planning (and still might get to it sometime in 2022) to have a C++ >>> parser for Avro. Wes cross-posted this to the Avro mailing list when I >>> thought I had time to work on it a couple of years ago and I don't recall >>> any response to it. The Rust avro library I believe was also just >>> recently >>> adopted/donated into the Apache Avro project. >>> >>> Avro seems to be pretty common so having the ability to convert to and >>> from >>> it is I think is generally valuable. >>> >>> Cheers, >>> Micah >>> >>> >>> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com> >>> wrote: >>> >>> > Rust allows to easily swap the global allocator to e.g. mimalloc or >>> > snmalloc, even without the library supporting to change the allocator. >>> In >>> > my experience this indeed helps with allocation heavy code (I have seen >>> > changes of up to 30%). >>> > >>> > Best regards, >>> > >>> > Daniël >>> > >>> > >>> > On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote: >>> > >>> > > Hi Jorge, >>> > > >>> > > Just an idea: Do the Avro libs support different allocators? Maybe >>> using >>> > a >>> > > different one (e.g. mimalloc) would yield more similar results by >>> working >>> > > around the fragmentation you described. >>> > > >>> > > This wouldn't change the fact that they are relatively slow, however >>> it >>> > > could allow you better apples to apples comparison thus better CPU >>> > > profiling and understanding of the nuances. >>> > > >>> > > Best regards, >>> > > Adam Lippai >>> > > >>> > > >>> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão < >>> > jorgecarlei...@gmail.com >>> > > > >>> > > wrote: >>> > > >>> > > > Hi, >>> > > > >>> > > > I am reporting back a conclusion that I recently arrived at when >>> adding >>> > > > support for reading Avro to Arrow. >>> > > > >>> > > > Avro is a storage format that does not have an associated in-memory >>> > > > format. In Rust, the official implementation deserializes an enum, >>> in >>> > > > Python to a vector of Object, and I suspect in Java to an >>> equivalent >>> > > vector >>> > > > of object. The important aspect is that all of them use fragmented >>> > memory >>> > > > regions (as opposed to what we do with e.g. one uint8 buffer for >>> > > > StringArray). >>> > > > >>> > > > I benchmarked reading to arrow vs reading via the official Avro >>> > > > implementations. The results are a bit surprising: reading 2^20 >>> rows >>> > of 3 >>> > > > byte strings is ~6x faster than the official Avro Rust >>> implementation >>> > and >>> > > > ~20x faster vs "fastavro", a C implementation with bindings for >>> Python >>> > > (pip >>> > > > install fastavro), all with a difference slope (see graph below or >>> > > numbers >>> > > > and used code here [1]). 
>>> > On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
>>> >
>>> > > Hi Jorge,
>>> > >
>>> > > Just an idea: do the Avro libs support different allocators? Maybe
>>> > > using a different one (e.g. mimalloc) would yield more similar
>>> > > results by working around the fragmentation you described.
>>> > >
>>> > > This wouldn't change the fact that they are relatively slow;
>>> > > however, it could allow a better apples-to-apples comparison and
>>> > > thus better CPU profiling and understanding of the nuances.
>>> > >
>>> > > Best regards,
>>> > > Adam Lippai
>>> > >
>>> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão
>>> > > <jorgecarlei...@gmail.com> wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > I am reporting back a conclusion that I recently arrived at when
>>> > > > adding support for reading Avro to Arrow.
>>> > > >
>>> > > > Avro is a storage format that does not have an associated
>>> > > > in-memory format. In Rust, the official implementation
>>> > > > deserializes to an enum, in Python to a vector of Object, and I
>>> > > > suspect in Java to an equivalent vector of Objects. The important
>>> > > > aspect is that all of them use fragmented memory regions (as
>>> > > > opposed to what we do with e.g. one uint8 buffer for
>>> > > > StringArray).
>>> > > >
>>> > > > I benchmarked reading to arrow vs reading via the official Avro
>>> > > > implementations. The results are a bit surprising: reading 2^20
>>> > > > rows of 3 byte strings is ~6x faster than the official Avro Rust
>>> > > > implementation and ~20x faster vs "fastavro", a C implementation
>>> > > > with bindings for Python (pip install fastavro), all with a
>>> > > > different slope (see graph below or numbers and used code here
>>> > > > [1]).
>>> > > > [image: avro_read.png]
>>> > > >
>>> > > > I found this a bit surprising because we need to read row by row
>>> > > > and perform a transpose of the data (from rows to columns), which
>>> > > > is usually expensive. Furthermore, reading strings can't be
>>> > > > optimized that much after all.
>>> > > >
>>> > > > To investigate the root cause, I drilled down to the flamegraphs
>>> > > > for both the official avro rust implementation and the arrow2
>>> > > > implementation: the majority of the time in the Avro
>>> > > > implementation is spent allocating individual strings (to build
>>> > > > the [str] equivalents); the majority of the time in arrow2 is
>>> > > > equally divided between zigzag decoding (to get the length of the
>>> > > > item), reallocs, and utf8 validation.
>>> > > >
>>> > > > My hypothesis is that the difference in performance is unrelated
>>> > > > to a particular implementation of arrow or avro, but due to the
>>> > > > general concept of reading to [str] vs arrow. Specifically, the
>>> > > > item-by-item allocation strategy is far worse than what we do in
>>> > > > Arrow with a single region which we reallocate from time to time
>>> > > > with exponential growth. In some architectures we even benefit
>>> > > > from the __memmove_avx_unaligned_erms instruction that makes it
>>> > > > even cheaper to reallocate.
>>> > > >
>>> > > > Has anyone else performed such benchmarks or played with Avro ->
>>> > > > Arrow and found supporting / opposing findings to this
>>> > > > hypothesis?
>>> > > >
>>> > > > If this hypothesis holds (e.g. with a similar result against the
>>> > > > Java implementation of Avro), it imo puts arrow as a strong
>>> > > > candidate for the default format for Avro implementations to
>>> > > > deserialize into when using it in-memory, which could benefit
>>> > > > both projects?
>>> > > >
>>> > > > Best,
>>> > > > Jorge
>>> > > >
>>> > > > [1] https://github.com/DataEngineeringLabs/arrow2-benches
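For reference, the "zigzag decoding" that shows up in the arrow2 flamegraph
refers to Avro's variable-length zigzag integers, used e.g. as string length
prefixes. A minimal sketch of the decoder (not arrow2's actual
implementation):

    // Decode one Avro long: little-endian base-128 varint, then zigzag.
    // Returns (value, bytes consumed), or None if the buffer ends
    // mid-varint or the varint is malformed.
    fn read_zigzag_long(buf: &[u8]) -> Option<(i64, usize)> {
        let mut unsigned: u64 = 0;
        let mut shift = 0u32;
        for (i, &byte) in buf.iter().enumerate() {
            unsigned |= u64::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                // undo zigzag: (n >> 1) ^ -(n & 1) maps 0,1,2,3,... to
                // 0,-1,1,-2,...
                let value = (unsigned >> 1) as i64 ^ -((unsigned & 1) as i64);
                return Some((value, i + 1));
            }
            shift += 7;
            if shift >= 64 {
                return None; // malformed: more than 10 continuation bytes
            }
        }
        None
    }

    fn main() {
        // the length prefix of "foo" (3) encodes to the single byte 0x06
        assert_eq!(read_zigzag_long(&[0x06]), Some((3, 1)));
        assert_eq!(read_zigzag_long(&[0x05]), Some((-3, 1)));
    }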