> Wrt to row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows, internally
> they read a whole compressed serialized block into memory, decompress it,
> and then deserialize item by item into a row ("read block -> decompress
> block -> decode item by item into rows -> read next block"). Avro is based
> on batches of rows (blocks) that are compressed individually (similar to
> parquet pages, but all column chunks are serialized in a single page
> within a row group).
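The "decode item by item into rows" step above can be illustrated with a minimal sketch, assuming an already-decompressed block of Avro longs (zigzag + varint encoded); `read_varint`, `zigzag_decode`, and `decode_long_block` are illustrative helpers, not the avro-rs or arrow2 APIs:

```rust
// Sketch of decoding one uncompressed Avro block of longs, row by row.
// Helper names are hypothetical, not an existing library API.

/// Read one unsigned LEB128-style varint from `buf`, advancing `pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> u64 {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = buf[*pos];
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return value;
        }
        shift += 7;
    }
}

/// Undo Avro's zigzag encoding (0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...).
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Decode a block containing `count` longs, one item at a time.
fn decode_long_block(buf: &[u8], count: usize) -> Vec<i64> {
    let mut pos = 0;
    (0..count)
        .map(|_| zigzag_decode(read_varint(buf, &mut pos)))
        .collect()
}

fn main() {
    // zigzag(1) = 2, zigzag(-1) = 1, zigzag(150) = 300 = [0xac, 0x02].
    let block = [0x02, 0x01, 0xac, 0x02];
    assert_eq!(decode_long_block(&block, 3), vec![1, -1, 150]);
}
```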
I haven't looked at it for a while, but my recollection, at least in Java,
is that each step outlined is a streaming process rather than a batch
process (i.e. decompress some bytes, then decode them lazily as "next row"
is called).

> My hypothesis (we can bench this) is that if the user wants to perform
> any compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where
> we can't leverage SIMD/cache hits/must allocate and deallocate on every
> item. As usual, there are use-cases where this does not hold - I am
> thinking in terms of traditional ETL / CPU intensive stuff.

Do you have a target system in mind? As I said, for columnar/arrow-native
query engines this obviously sounds like a win, but for row-oriented
processing engines the transposition costs are going to eat into any
gains. There is also non-zero engineering effort to implement the
necessary filter/selection push-down APIs that most of them provide.

That being said, I'd love to see real-world ETL pipeline benchmarks :)

On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thank you all for all your comments.
>
> The first comments: thanks a lot for your suggestions. I tried with
> mimalloc and there is indeed a -25% improvement for avro-rs. =)
>
>> This sentence is a little bit hard to parse. Is it a row of 3 strings
>> or a row of 1 string consisting of 3 bytes? Was the example hard-coded?
>> A lot of the complexity of parsing avro is the schema evolution rules.
>> I haven't looked at whether the canonical implementations do any
>> optimization for the happy case when reader and writer schema are the
>> same.
>
> The graph was for a single column of a constant string of 3 bytes
> ("foo"), each divided into (avro) blocks of 4000 rows each (default
> block size of 16kb).
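The row-to-column transposition whose cost is debated above can be sketched in a few lines; the `Row` type below is a toy assumption for illustration, not part of any of the libraries discussed:

```rust
// Hypothetical sketch of transposing decoded rows into per-column
// vectors - the step a row-oriented engine would have to pay in reverse.
struct Row {
    name: String,
    value: i64,
}

/// Turn a slice of rows into one vector per column.
fn transpose(rows: &[Row]) -> (Vec<String>, Vec<i64>) {
    let mut names = Vec::with_capacity(rows.len());
    let mut values = Vec::with_capacity(rows.len());
    for row in rows {
        names.push(row.name.clone());
        values.push(row.value);
    }
    (names, values)
}

fn main() {
    let rows = vec![
        Row { name: "a".into(), value: 1 },
        Row { name: "b".into(), value: 2 },
    ];
    let (names, values) = transpose(&rows);
    assert_eq!(names, vec!["a", "b"]);
    assert_eq!(values, vec![1, 2]);
}
```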
> I also tried random strings of 3 bytes and 7 bytes, as well as an
> integer column, and compressed blocks (deflate), with equal speedups.
> Generic benchmarks are obviously catered for. I agree that schema
> evolution adds extra CPU time, and that this is the happy case; I have
> not benchmarked those yet.
>
> With respect to being a single column, I agree. The second bench that
> you saw is still a single column (of integers): I wanted to check
> whether the cost was the allocation of the strings or the elements of
> the rows (the speedup is equivalent).
>
> However, I pushed a new bench where we are reading 6 columns [string,
> bool, int, string, string, string|null]; the speedup is 5x for mz-avro
> and 4x for avro-rs on my machine @ 2^20 rows (pushed latest code to main
> [1]).
> [image: avro_read_mixed.png]
>
> Wrt to row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows,
> internally they read a whole compressed serialized block into memory,
> decompress it, and then deserialize item by item into a row ("read block
> -> decompress block -> decode item by item into rows -> read next
> block"). Avro is based on batches of rows (blocks) that are compressed
> individually (similar to parquet pages, but all column chunks are
> serialized in a single page within a row group).
>
> In this context, my thinking of Arrow vs Vec<Native> is that once loaded
> in memory, a block behaves like a serialized blob that we can
> deserialize to any in-memory format according to some rules.
>
> My hypothesis (we can bench this) is that if the user wants to perform
> any compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where
> we can't leverage SIMD/cache hits/must allocate and deallocate on every
> item.
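The contrast between per-item allocation and a single growable region can be sketched with an Arrow-like string layout (one contiguous values buffer plus an offsets buffer); `StringColumn` is a hypothetical type for illustration, not the arrow2 API:

```rust
// Sketch of an Arrow-style variable-size string layout: all bytes live in
// one values buffer, delimited by an offsets buffer, so appending an item
// does not allocate per item (Vec grows the region exponentially).
struct StringColumn {
    offsets: Vec<i32>, // offsets[i]..offsets[i + 1] delimits item i
    values: Vec<u8>,   // single contiguous region for all items
}

impl StringColumn {
    fn new() -> Self {
        StringColumn { offsets: vec![0], values: Vec::new() }
    }

    /// Append one string: a memcpy into the shared buffer, no per-item box.
    fn push(&mut self, item: &str) {
        self.values.extend_from_slice(item.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }

    /// Borrow item `i` back out of the shared buffer.
    fn get(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).unwrap()
    }
}

fn main() {
    let mut col = StringColumn::new();
    for s in ["foo", "bar", "baz"] {
        col.push(s);
    }
    assert_eq!(col.get(1), "bar");
    assert_eq!(col.offsets, vec![0, 3, 6, 9]);
}
```

Compare this with `Vec<String>`, where every row is its own heap allocation; the flamegraph observation below (time spent allocating individual strings) is exactly that difference.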
> As usual, there are use-cases where this does not hold - I am thinking
> in terms of traditional ETL / CPU intensive stuff.
>
> My surprise is that even without the compute in mind, deserializing
> blocks to arrow is faster than I anticipated, and I wanted to check if
> someone went through this exercise before trying more exotic benches.
>
> Best,
> Jorge
>
> [1] https://github.com/dataEngineeringLabs/arrow2-benches
>
>
> On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jorge,
>>
>>> The results are a bit surprising: reading 2^20 rows of 3 byte strings
>>> is ~6x faster than the official Avro Rust implementation and ~20x
>>> faster vs "fastavro"
>>
>> This sentence is a little bit hard to parse. Is it a row of 3 strings
>> or a row of 1 string consisting of 3 bytes? Was the example hard-coded?
>> A lot of the complexity of parsing avro is the schema evolution rules.
>> I haven't looked at whether the canonical implementations do any
>> optimization for the happy case when reader and writer schema are the
>> same.
>>
>> There is a "Java Avro -> Arrow" implementation checked in, but it is
>> somewhat broken today (I filed an issue on this a while ago); it
>> delegates parsing to/from the Avro java library. I also think there
>> might be faster implementations that aren't the canonical
>> implementations (I seem to recall a JIT version for Java, for example,
>> and fastavro is another). For both Java and Python I'd imagine there
>> would be some decent speed improvements simply by avoiding the "boxing"
>> task of moving language primitive types to native memory.
>>
>> I was planning (and still might get to it sometime in 2022) to have a
>> C++ parser for Avro. Wes cross-posted this to the Avro mailing list
>> when I thought I had time to work on it a couple of years ago and I
>> don't recall any response to it. The Rust avro library, I believe, was
>> also just recently adopted/donated into the Apache Avro project.
>>
>> Avro seems to be pretty common, so having the ability to convert to and
>> from it is, I think, generally valuable.
>>
>> Cheers,
>> Micah
>>
>>
>> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com>
>> wrote:
>>
>>> Rust allows you to easily swap the global allocator to e.g. mimalloc
>>> or snmalloc, even without the library supporting changing the
>>> allocator. In my experience this indeed helps with allocation-heavy
>>> code (I have seen changes of up to 30%).
>>>
>>> Best regards,
>>>
>>> Daniël
>>>
>>>
>>> On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
>>>
>>>> Hi Jorge,
>>>>
>>>> Just an idea: do the Avro libs support different allocators? Maybe
>>>> using a different one (e.g. mimalloc) would yield more similar
>>>> results by working around the fragmentation you described.
>>>>
>>>> This wouldn't change the fact that they are relatively slow; however,
>>>> it could allow you a better apples-to-apples comparison and thus
>>>> better CPU profiling and understanding of the nuances.
>>>>
>>>> Best regards,
>>>> Adam Lippai
>>>>
>>>>
>>>> On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <
>>>> jorgecarlei...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am reporting back a conclusion that I recently arrived at when
>>>>> adding support for reading Avro to Arrow.
>>>>>
>>>>> Avro is a storage format that does not have an associated in-memory
>>>>> format. In Rust, the official implementation deserializes to an
>>>>> enum, in Python to a vector of Object, and I suspect in Java to an
>>>>> equivalent vector of objects. The important aspect is that all of
>>>>> them use fragmented memory regions (as opposed to what we do with
>>>>> e.g. one uint8 buffer for StringArray).
>>>>>
>>>>> I benchmarked reading to arrow vs reading via the official Avro
>>>>> implementations.
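The global-allocator swap Daniël describes above is a small change in Rust; a minimal sketch, assuming the third-party `mimalloc` crate is added as a dependency (`mimalloc = "0.1"` in Cargo.toml):

```rust
// Sketch: route all heap allocations in the binary through mimalloc,
// without any cooperation from the libraries being benchmarked.
// Assumes the third-party `mimalloc` crate as a dependency.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every allocation below (Vec, String, ...) now uses mimalloc.
    let v: Vec<String> = (0..4).map(|i| i.to_string()).collect();
    assert_eq!(v.len(), 4);
}
```

Because `#[global_allocator]` applies process-wide, this is a fair way to compare allocation-heavy libraries without patching them.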
>>>>> The results are a bit surprising: reading 2^20 rows of 3 byte
>>>>> strings is ~6x faster than the official Avro Rust implementation and
>>>>> ~20x faster vs "fastavro", a C implementation with bindings for
>>>>> Python (pip install fastavro), all with a different slope (see graph
>>>>> below or numbers and used code here [1]).
>>>>> [image: avro_read.png]
>>>>>
>>>>> I found this a bit surprising because we need to read row by row and
>>>>> perform a transpose of the data (from rows to columns), which is
>>>>> usually expensive. Furthermore, reading strings can't be that much
>>>>> optimized after all.
>>>>>
>>>>> To investigate the root cause, I drilled down to the flamegraphs for
>>>>> both the official avro rust implementation and the arrow2
>>>>> implementation: the majority of the time in the Avro implementation
>>>>> is spent allocating individual strings (to build the [str]
>>>>> equivalents); the majority of the time in arrow2 is equally divided
>>>>> between zigzag decoding (to get the length of the item), reallocs,
>>>>> and utf8 validation.
>>>>>
>>>>> My hypothesis is that the difference in performance is unrelated to
>>>>> a particular implementation of arrow or avro, but due to the general
>>>>> concept of reading to [str] vs arrow. Specifically, the item-by-item
>>>>> allocation strategy is far worse than what we do in Arrow with a
>>>>> single region which we reallocate from time to time with exponential
>>>>> growth. On some architectures we even benefit from the
>>>>> __memmove_avx_unaligned_erms instruction, which makes it even
>>>>> cheaper to reallocate.
>>>>>
>>>>> Has anyone else performed such benchmarks or played with Avro ->
>>>>> Arrow and found supporting / opposing findings to this hypothesis?
>>>>>
>>>>> If this hypothesis holds (e.g.
>>>>> with a similar result against the Java implementation of Avro), it
>>>>> imo puts arrow as a strong candidate for the default format of Avro
>>>>> implementations to deserialize into when using it in-memory, which
>>>>> could benefit both projects?
>>>>>
>>>>> Best,
>>>>> Jorge
>>>>>
>>>>> [1] https://github.com/DataEngineeringLabs/arrow2-benches