> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
Sorry for the late reply; it took me a bit to go through the relevant parts
of the Java implementation. I agree that the deserialization of items within
a block is done on a per-item basis, and can even re-use a previously
allocated item [1]. From what I can read, the blocks are still read into
memory as whole chunks via `nextRawBlock` [2]. I.e. even for row-oriented
processing, the stream is still composed of blocks that are first read into
memory and then deserialized row by row (and item by item within a row).

> Do you have a target system in mind? As I said, for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains.

I agree - I was thinking in terms of columnar query engines aiming at
leveraging SIMD and data locality.

> That being said, I'd love to see real world ETL pipeline benchmarks :)

Definitely. This was an educational exercise.

[1] https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java#L160
[2] https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213
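To make the pattern concrete, below is a minimal sketch in Rust of the shape
of such a reader. The types and the block framing are hypothetical (a toy
fixed-width header, no codec), not the actual avro-rs or Java API; the real
format uses zigzag varints and optional compression:

    use std::io::{self, Read};

    // Hypothetical type for illustration: a decompressed block holding
    // `row_count` rows serialized back to back.
    struct Block {
        row_count: usize,
        data: Vec<u8>,
    }

    // "read block -> decompress block": the whole block is materialized in
    // memory before any row is decoded. Toy header: two little-endian u64s
    // (row count, payload length) instead of Avro's zigzag varints.
    fn next_raw_block<R: Read>(reader: &mut R) -> io::Result<Option<Block>> {
        let mut header = [0u8; 16];
        if reader.read_exact(&mut header).is_err() {
            return Ok(None); // end of stream
        }
        let row_count = u64::from_le_bytes(header[..8].try_into().unwrap()) as usize;
        let len = u64::from_le_bytes(header[8..].try_into().unwrap()) as usize;
        let mut data = vec![0u8; len];
        reader.read_exact(&mut data)?;
        // a real reader would decompress `data` here (deflate, snappy, ...)
        Ok(Some(Block { row_count, data }))
    }

    // "decode item by item into rows": lazy, row-at-a-time deserialization
    // against the in-memory block. A row here is one length-prefixed string.
    fn rows(block: &Block) -> impl Iterator<Item = String> + '_ {
        let mut offset = 0;
        (0..block.row_count).map(move |_| {
            let len = block.data[offset] as usize; // toy 1-byte length prefix
            let bytes = block.data[offset + 1..offset + 1 + len].to_vec();
            offset += 1 + len;
            String::from_utf8(bytes).unwrap() // one heap allocation per row
        })
    }

    fn main() -> io::Result<()> {
        // a two-row block ("foo", "ab") preceded by the toy header
        let mut bytes = Vec::new();
        bytes.extend_from_slice(&2u64.to_le_bytes()); // row count
        bytes.extend_from_slice(&7u64.to_le_bytes()); // payload length
        bytes.extend_from_slice(&[3, b'f', b'o', b'o', 2, b'a', b'b']);
        let mut reader = bytes.as_slice();
        while let Some(block) = next_raw_block(&mut reader)? {
            for row in rows(&block) {
                println!("{row}");
            }
        }
        Ok(())
    }

The laziness is per row, but only within a block that has already been
materialized (and decompressed) in memory, which matches the `nextRawBlock`
behaviour linked above.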
On Tue, Nov 2, 2021 at 6:41 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows, internally
>> they read a whole compressed serialized block into memory, decompress it,
>> and then deserialize item by item into a row ("read block -> decompress
>> block -> decode item by item into rows -> read next block"). Avro is
>> based on batches of rows (blocks) that are compressed individually
>> (similar to parquet pages, but all column chunks are serialized in a
>> single page within a row group).
>
> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
>
>> My hypothesis (we can bench this) is that if the user wants to perform
>> any compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where
>> we can't leverage SIMD/cache hits/must allocate and deallocate on every
>> item. As usual, there are use-cases where this does not hold - I am
>> thinking in terms of traditional ETL / CPU intensive stuff.
>
> Do you have a target system in mind? As I said, for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains. There is also non-zero engineering effort to implement the
> necessary filter/selection push down APIs that most of them provide. That
> being said, I'd love to see real world ETL pipeline benchmarks :)
>
> On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
>
>> Thank you all for all your comments.
>>
>> To the first comments: thanks a lot for your suggestions. I tried with
>> mimalloc and there is indeed a -25% improvement for avro-rs. =)
>>
>>> This sentence is a little bit hard to parse. Is it a row of 3 strings
>>> or a row of 1 string consisting of 3 bytes? Was the example hard-coded?
>>> A lot of the complexity of parsing avro is the schema evolution rules;
>>> I haven't looked at whether the canonical implementations do any
>>> optimization for the happy case when reader and writer schema are the
>>> same.
>>
>> The graph was for a single column of a constant string of 3 bytes
>> ("foo"), divided into (avro) blocks of 4000 rows each (default block
>> size of 16kb). I also tried random strings of 3 bytes and 7 bytes, as
>> well as an integer column, and compressed blocks (deflate): with equal
>> speedups. Generic benchmarks are obviously catered for. I agree that
>> schema evolution adds extra CPU time, and that this is the happy case; I
>> have not benchmarked those yet.
>>
>> With respect to being a single column, I agree. The second bench that
>> you saw is still a single column (of integers): I wanted to check
>> whether the cost was the allocation of the strings, or the elements of
>> the rows (the speedup is equivalent).
>>
>> However, I pushed a new bench where we are reading 6 columns [string,
>> bool, int, string, string, string|null]; the speedup is 5x for mz-avro
>> and 4x for avro-rs on my machine @ 2^20 rows (pushed latest code to main
>> [1]).
>> [image: avro_read_mixed.png]
>>
>> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows,
>> internally they read a whole compressed serialized block into memory,
>> decompress it, and then deserialize item by item into a row ("read block
>> -> decompress block -> decode item by item into rows -> read next
>> block"). Avro is based on batches of rows (blocks) that are compressed
>> individually (similar to parquet pages, but all column chunks are
>> serialized in a single page within a row group).
>>
>> In this context, my thinking of Arrow vs Vec<Native> is that once loaded
>> in memory, a block behaves like a serialized blob that we can
>> deserialize to any in-memory format according to some rules.
>>
>> My hypothesis (we can bench this) is that if the user wants to perform
>> any compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where
>> we can't leverage SIMD/cache hits/must allocate and deallocate on every
>> item. As usual, there are use-cases where this does not hold - I am
>> thinking in terms of traditional ETL / CPU intensive stuff.
>>
>> My surprise is that even without the compute in mind, deserializing
>> blocks to arrow is faster than I anticipated, and I wanted to check if
>> someone had gone through this exercise before trying more exotic
>> benches.
>>
>> Best,
>> Jorge
>>
>> [1] https://github.com/dataEngineeringLabs/arrow2-benches
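As an aside on the Arrow vs Vec<Native> point quoted above, the allocation
contrast can be reduced to roughly the following (a simplified sketch, not
the arrow2 API; the second function mimics the offsets + values layout of an
Arrow StringArray):

    fn to_rows(items: &[&str]) -> Vec<String> {
        // row-native decoding: one heap allocation per item
        items.iter().map(|s| s.to_string()).collect()
    }

    fn to_arrow_like(items: &[&str]) -> (Vec<i32>, Vec<u8>) {
        // arrow-like decoding: all values land in one contiguous buffer;
        // `values` grows geometrically, so allocations amortize to O(1)
        // per item
        let mut offsets = Vec::with_capacity(items.len() + 1);
        let mut values = Vec::new();
        offsets.push(0i32);
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        (offsets, values) // the offsets + values buffers of a StringArray
    }

    fn main() {
        let items = ["foo", "bar", "baz"];
        let (offsets, values) = to_arrow_like(&items);
        assert_eq!(offsets, vec![0, 3, 6, 9]);
        assert_eq!(values, b"foobarbaz".to_vec());
        assert_eq!(to_rows(&items).len(), 3);
    }

The row-native path pays one allocator round trip per string; the columnar
path amortizes reallocations of a single region, which is where the
flamegraphs discussed further down diverge.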
>>> >>> There is a "Java Avro -> Arrow" implementation checked but it is somewhat >>> broken today (I filed an issue on this a while ago) that delegates >>> parsing >>> the t/from the Avro java library. I also think there might be faster >>> implementations that aren't the canonical implementations (I seem to >>> recall >>> a JIT version for java for example and fastavro is another). For both >>> Java >>> and Python I'd imagine there would be some decent speed improvements >>> simply >>> by avoiding the "boxing" task of moving language primitive types to >>> native >>> memory. >>> >>> I was planning (and still might get to it sometime in 2022) to have a C++ >>> parser for Avro. Wes cross-posted this to the Avro mailing list when I >>> thought I had time to work on it a couple of years ago and I don't recall >>> any response to it. The Rust avro library I believe was also just >>> recently >>> adopted/donated into the Apache Avro project. >>> >>> Avro seems to be pretty common so having the ability to convert to and >>> from >>> it is I think is generally valuable. >>> >>> Cheers, >>> Micah >>> >>> >>> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres <danielhe...@gmail.com> >>> wrote: >>> >>> > Rust allows to easily swap the global allocator to e.g. mimalloc or >>> > snmalloc, even without the library supporting to change the allocator. >>> In >>> > my experience this indeed helps with allocation heavy code (I have seen >>> > changes of up to 30%). >>> > >>> > Best regards, >>> > >>> > Daniël >>> > >>> > >>> > On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote: >>> > >>> > > Hi Jorge, >>> > > >>> > > Just an idea: Do the Avro libs support different allocators? Maybe >>> using >>> > a >>> > > different one (e.g. mimalloc) would yield more similar results by >>> working >>> > > around the fragmentation you described. >>> > > >>> > > This wouldn't change the fact that they are relatively slow, however >>> it >>> > > could allow you better apples to apples comparison thus better CPU >>> > > profiling and understanding of the nuances. >>> > > >>> > > Best regards, >>> > > Adam Lippai >>> > > >>> > > >>> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão < >>> > jorgecarlei...@gmail.com >>> > > > >>> > > wrote: >>> > > >>> > > > Hi, >>> > > > >>> > > > I am reporting back a conclusion that I recently arrived at when >>> adding >>> > > > support for reading Avro to Arrow. >>> > > > >>> > > > Avro is a storage format that does not have an associated in-memory >>> > > > format. In Rust, the official implementation deserializes an enum, >>> in >>> > > > Python to a vector of Object, and I suspect in Java to an >>> equivalent >>> > > vector >>> > > > of object. The important aspect is that all of them use fragmented >>> > memory >>> > > > regions (as opposed to what we do with e.g. one uint8 buffer for >>> > > > StringArray). >>> > > > >>> > > > I benchmarked reading to arrow vs reading via the official Avro >>> > > > implementations. The results are a bit surprising: reading 2^20 >>> rows >>> > of 3 >>> > > > byte strings is ~6x faster than the official Avro Rust >>> implementation >>> > and >>> > > > ~20x faster vs "fastavro", a C implementation with bindings for >>> Python >>> > > (pip >>> > > > install fastavro), all with a difference slope (see graph below or >>> > > numbers >>> > > > and used code here [1]). 
>>> > On Sun, Oct 31, 2021, 18:15 Adam Lippai <a...@rigo.sk> wrote:
>>> >
>>> > > Hi Jorge,
>>> > >
>>> > > Just an idea: do the Avro libs support different allocators? Maybe
>>> > > using a different one (e.g. mimalloc) would yield more similar
>>> > > results by working around the fragmentation you described.
>>> > >
>>> > > This wouldn't change the fact that they are relatively slow;
>>> > > however, it could allow a better apples-to-apples comparison and
>>> > > thus better CPU profiling and understanding of the nuances.
>>> > >
>>> > > Best regards,
>>> > > Adam Lippai
>>> > >
>>> > > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão
>>> > > <jorgecarlei...@gmail.com> wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > I am reporting back a conclusion that I recently arrived at when
>>> > > > adding support for reading Avro to Arrow.
>>> > > >
>>> > > > Avro is a storage format that does not have an associated
>>> > > > in-memory format. In Rust, the official implementation
>>> > > > deserializes to an enum, in Python to a vector of Object, and I
>>> > > > suspect in Java to an equivalent vector of Objects. The important
>>> > > > aspect is that all of them use fragmented memory regions (as
>>> > > > opposed to what we do with e.g. one uint8 buffer for
>>> > > > StringArray).
>>> > > >
>>> > > > I benchmarked reading to arrow vs reading via the official Avro
>>> > > > implementations. The results are a bit surprising: reading 2^20
>>> > > > rows of 3 byte strings is ~6x faster than the official Avro Rust
>>> > > > implementation and ~20x faster vs "fastavro", a C implementation
>>> > > > with bindings for Python (pip install fastavro), all with a
>>> > > > different slope (see graph below or numbers and used code here
>>> > > > [1]).
>>> > > > [image: avro_read.png]
>>> > > >
>>> > > > I found this a bit surprising because we need to read row by row
>>> > > > and perform a transpose of the data (from rows to columns), which
>>> > > > is usually expensive. Furthermore, reading strings can't be
>>> > > > optimized that much after all.
>>> > > >
>>> > > > To investigate the root cause, I drilled down to the flamegraphs
>>> > > > for both the official avro rust implementation and the arrow2
>>> > > > implementation: the majority of the time in the Avro
>>> > > > implementation is spent allocating individual strings (to build
>>> > > > the [str] equivalents); the majority of the time in arrow2 is
>>> > > > equally divided between zigzag decoding (to get the length of the
>>> > > > item), reallocs, and utf8 validation.
>>> > > >
>>> > > > My hypothesis is that the difference in performance is unrelated
>>> > > > to a particular implementation of arrow or avro, but due to the
>>> > > > general concept of reading to [str] vs arrow. Specifically, the
>>> > > > item-by-item allocation strategy is far worse than what we do in
>>> > > > Arrow with a single region which we reallocate from time to time
>>> > > > with exponential growth. In some architectures we even benefit
>>> > > > from the __memmove_avx_unaligned_erms instruction that makes it
>>> > > > even cheaper to reallocate.
>>> > > >
>>> > > > Has anyone else performed such benchmarks or played with Avro ->
>>> > > > Arrow and found supporting / opposing findings to this
>>> > > > hypothesis?
>>> > > >
>>> > > > If this hypothesis holds (e.g. with a similar result against the
>>> > > > Java implementation of Avro), it imo puts arrow as a strong
>>> > > > candidate for the default format for Avro implementations to
>>> > > > deserialize into when using it in-memory, which could benefit
>>> > > > both projects?
>>> > > >
>>> > > > Best,
>>> > > > Jorge
>>> > > >
>>> > > > [1] https://github.com/DataEngineeringLabs/arrow2-benches
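For reference, the "zigzag decoding" that shows up in the arrow2 flamegraph
refers to Avro's variable-length zigzag integers, used e.g. as string length
prefixes. A minimal sketch of the decoder (not arrow2's actual
implementation):

    // Decode one Avro long: little-endian base-128 varint, then zigzag.
    // Returns (value, bytes consumed), or None if the buffer ends
    // mid-varint or the varint is malformed.
    fn read_zigzag_long(buf: &[u8]) -> Option<(i64, usize)> {
        let mut unsigned: u64 = 0;
        let mut shift = 0u32;
        for (i, &byte) in buf.iter().enumerate() {
            unsigned |= u64::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                // undo zigzag: (n >> 1) ^ -(n & 1) maps 0,1,2,3,... to
                // 0,-1,1,-2,...
                let value = (unsigned >> 1) as i64 ^ -((unsigned & 1) as i64);
                return Some((value, i + 1));
            }
            shift += 7;
            if shift >= 64 {
                return None; // malformed: more than 10 continuation bytes
            }
        }
        None
    }

    fn main() {
        // the length prefix of "foo" (3) encodes to the single byte 0x06
        assert_eq!(read_zigzag_long(&[0x06]), Some((3, 1)));
        assert_eq!(read_zigzag_long(&[0x05]), Some((-3, 1)));
    }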