Re: Synergies with Apache Avro?

2021-11-16 Thread Jorge Cardoso Leitão
>
> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).


Sorry for the late reply; it took me a bit to go through the relevant parts
of the Java implementation. I agree that the deserialization of items
within a block is done on a per-item basis, and can even re-use a
previously allocated item [1]. From what I can read, though, the blocks are
still read into memory as whole chunks via `nextRawBlock` [2]. I.e. even
from a row-oriented processing perspective, the stream is still composed of
blocks that are first read into memory and then deserialized row by row
(and item by item within a row).
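
To make the shape of that loop concrete, here is a minimal self-contained
Rust sketch of the pipeline as I read it. The framing, helpers and row
format below are hypothetical stand-ins for illustration only, not the real
Avro format nor any real avro-rs/Java API:

    // Toy pipeline: a block is materialized in memory as a whole chunk,
    // "decompressed", and only then decoded item by item, row by row.

    fn decompress(compressed: &[u8]) -> Vec<u8> {
        compressed.to_vec() // stand-in for e.g. deflate
    }

    fn decode_row(buf: &[u8], pos: &mut usize) -> Vec<u8> {
        // toy row = 1 length byte + payload (real Avro uses zigzag varints)
        let len = buf[*pos] as usize;
        *pos += 1;
        let row = buf[*pos..*pos + len].to_vec();
        *pos += len;
        row
    }

    fn main() {
        // one "block" of 3 rows, each the 3-byte string "foo"
        let block = vec![3, b'f', b'o', b'o', 3, b'f', b'o', b'o', 3, b'f', b'o', b'o'];
        let row_count = 3;

        // 1. read the whole block into memory (the analogue of `nextRawBlock` [2])
        let decompressed = decompress(&block);

        // 2. decode row by row; a reader may re-use a previously allocated
        //    row here, as DataFileReader12 does in Java [1]
        let mut pos = 0;
        for _ in 0..row_count {
            let row = decode_row(&decompressed, &mut pos);
            assert_eq!(row, b"foo");
        }
    }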

> Do you have a target system in mind?  As I said for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any gains.
>

I agree - I was thinking in terms of columnar query engines aiming at
leveraging simd and data locality.

> That being said, I'd love to see real world ETL pipeline benchmarks :)
>

Definitely. This was an educational exercise.

[1]
https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java#L160
[2]
https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213


On Tue, Nov 2, 2021 at 6:41 PM Micah Kornfield 
wrote:

>> Wrt to row iterations and native rows: my understanding is that even
>> though most Avro APIs present themselves as iterators of rows, internally
>> they read a whole compressed serialized block into memory, decompress it,
>> and then deserialize item by item into a row ("read block -> decompress
>> block -> decode item by item into rows -> read next block"). Avro is based
>> on batches of rows (blocks) that are compressed individually (similar to
>> parquet pages, but all column chunks are serialized in a single page within
>> a row group).
>
>
> I haven't looked at it for a while but my recollection, at least in Java,
> is a streaming process for each step outlined rather than a batch process
> (i.e. decompress some bytes, then decode them lazily as "Next Row" is
> called).
>
>> My hypothesis (we can bench this) is that if the user wants to perform any
>> compute over the data, it is advantageous to load the block to arrow
>> (decompressed block -> RecordBatch), benefiting from arrow's analytics
>> performance instead, as opposed to using a native row-based format where we
>> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
>> As usual, there are use-cases where this does not hold - I am thinking in
>> terms of traditional ETL / CPU intensive stuff.
>
>
> Do you have a target system in mind?  As I said for columnar/arrow native
> query engines this obviously sounds like a win, but for row oriented
> processing engines, the transposition costs are going to eat into any
> gains. There is also non-zero engineering effort to implement the necessary
> filter/selection push down APIs that most of them provide.  That being
> said, I'd love to see real world ETL pipeline benchmarks :)
>
>
> On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> Thank you all for all your comments.
>>
>> Regarding the first comments: thanks a lot for your suggestions. I tried
>> with mimalloc and there is indeed a -25% improvement for avro-rs. =)
>>
>>> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
>>> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A
>>> lot
>>> of the complexity of parsing avro is the schema evolution rules, I
>>> haven't
>>> looked at whether the canonical implementations do any optimization for
>>> the
>>> happy case when reader and writer schema are the same.
>>>
>>
>> The graph was for a single column of constant 3-byte strings ("foo"),
>> divided into (avro) blocks of 4000 rows each (the default block size of
>> 16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
>> integer column, and compressed blocks (deflate), all with equal speedups.
>> Generic benchmarks like these are obviously tailored. I agree that schema
>> evolution adds extra CPU time, and that this is the happy case; I have not
>> benchmarked those yet.
>>
>> With respect to being a single column, I agree. The second bench that you
>> saw is still a single column (of integers): I wanted to check whether the
>> cost was the allocation of the strings, or the elements of the rows (the
>> speedup is equivalent).
>>
>> However, I pushed a new bench where we are reading 6 columns [string,
>> bool, int, string, string, string|null], speedup is 5x for mz-avro and 4x
>> for avro-rs on my machine @ 2^20 rows (pushed latest code to main [1]).
>> [image: avro_read_mixed.png]

Re: Synergies with Apache Avro?

2021-11-02 Thread Micah Kornfield
>
> Wrt to row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows, internally
> they read a whole compressed serialized block into memory, decompress it,
> and then deserialize item by item into a row ("read block -> decompress
> block -> decode item by item into rows -> read next block"). Avro is based
> on batches of rows (blocks) that are compressed individually (similar to
> parquet pages, but all column chunks are serialized in a single page within
> a row group).


I haven't looked at it for a while but my recollection, at least in Java,
is a streaming process for each step outlined rather than a batch process
(i.e. decompress some bytes, then decode them lazily as "Next Row" is
called).

> My hypothesis (we can bench this) is that if the user wants to perform any
> compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where we
> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
> As usual, there are use-cases where this does not hold - I am thinking in
> terms of traditional ETL / CPU intensive stuff.


Do you have a target system in mind?  As I said for columnar/arrow native
query engines this obviously sounds like a win, but for row oriented
processing engines, the transposition costs are going to eat into any
gains. There is also non-zero engineering effort to implement the necessary
filter/selection push down APIs that most of them provide.  That being
said, I'd love to see real world ETL pipeline benchmarks :)


On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thank you all for all your comments.
>
> Regarding the first comments: thanks a lot for your suggestions. I tried
> with mimalloc and there is indeed a -25% improvement for avro-rs. =)
>
>> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
>> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
>> of the complexity of parsing avro is the schema evolution rules, I haven't
>> looked at whether the canonical implementations do any optimization for
>> the
>> happy case when reader and writer schema are the same.
>>
>
> The graph was for a single column of constant 3-byte strings ("foo"),
> divided into (avro) blocks of 4000 rows each (the default block size of
> 16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
> integer column, and compressed blocks (deflate), all with equal speedups.
> Generic benchmarks like these are obviously tailored. I agree that schema
> evolution adds extra CPU time, and that this is the happy case; I have not
> benchmarked those yet.
>
> With respect to being a single column, I agree. The second bench that you
> saw is still a single column (of integers): I wanted to check whether the
> cost was the allocation of the strings, or the elements of the rows (the
> speedup is equivalent).
>
> However, I pushed a new bench where we are reading 6 columns [string,
> bool, int, string, string, string|null], speedup is 5x for mz-avro and 4x
> for avro-rs on my machine @ 2^20 rows (pushed latest code to main [1]).
> [image: avro_read_mixed.png]
>
> Wrt to row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows, internally
> they read a whole compressed serialized block into memory, decompress it,
> and then deserialize item by item into a row ("read block -> decompress
> block -> decode item by item into rows -> read next block"). Avro is based
> on batches of rows (blocks) that are compressed individually (similar to
> parquet pages, but all column chunks are serialized in a single page within
> a row group).
>
> In this context, my thinking of Arrow vs Vec is that once loaded
> in memory, a block behaves like a serialized blob that we can deserialize
> to any in-memory format according to some rules.
>
>> My hypothesis (we can bench this) is that if the user wants to perform any
> compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where we
> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
> As usual, there are use-cases where this does not hold - I am thinking in
> terms of traditional ETL / CPU intensive stuff.
>
> My surprise is that even without the compute in mind, deserializing blocks
> to arrow is faster than I anticipated, and I wanted to check if someone went
> through this exercise before trying more exotic benches.
>
> Best,
> Jorge
>
> [1] https://github.com/dataEngineeringLabs/arrow2-benches
>
>
> On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield 
> wrote:
>
>> Hi Jorge,
>>
>> > The results are a bit surprising: reading 2^20 rows of 3 byte strings is
>> > ~6x faster than the official Avro Rust implementation and ~20x faster vs
>> > "fastavro"

Re: Synergies with Apache Avro?

2021-11-02 Thread Jorge Cardoso Leitão
Thank you all for all your comments.

Regarding the first comments: thanks a lot for your suggestions. I tried
with mimalloc and there is indeed a -25% improvement for avro-rs. =)

> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>

The graph was for a single column of constant 3-byte strings ("foo"),
divided into (avro) blocks of 4000 rows each (the default block size of
16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
integer column, and compressed blocks (deflate), all with equal speedups.
Generic benchmarks like these are obviously tailored. I agree that schema
evolution adds extra CPU time, and that this is the happy case; I have not
benchmarked those yet.
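
For reference, the data generation looks roughly like the following sketch
against the avro-rs API (the schema literal and row count match the setup
above; treat the exact calls as an approximation of that crate's API rather
than a verified snippet):

    use avro_rs::{types::Record, Codec, Schema, Writer};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // single string column, constant 3-byte value "foo"
        let schema = Schema::parse_str(
            r#"{"type": "record", "name": "bench",
                "fields": [{"name": "a", "type": "string"}]}"#,
        )?;

        // blocks are compressed individually: Codec::Null for the
        // uncompressed runs, Codec::Deflate for the deflate runs
        let mut writer = Writer::with_codec(&schema, Vec::new(), Codec::Deflate);

        for _ in 0..(1u32 << 20) {
            let mut record = Record::new(writer.schema()).expect("valid schema");
            record.put("a", "foo");
            writer.append(record)?; // a block is flushed whenever it fills up
        }

        let bytes = writer.into_inner()?;
        println!("wrote {} bytes", bytes.len());
        Ok(())
    }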

With respect to being a single column, I agree. The second bench that you
saw is still a single column (of integers): I wanted to check whether the
cost was the allocation of the strings, or the elements of the rows (the
speedup is equivalent).

However, I pushed a new bench where we are reading 6 columns [string, bool,
int, string, string, string|null], speedup is 5x for mz-avro and 4x for
avro-rs on my machine @ 2^20 rows (pushed latest code to main [1]).
[image: avro_read_mixed.png]

Wrt to row iterations and native rows: my understanding is that even though
most Avro APIs present themselves as iterators of rows, internally they
read a whole compressed serialized block into memory, decompress it, and
then deserialize item by item into a row ("read block -> decompress block
-> decode item by item into rows -> read next block"). Avro is based on
batches of rows (blocks) that are compressed individually (similar to
parquet pages, but all column chunks are serialized in a single page within
a row group).

In this context, my thinking of Arrow vs Vec is that once loaded in
memory, a block behaves like a serialized blob that we can deserialize to
any in-memory format according to some rules.

My hypothesis (we can bench this) is that if the user wants to perform any
compute over the data, it is advantageous to load the block to arrow
(decompressed block -> RecordBatch), benefiting from arrow's analytics
performance instead, as opposed to using a native row-based format where we
can't leverage SIMD/cache hits/must allocate and deallocate on every item.
As usual, there are use-cases where this does not hold - I am thinking in
terms of traditional ETL / CPU intensive stuff.
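
To make the "decompressed block -> RecordBatch" step concrete without tying
it to a specific arrow implementation: the transpose is essentially pushing
each decoded value into per-column buffers that already follow Arrow's
layout. A minimal sketch (the decoded rows are hypothetical; real code
would also track validity bitmaps and wrap the buffers in arrow arrays):

    // Arrow-style string column: one contiguous values buffer plus offsets,
    // instead of one heap allocation per row.
    struct StringColumn {
        offsets: Vec<i32>, // offsets.len() == rows + 1
        values: Vec<u8>,   // all bytes, back to back
    }

    impl StringColumn {
        fn new() -> Self {
            Self { offsets: vec![0], values: Vec::new() }
        }
        fn push(&mut self, s: &str) {
            self.values.extend_from_slice(s.as_bytes()); // amortized growth
            self.offsets.push(self.values.len() as i32);
        }
    }

    fn main() {
        // stand-in for "decode item by item" over a decompressed block
        let decoded_rows = [("foo", 1i32), ("bar", 2), ("baz", 3)];

        let mut strings = StringColumn::new();
        let mut ints: Vec<i32> = Vec::new();

        // the transpose: rows in, columns out
        for (s, i) in decoded_rows {
            strings.push(s);
            ints.push(i);
        }

        assert_eq!(strings.offsets, vec![0, 3, 6, 9]);
        assert_eq!(ints, vec![1, 2, 3]);
        // `strings` and `ints` now hold exactly the buffers a RecordBatch
        // would wrap (modulo validity bitmaps).
    }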

My surprise is that even without the compute in mind, deserializing blocks
to arrow is faster than I anticipated, and I wanted to check if someone went
through this exercise before trying more exotic benches.

Best,
Jorge

[1] https://github.com/dataEngineeringLabs/arrow2-benches


On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield 
wrote:

> Hi Jorge,
>
> > The results are a bit surprising: reading 2^20 rows of 3 byte strings is
> > ~6x faster than the official Avro Rust implementation and ~20x faster vs
> > "fastavro"
>
>
> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>
> There is a "Java Avro -> Arrow" implementation checked but it is somewhat
> broken today (I filed an issue on this a while ago) that delegates parsing
> the t/from the Avro java library.  I also think there might be faster
> implementations that aren't the canonical implementations (I seem to recall
> a JIT version for java for example and fastavro is another).  For both Java
> and Python I'd imagine there would be some decent speed improvements simply
> by avoiding the "boxing" task of moving language primitive types to native
> memory.
>
> I was planning (and still might get to it sometime in 2022) to have a C++
> parser for Avro.  Wes cross-posted this to the Avro mailing list when I
> thought I had time to work on it a couple of years ago and I don't recall
> any response to it.  The Rust avro library I believe was also just recently
> adopted/donated into the Apache Avro project.
>
> Avro seems to be pretty common so having the ability to convert to and from
> it is, I think, generally valuable.
>
> Cheers,
> Micah
>
>
> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres 
> wrote:
>
> > Rust allows you to easily swap the global allocator to e.g. mimalloc or
> > snmalloc, even without the library supporting changing the allocator. In
> > my experience this indeed helps with allocation-heavy code (I have seen
> > changes of up to 30%).
> >
> > Best regards,
> >
> > Daniël
> >
> >
> > On Sun, Oct 31, 2021, 18:15 Adam Lippai  wrote:

Re: Synergies with Apache Avro?

2021-11-01 Thread Micah Kornfield
Hi Ismaël,

Apologies for the double post.

> Avro is quite conservative about new features but we have support for
> experimental features [2] so backing the format with Arrow could be
> one. The only issue I see from the Java side is introducing the Arrow
> dependencies.

I think reducing dependencies is a good goal.  Arrow's Java integration
with Avro [1] lives in a separate module and hooks into lower level Avro
APIs.  If there is interest in experimentation it would be great to get
this library into a better state (and if there is interest in long term
maintainership in the Avro community, I for one would be happy to help
facilitate this).

[1]
https://github.com/apache/arrow/tree/master/java/adapter/avro/src/main/java/org/apache/arrow

On Mon, Nov 1, 2021 at 7:37 PM Micah Kornfield 
wrote:

>> I am in awe that the 'extra
>> step' of moving from a row to columnar in memory representation has so
>> little overhead, or maybe we can only discover this with more complex
>> schemas.
>
>
> I read Jorge's original e-mail too quickly and didn't realize there were
> links to the benchmarks attached.  It looks like the benchmarks have been
> updated to have a string and int column (before there was only a string
> column populated with "foo", did I get that right Jorge?).  This raises
> two points:
> 1.  The initial test really was more column->column rather than
> row->column (but again apologies if I misread).  I think this is still a
> good result with regards to memory allocation, and I can imagine
> the transposition to not necessarily be too expensive.
>
> 2.  While Avro->Arrow might yield faster parsing we should be careful to
> benchmark how consumers are going to use APIs we provide.  I imagine for
> DataFusion this would be a net win to have a native Avro->Arrow parser.
> But for consumers that require row based iteration, we need to ensure an
> optimized path from Arrow->Native language bindings as well.  As an
> example, my team at work recently benchmarked two scenarios: 1.  Parsing to
> python dicts per row using fastavro.  2.  Parsing to Arrow and then
> converting to python dicts.  We found that for primitive type data, #1 was
> actually faster than #2.  I think a large component of this is having to go
> through Arrow C++'s Scalar objects first, which I'm working on addressing,
> but it is a consideration for how and what APIs are potentially exposed.
>
> As I said before, I'm in favor of seeing transformers/parsers that go from
> Avro to Arrow, regardless of any performance wins.  Performance wins would
> certainly be a nice benefit :)
>
> Cheers,
> Micah
>
> On Monday, November 1, 2021, Ismaël Mejía  wrote:
>
>> +d...@avro.apache.org
>>
>> Hello,
>>
>> Adding dev@avro for awareness.
>>
>> Thanks Jorge for exploring/reporting this. This is an exciting
>> development. I am not aware of any work in the Avro side on
>> optimizations of in-memory representation, so any improvements there
>> could be great. (The comment by Micah about boxing for Java is
>> definitely one, and there could be more). I am in awe that the 'extra
>> step' of moving from a row to columnar in memory representation has so
>> little overhead, or maybe we can only discover this with more complex
>> schemas.
>>
>> The Java implementation serializes to an array of Objects [1] (like
>> Python). Any needed changes to support a different in-memory
>> representation should be reasonably easy to plug in; this should be an
>> internal detail that hopefully is not leaking through the user APIs.
>> Avro is quite conservative about new features but we have support for
>> experimental features [2] so backing the format with Arrow could be
>> one. The only issue I see from the Java side is introducing the Arrow
>> dependencies. Avro has fought a long battle to get rid of most of the
>> dependencies to simplify downstream use.
>>
>> For Rust, since the Rust APIs are not yet considered stable and
>> dependencies could be less of an issue I suppose we have 'carte
>> blanche' to back it internally with Arrow, especially if it brings
>> performance advantages.
>>
>> There are some benchmarks of a Python version backed by the Rust
>> implementation that are faster than fastavro [3], so we could be onto
>> something. Note that the python version on Apache is really slow
>> because it is pure python, but having a version backed by the rust one
>> (and the Arrow in-memory improvements) could be a nice project,
>> especially if improved by Arrow.
>>
>> Ismaël
>>
>> [1]
>> https://github.com/apache/avro/blob/a1fce29d9675b4dd95dfee9db32cc505d0b2227c/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L223
>> [2]
>> https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
>> [3]
>> https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>>
>>
>>
>> On Mon, Nov 1, 2021 at 3:36 AM Micah Kornfield 
>> wrote:
>> >
>> > Hi Jorge,
>> >>
>> >> The results are a bit surprising: reading 2^20 rows of 3 byte strings is
>> >> ~6x faster than the official Avro Rust implementation and ~20x faster vs
>> >> "fastavro"

Re: Synergies with Apache Avro?

2021-11-01 Thread Micah Kornfield
>
> I am in awe that the 'extra
> step' of moving from a row to columnar in memory representation has so
> little overhead, or maybe we can only discover this with more complex
> schemas.


I read Jorge's original e-mail too quickly and didn't realize there were
links to the benchmarks attached.  It looks like the benchmarks have been
updated to have a string and int column (before there was only a string
column populated with "foo", did I get that right Jorge?).  This raises
two points:
1.  The initial test really was more column->column rather than row->column
(but again apologies if I misread).  I think this is still a good result
with regards to memory allocation, and I can imagine the transposition to
not necessarily be too expensive.

2.  While Avro->Arrow might yield faster parsing we should be careful to
benchmark how consumers are going to use APIs we provide.  I imagine for
DataFusion this would be a net win to have a native Avro->Arrow parser.
But for consumers that require row based iteration, we need to ensure an
optimized path from Arrow->Native language bindings as well.  As an
example, my team at work recently benchmarked two scenarios: 1.  Parsing to
python dicts per row using fastavro.  2.  Parsing to Arrow and then
converting to python dicts.  We found that for primitive type data, #1 was
actually faster than #2.  I think a large component of this is having to go
through Arrow C++'s Scalar objects first, which I'm working on addressing,
but it is a consideration for how and what APIs are potentially exposed.

As I said before, I'm in favor of seeing transformers/parsers that go from
Avro to Arrow, regardless of any performance wins.  Performance wins would
certainly be a nice benefit :)

Cheers,
Micah

On Monday, November 1, 2021, Ismaël Mejía  wrote:

> +d...@avro.apache.org
>
> Hello,
>
> Adding dev@avro for awareness.
>
> Thanks Jorge for exploring/reporting this. This is an exciting
> development. I am not aware of any work in the Avro side on
> optimizations of in-memory representation, so any improvements there
> could be great. (The comment by Micah about boxing for Java is
> definitely one, and there could be more). I am in awe that the 'extra
> step' of moving from a row to columnar in memory representation has so
> little overhead, or maybe we can only discover this with more complex
> schemas.
>
> The Java implementation serializes to an array of Objects [1] (like
> Python). Any needed changes to support a different in-memory
> representation should be reasonably easy to plug in; this should be an
> internal detail that hopefully is not leaking through the user APIs.
> Avro is quite conservative about new features but we have support for
> experimental features [2] so backing the format with Arrow could be
> one. The only issue I see from the Java side is introducing the Arrow
> dependencies. Avro has fought a long battle to get rid of most of the
> dependencies to simplify downstream use.
>
> For Rust, since the Rust APIs are not yet considered stable and
> dependencies could be less of an issue I suppose we have 'carte
> blanche' to back it internally with Arrow, especially if it brings
> performance advantages.
>
> There are some benchmarks of a Python version backed by the Rust
> implementation that are faster than fastavro [3], so we could be onto
> something. Note that the python version on Apache is really slow
> because it is pure python, but having a version backed by the rust one
> (and the Arrow in-memory improvements) could be a nice project,
> especially if improved by Arrow.
>
> Ismaël
>
> [1]
> https://github.com/apache/avro/blob/a1fce29d9675b4dd95dfee9db32cc505d0b2227c/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L223
> [2]
> https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
> [3]
> https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf
>
>
>
> On Mon, Nov 1, 2021 at 3:36 AM Micah Kornfield 
> wrote:
> >
> > Hi Jorge,
> >>
> >> The results are a bit surprising: reading 2^20 rows of 3 byte strings
> >> is ~6x faster than the official Avro Rust implementation and ~20x faster
> >> vs "fastavro"
> >
> >
> > This sentence is a little bit hard to parse.  Is a row of 3 strings or a
> > row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> > of the complexity of parsing avro is the schema evolution rules, I haven't
> > looked at whether the canonical implementations do any optimization for the
> > happy case when reader and writer schema are the same.
> >
> > There is a "Java Avro -> Arrow" implementation checked but it is
> somewhat broken today (I filed an issue on this a while ago) that delegates
> parsing the t/from the Avro java library.  I also think there might be
> faster implementations that aren't the canonical implementations (I seem to
> recall a JIT version for java for example and fastavro is another).  For
> both Java and Python I'd i

Re: Synergies with Apache Avro?

2021-11-01 Thread Ismaël Mejía
+d...@avro.apache.org

Hello,

Adding dev@avro for awareness.

Thanks Jorge for exploring/reporting this. This is an exciting
development. I am not aware of any work in the Avro side on
optimizations of in-memory representation, so any improvements there
could be great. (The comment by Micah about boxing for Java is
definitely one, and there could be more). I am in awe that the 'extra
step' of moving from a row to columnar in memory representation has so
little overhead, or maybe we can only discover this with more complex
schemas.

The Java implementation serializes to an array of Objects [1] (like
Python). Any needed changes to support a different in-memory
representation should be reasonably easy to plug in; this should be an
internal detail that hopefully is not leaking through the user APIs.
Avro is quite conservative about new features but we have support for
experimental features [2] so backing the format with Arrow could be
one. The only issue I see from the Java side is introducing the Arrow
dependencies. Avro has fought a long battle to get rid of most of the
dependencies to simplify downstream use.

For Rust, since the Rust APIs are not yet considered stable and
dependencies could be less of an issue I suppose we have 'carte
blanche' to back it internally with Arrow, especially if it brings
performance advantages.

There are some benchmarks of a Python version backed by the Rust
implementation that are faster than fastavro [3], so we could be onto
something. Note that the python version on Apache is really slow
because it is pure python, but having a version backed by the rust one
(and the Arrow in-memory improvements) could be a nice project,
especially if improved by Arrow.

Ismaël

[1] 
https://github.com/apache/avro/blob/a1fce29d9675b4dd95dfee9db32cc505d0b2227c/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L223
[2] 
https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
[3] 
https://ep2018.europython.eu/media/conference/slides/how-to-write-rust-instead-of-c-and-get-away-with-it-yes-its-a-python-talk.pdf



On Mon, Nov 1, 2021 at 3:36 AM Micah Kornfield  wrote:
>
> Hi Jorge,
>>
>> The results are a bit surprising: reading 2^20 rows of 3 byte strings is ~6x 
>> faster than the official Avro Rust implementation and ~20x faster vs 
>> "fastavro"
>
>
> This sentence is a little bit hard to parse.  Is a row of 3 strings or a row 
> of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot of the 
> complexity of parsing avro is the schema evolution rules, I haven't looked at 
> whether the canonical implementations do any optimization for the happy case 
> when reader and writer schema are the same.
>
> There is a "Java Avro -> Arrow" implementation checked but it is somewhat 
> broken today (I filed an issue on this a while ago) that delegates parsing 
> the t/from the Avro java library.  I also think there might be faster 
> implementations that aren't the canonical implementations (I seem to recall a 
> JIT version for java for example and fastavro is another).  For both Java and 
> Python I'd imagine there would be some decent speed improvements simply by 
> avoiding the "boxing" task of moving language primitive types to native 
> memory.
>
> I was planning (and still might get to it sometime in 2022) to have a C++ 
> parser for Avro.  Wes cross-posted this to the Avro mailing list when I 
> thought I had time to work on it a couple of years ago and I don't recall any 
> response to it.  The Rust avro library I believe was also just recently 
> adopted/donated into the Apache Avro project.
>
> Avro seems to be pretty common so having the ability to convert to and from
> it is, I think, generally valuable.
>
> Cheers,
> Micah
>
>
> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres  wrote:
>>
>> Rust allows you to easily swap the global allocator to e.g. mimalloc or
>> snmalloc, even without the library supporting changing the allocator. In
>> my experience this indeed helps with allocation-heavy code (I have seen
>> changes of up to 30%).
>>
>> Best regards,
>>
>> Daniël
>>
>>
>> On Sun, Oct 31, 2021, 18:15 Adam Lippai  wrote:
>>
>> > Hi Jorge,
>> >
>> > Just an idea: Do the Avro libs support different allocators? Maybe using a
>> > different one (e.g. mimalloc) would yield more similar results by working
>> > around the fragmentation you described.
>> >
>> > This wouldn't change the fact that they are relatively slow; however, it
>> > could allow you a better apples-to-apples comparison and thus better CPU
>> > profiling and understanding of the nuances.
>> >
>> > Best regards,
>> > Adam Lippai
>> >
>> >
>> > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão wrote:
>> >
>> > > Hi,
>> > >
>> > > I am reporting back a conclusion that I recently arrived at when adding
>> > > support for reading Avro to Arrow.
>> > >
>> > > Avro is a storage format that does not have an associated in-memory
>> > > format. In Rust, the official implementation deserializes into an enum,
>> > > in Python into a vector of Object, and I suspect in Java into an
>> > > equivalent vector of object.

Re: Synergies with Apache Avro?

2021-10-31 Thread Micah Kornfield
Hi Jorge,

> The results are a bit surprising: reading 2^20 rows of 3 byte strings is
> ~6x faster than the official Avro Rust implementation and ~20x faster vs
> "fastavro"


This sentence is a little bit hard to parse.  Is a row of 3 strings or a
row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
of the complexity of parsing avro is the schema evolution rules, I haven't
looked at whether the canonical implementations do any optimization for the
happy case when reader and writer schema are the same.

There is a "Java Avro -> Arrow" implementation checked but it is somewhat
broken today (I filed an issue on this a while ago) that delegates parsing
the t/from the Avro java library.  I also think there might be faster
implementations that aren't the canonical implementations (I seem to recall
a JIT version for java for example and fastavro is another).  For both Java
and Python I'd imagine there would be some decent speed improvements simply
by avoiding the "boxing" task of moving language primitive types to native
memory.

I was planning (and still might get to it sometime in 2022) to have a C++
parser for Avro.  Wes cross-posted this to the Avro mailing list when I
thought I had time to work on it a couple of years ago and I don't recall
any response to it.  The Rust avro library I believe was also just recently
adopted/donated into the Apache Avro project.

Avro seems to be pretty common so having the ability to convert to and from
it is, I think, generally valuable.

Cheers,
Micah


On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres  wrote:

> Rust allows you to easily swap the global allocator to e.g. mimalloc or
> snmalloc, even without the library supporting changing the allocator. In
> my experience this indeed helps with allocation-heavy code (I have seen
> changes of up to 30%).
>
> Best regards,
>
> Daniël
>
>
> On Sun, Oct 31, 2021, 18:15 Adam Lippai  wrote:
>
> > Hi Jorge,
> >
> > Just an idea: Do the Avro libs support different allocators? Maybe using a
> > different one (e.g. mimalloc) would yield more similar results by working
> > around the fragmentation you described.
> >
> > This wouldn't change the fact that they are relatively slow; however, it
> > could allow you a better apples-to-apples comparison and thus better CPU
> > profiling and understanding of the nuances.
> >
> > Best regards,
> > Adam Lippai
> >
> >
> > On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am reporting back a conclusion that I recently arrived at when adding
> > > support for reading Avro to Arrow.
> > >
> > > Avro is a storage format that does not have an associated in-memory
> > > format. In Rust, the official implementation deserializes into an enum,
> > > in Python into a vector of Object, and I suspect in Java into an
> > > equivalent vector of object. The important aspect is that all of them
> > > use fragmented memory regions (as opposed to what we do with e.g. one
> > > uint8 buffer for StringArray).
> > >
> > > I benchmarked reading to arrow vs reading via the official Avro
> > > implementations. The results are a bit surprising: reading 2^20 rows of
> > > 3 byte strings is ~6x faster than the official Avro Rust implementation
> > > and ~20x faster vs "fastavro", a C implementation with bindings for
> > > Python (pip install fastavro), all with a different slope (see graph
> > > below or numbers and used code here [1]).
> > > [image: avro_read.png]
> > >
> > > I found this a bit surprising because we need to read row by row and
> > > perform a transpose of the data (from rows to columns), which is usually
> > > expensive. Furthermore, reading strings can't be optimized that much
> > > after all.
> > >
> > > To investigate the root cause, I drilled down to the flamegraphs for
> > > both the official avro rust implementation and the arrow2
> > > implementation: the majority of the time in the Avro implementation is
> > > spent allocating individual strings (to build the [str] - equivalents);
> > > the majority of the time in arrow2 is equally divided between zigzag
> > > decoding (to get the length of the item), reallocs, and utf8 validation.
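
For context, "zigzag decoding" above refers to Avro's variable-length
integer encoding, which prefixes each string with its length. A minimal
decoder, following the Avro specification rather than any particular
implementation:

    // Decode one Avro long: little-endian base-128 varint, then zigzag.
    // Error handling elided for brevity.
    fn decode_zigzag_long(buf: &[u8], pos: &mut usize) -> i64 {
        let mut raw: u64 = 0;
        let mut shift = 0;
        loop {
            let byte = buf[*pos];
            *pos += 1;
            raw |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        // zigzag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        ((raw >> 1) as i64) ^ -((raw & 1) as i64)
    }

    fn main() {
        // 0x06 is the zigzag varint for 3, e.g. the length of "foo"
        let mut pos = 0;
        assert_eq!(decode_zigzag_long(&[0x06], &mut pos), 3);
        let mut pos = 0;
        assert_eq!(decode_zigzag_long(&[0x01], &mut pos), -1);
    }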
> > >
> > > My hypothesis is that the difference in performance is unrelated to a
> > > particular implementation of arrow or avro, but to a general concept of
> > > reading to [str] vs arrow. Specifically, the item by item allocation
> > > strategy is far worse than what we do in Arrow with a single region
> > > which we reallocate from time to time with exponential growth. In some
> > > architectures we even benefit from the __memmove_avx_unaligned_erms
> > > instruction that makes it even cheaper to reallocate.
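
The two allocation strategies being compared reduce to the following plain
Rust sketch (no avro or arrow API involved; the sizes mirror the benchmark
above):

    fn main() {
        let n = 1 << 20;

        // row-oriented: the [str]-equivalent, one heap allocation per item,
        // fragmented across the heap
        let rows: Vec<String> = (0..n).map(|_| String::from("foo")).collect();

        // arrow-oriented: one contiguous region grown exponentially; the
        // allocator is hit only O(log n) times instead of 2^20 times
        let mut region: Vec<u8> = Vec::new();
        for _ in 0..n {
            region.extend_from_slice(b"foo");
        }

        assert_eq!(rows.len() * 3, region.len());
    }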
> > >
> > > Has anyone else performed such benchmarks or played with Avro -> Arrow
> > > and found supporting / opposing findings to this hypothesis?
> > >
> > > If this hypothesis holds (e.g. with a similar result against the Java
> > > implementation of Avro), it imo puts arrow as a strong candidate for
> > > the default format of Avro implementations to deserialize into when
> > > using it in-memory, which could benefit both projects?

Re: Synergies with Apache Avro?

2021-10-31 Thread Daniël Heres
Rust allows you to easily swap the global allocator to e.g. mimalloc or
snmalloc, even without the library supporting changing the allocator. In
my experience this indeed helps with allocation-heavy code (I have seen
changes of up to 30%).
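
For reference, the swap is a two-line change; a minimal sketch assuming the
mimalloc crate as the replacement (snmalloc works the same way via its own
crate):

    // Cargo.toml: mimalloc = "0.1" (version assumed)
    use mimalloc::MiMalloc;

    // every heap allocation in the binary now goes through mimalloc,
    // including those made by avro-rs -- no library support needed
    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;

    fn main() {
        let v: Vec<String> = (0..4).map(|i| i.to_string()).collect();
        println!("{:?}", v); // allocated via mimalloc
    }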

Best regards,

Daniël


On Sun, Oct 31, 2021, 18:15 Adam Lippai  wrote:

> Hi Jorge,
>
> Just an idea: Do the Avro libs support different allocators? Maybe using a
> different one (e.g. mimalloc) would yield more similar results by working
> around the fragmentation you described.
>
> This wouldn't change the fact that they are relatively slow; however, it
> could allow you a better apples-to-apples comparison and thus better CPU
> profiling and understanding of the nuances.
>
> Best regards,
> Adam Lippai
>
>
> On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão wrote:
>
> > Hi,
> >
> > I am reporting back a conclusion that I recently arrived at when adding
> > support for reading Avro to Arrow.
> >
> > Avro is a storage format that does not have an associated in-memory
> > format. In Rust, the official implementation deserializes into an enum,
> > in Python into a vector of Object, and I suspect in Java into an
> > equivalent vector of object. The important aspect is that all of them use
> > fragmented memory regions (as opposed to what we do with e.g. one uint8
> > buffer for StringArray).
> >
> > I benchmarked reading to arrow vs reading via the official Avro
> > implementations. The results are a bit surprising: reading 2^20 rows of 3
> > byte strings is ~6x faster than the official Avro Rust implementation and
> > ~20x faster vs "fastavro", a C implementation with bindings for Python
> > (pip install fastavro), all with a different slope (see graph below or
> > numbers and used code here [1]).
> > [image: avro_read.png]
> >
> > I found this a bit surprising because we need to read row by row and
> > perform a transpose of the data (from rows to columns), which is usually
> > expensive. Furthermore, reading strings can't be optimized that much
> > after all.
> >
> > To investigate the root cause, I drilled down to the flamegraphs for both
> > the official avro rust implementation and the arrow2 implementation: the
> > majority of the time in the Avro implementation is spent allocating
> > individual strings (to build the [str] - equivalents); the majority of
> > the time in arrow2 is equally divided between zigzag decoding (to get the
> > length of the item), reallocs, and utf8 validation.
> >
> > My hypothesis is that the difference in performance is unrelated to a
> > particular implementation of arrow or avro, but to a general concept of
> > reading to [str] vs arrow. Specifically, the item by item allocation
> > strategy is far worse than what we do in Arrow with a single region which
> > we reallocate from time to time with exponential growth. In some
> > architectures we even benefit from the __memmove_avx_unaligned_erms
> > instruction that makes it even cheaper to reallocate.
> >
> > Has anyone else performed such benchmarks or played with Avro -> Arrow
> > and found supporting / opposing findings to this hypothesis?
> >
> > If this hypothesis holds (e.g. with a similar result against the Java
> > implementation of Avro), it imo puts arrow as a strong candidate for the
> > default format of Avro implementations to deserialize into when using it
> > in-memory, which could benefit both projects?
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/DataEngineeringLabs/arrow2-benches
> >
> >
> >
>


Re: Synergies with Apache Avro?

2021-10-31 Thread Adam Lippai
Hi Jorge,

Just an idea: Do the Avro libs support different allocators? Maybe using a
different one (e.g. mimalloc) would yield more similar results by working
around the fragmentation you described.

This wouldn't change the fact that they are relatively slow; however, it
could allow you a better apples-to-apples comparison and thus better CPU
profiling and understanding of the nuances.

Best regards,
Adam Lippai


On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão 
wrote:

> Hi,
>
> I am reporting back a conclusion that I recently arrived at when adding
> support for reading Avro to Arrow.
>
> Avro is a storage format that does not have an associated in-memory
> format. In Rust, the official implementation deserializes into an enum, in
> Python into a vector of Object, and I suspect in Java into an equivalent
> vector of object. The important aspect is that all of them use fragmented memory
> regions (as opposed to what we do with e.g. one uint8 buffer for
> StringArray).
>
> I benchmarked reading to arrow vs reading via the official Avro
> implementations. The results are a bit surprising: reading 2^20 rows of 3
> byte strings is ~6x faster than the official Avro Rust implementation and
> ~20x faster vs "fastavro", a C implementation with bindings for Python (pip
> install fastavro), all with a different slope (see graph below or numbers
> and used code here [1]).
> [image: avro_read.png]
>
> I found this a bit surprising because we need to read row by row and
> perform a transpose of the data (from rows to columns) which is usually
> expensive. Furthermore, reading strings can't be optimized that much after
> all.
>
> To investigate the root cause, I drilled down to the flamegraphs for both
> the official avro rust implementation and the arrow2 implementation: the
> majority of the time in the Avro implementation is spent allocating
> individual strings (to build the [str] - equivalents); the majority of the
> time in arrow2 is equally divided between zigzag decoding (to get the
> length of the item), reallocs, and utf8 validation.
>
> My hypothesis is that the difference in performance is unrelated to a
> particular implementation of arrow or avro, but to a general concept of
> reading to [str] vs arrow. Specifically, the item by item allocation
> strategy is far worse than what we do in Arrow with a single region which
> we reallocate from time to time with exponential growth. In some
> architectures we even benefit from the __memmove_avx_unaligned_erms
> instruction that makes it even cheaper to reallocate.
>
> Has anyone else performed such benchmarks or played with Avro -> Arrow and
> found supporting / opposing findings to this hypothesis?
>
> If this hypothesis holds (e.g. with a similar result against the Java
> implementation of Avro), it imo puts arrow as a strong candidate for the
> default format of Avro implementations to deserialize into when using it
> in-memory, which could benefit both projects?
>
> Best,
> Jorge
>
> [1] https://github.com/DataEngineeringLabs/arrow2-benches
>
>
>