Hi all, I'd like to share a library I wrote to encode and decode Avro data in Python. It's called 'avroc', at https://github.com/spenczar/avroc, MIT-licensed. I wrote a bit about it here: https://journal.spencerwnelson.com/entries/avro.html.
Most runtime Avro decoders work by walking through a schema definition, traversing it in memory to decide exactly how to interpret the next bytes, step by step. They repeat this traversal for every message they decode, which adds a lot of overhead: for example, there is typically a big switch on a schema node's "type" that must be evaluated for every node in the definition, for every message. Encoders work similarly.

One way around this is code generation, which can produce efficient routines, but only if you know the schema in advance. Code generation also doesn't work for schema resolution, where you want to read with a different schema than the writer used (perhaps to drop fields you don't care about).

To get around those limitations, avroc uses a different strategy: it translates Avro schemas into a Python AST at runtime. That AST is then compiled into Python bytecode, which runs directly on the Python interpreter's virtual machine.

The results have been very promising. As measured by encoding and decoding throughput, I'm able to outperform the Java implementation on some real-world test data (admittedly, my test case uses a very deep and complicated schema).

I think this approach may be useful in other languages too, so I wanted to present it as a possible design option. I'm probably not the first to think of it; "Avro schemas as LL(1) CFG definitions" (http://avro.apache.org/docs/1.10.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html) is in the official docs and seems to be driving at a similar idea.

If you're using Python and are interested in a high-performance, pure-Python implementation, give avroc a shot - I'd love to hear from you.

-Spencer
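P.S. For anyone curious what "compile the schema at runtime" looks like, here is a minimal, hypothetical sketch - not avroc's actual code or API. It walks a tiny record schema exactly once to emit Python source for a decoder, then calls compile()/exec() so that per-message decoding involves no schema traversal at all. (avroc builds a Python AST directly rather than generating source text; generating source is just the simplest way to show the same idea. Only "long" and "string" fields are handled here.)

```python
import io

def read_long(buf):
    """Decode an Avro long: a zig-zag-encoded varint."""
    shift, acc = 0, 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo zig-zag encoding

def compile_decoder(schema):
    """Walk the schema ONCE, emitting straight-line Python source for a
    decode(buf) function, then compile it to bytecode. Illustration only:
    supports records whose fields are 'long' or 'string'."""
    lines = ["def decode(buf):", "    rec = {}"]
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if ftype == "long":
            lines.append(f"    rec[{name!r}] = read_long(buf)")
        elif ftype == "string":
            # Avro strings are a long byte-length followed by UTF-8 bytes.
            lines.append("    n = read_long(buf)")
            lines.append(f"    rec[{name!r}] = buf.read(n).decode('utf-8')")
        else:
            raise NotImplementedError(ftype)
    lines.append("    return rec")
    namespace = {"read_long": read_long}
    exec(compile("\n".join(lines), "<generated>", "exec"), namespace)
    return namespace["decode"]

# Hypothetical schema for illustration:
schema = {
    "type": "record",
    "name": "Point",
    "fields": [
        {"name": "x", "type": "long"},
        {"name": "y", "type": "string"},
    ],
}
decode = compile_decoder(schema)
# b"\x06" is zig-zag(3); b"\x04" is the length 2, then b"hi".
print(decode(io.BytesIO(b"\x06\x04hi")))  # -> {'x': 3, 'y': 'hi'}
```

The payoff is that all the branching on field types happens once, at compile time; the returned decode() is flat bytecode with no switches left in the hot path.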
