Scott Carey wrote:
Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
objects in SequenceFile to Eelco's test in Avro).

This would depend a lot on the objects themselves, the schema, and
whether you use the generic or specific API, etc.

FWIW, in microbenchmarks, accessing fields via reflection is around 100x slower than normal field access! That makes the reflect API generally much slower than generic and specific.
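A rough way to see this gap for yourself is sketched below. It is not an Avro benchmark: the Sample class and its field x are hypothetical names made up for the illustration, and the timing loop is deliberately naive (no warm-up control, no JMH), so the ratio it prints will vary with the JVM and should be read as indicative only.

```java
import java.lang.reflect.Field;

// Hypothetical class with one public field; not part of Avro.
class Sample {
    public int x;
}

class FieldAccessDemo {
    // Sum the field using ordinary field access.
    static long sumDirect(Sample s, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += s.x;
        }
        return sum;
    }

    // Sum the same field through java.lang.reflect.Field.
    static long sumReflective(Sample s, int iterations) {
        try {
            Field f = Sample.class.getField("x");
            long sum = 0;
            for (int i = 0; i < iterations; i++) {
                sum += f.getInt(s);
            }
            return sum;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        s.x = 1;
        int n = 10_000_000;

        long t0 = System.nanoTime();
        long direct = sumDirect(s, n);
        long t1 = System.nanoTime();
        long reflective = sumReflective(s, n);
        long t2 = System.nanoTime();

        // Both paths read the same data; only the access mechanism differs.
        System.out.println("direct sum=" + direct + " ns=" + (t1 - t0));
        System.out.println("reflective sum=" + reflective + " ns=" + (t2 - t1));
    }
}
```

Both methods return the same sum, so any difference in the printed times comes from the access path alone.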

Reflect is also a bit tricky to use, since you need to define classes whose fields Avro knows how to serialize: the reflect API cannot infer an Avro schema for every Java class, but rather only for a stylized subset of classes (which needs to be better documented, AVRO-35).

I've found that generating classes with the specific API is both simpler and faster. In particular, if you have a set of related classes, use a method-free protocol file (.avpr) to define them. The Java classes are generated by an Ant task. For example, see the patch I attached to the following issue:

https://issues.apache.org/jira/browse/MAPREDUCE-157

The "schemata" Ant target generates a file under build/src named Events.java that contains nested classes for each type defined in Events.avpr. (That target would better be named "generate-avro-classes".)

Note that specific's generated code does not currently have constructors or accessor methods. Instead all fields are public, so to build an instance you create it with something like 'Foo foo = new Foo();' and then set its fields with statements like 'foo.a = ...;'. If this proves too cumbersome, we could generate a constructor that takes all fields. I don't see a big need for accessor methods: a public setter and getter are equivalent to a public field. The only advantage accessors would add is if you might someday wish to replace the class with a non-Avro-generated implementation, change the fields, keep the accessor methods, and serialize it manually or with reflection. This does not seem like a likely scenario to me, and it's nice to keep the generated code small.
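To make the two styles concrete, here is a hand-written sketch. Neither class is actual output of the Avro code generator: Foo and its field a come from the example above, while the field b and the class FooWithConstructor are assumptions added purely to show what an all-fields constructor might look like.

```java
// Stand-in for a generated class in the current style:
// public fields only, no explicit constructor, no accessors.
class Foo {
    public int a;
    public String b; // hypothetical second field
}

// Stand-in for what generated code might look like if an
// all-fields constructor were added. The name is hypothetical.
class FooWithConstructor {
    public int a;
    public String b;

    FooWithConstructor(int a, String b) {
        this.a = a;
        this.b = b;
    }
}

class FooDemo {
    public static void main(String[] args) {
        // Current style: create, then assign each field.
        Foo foo = new Foo();
        foo.a = 7;
        foo.b = "seven";

        // Possible alternative: one call sets every field.
        FooWithConstructor foo2 = new FooWithConstructor(7, "seven");

        System.out.println(foo.a + " " + foo.b);
        System.out.println(foo2.a + " " + foo2.b);
    }
}
```

With only two fields the difference is small; the constructor form would matter more for records with many fields, at the cost of having to pass them in a fixed order.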

The primary downside of using the specific API is that you can't add extra methods, etc. to the generated classes. You need to treat them just as dumb structs, and keep all application logic external to them. In practice I don't think this is a big limitation, however.
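One way to live with this is to put the logic in a separate helper class, roughly as sketched here. LogEvent stands in for a generated class and LogEventUtil for the application code; both names and all fields are hypothetical and are not taken from Events.avpr or from Avro itself.

```java
// Stand-in for a generated class: data only, treated as a dumb struct.
class LogEvent {
    public long timestampMillis;
    public String message;
}

// Application logic lives outside the generated class,
// so regenerating LogEvent never overwrites it.
final class LogEventUtil {
    private LogEventUtil() {}

    // True if the event happened strictly before the given cutoff.
    static boolean isOlderThan(LogEvent e, long cutoffMillis) {
        return e.timestampMillis < cutoffMillis;
    }

    // Render the event as a single line of text.
    static String format(LogEvent e) {
        return e.timestampMillis + " " + e.message;
    }
}

class LogEventDemo {
    public static void main(String[] args) {
        LogEvent e = new LogEvent();
        e.timestampMillis = 42L;
        e.message = "hello";

        System.out.println(LogEventUtil.format(e));
        System.out.println(LogEventUtil.isOlderThan(e, 100L));
    }
}
```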

Doug
