Scott Carey wrote:
Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
objects in ~40 secs). For comparison, creating tuple objects from a Hadoop
SequenceFile is ~5x faster. Granted I'm comparing apples to oranges (my
objects in SequenceFile to Eelco's test in Avro).
This would depend a lot on the objects themselves, the schema,
generic vs. specific, etc.
FWIW, in microbenchmarks, accessing fields via reflection is around 100x
slower than normal field access! That makes the reflect API generally
much slower than generic and specific.
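To make the comparison concrete, here is a rough sketch of such a microbenchmark. It is not the benchmark I ran: the Point class and the FieldAccessDemo names are made up for illustration, there is no JIT warm-up, and the ratio you see will vary with the JVM and how the loop gets optimized.

```java
import java.lang.reflect.Field;

// Hypothetical record-like class with a public field, for illustration only.
class Point {
    public int x;
}

class FieldAccessDemo {
    // Sum the field n times using ordinary field access.
    static long sumDirect(Point p, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += p.x;
        }
        return sum;
    }

    // Sum the same field n times through java.lang.reflect.Field.
    static long sumReflect(Point p, int n) {
        try {
            Field f = Point.class.getField("x"); // looked up once, outside the loop
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += f.getInt(p);
            }
            return sum;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point();
        p.x = 3;
        int n = 10_000_000;

        // Crude wall-clock timing; treat the numbers as rough indications only.
        long t0 = System.nanoTime();
        long direct = sumDirect(p, n);
        long t1 = System.nanoTime();
        long reflect = sumReflect(p, n);
        long t2 = System.nanoTime();

        System.out.println("direct:  " + (t1 - t0) / 1_000_000 + " ms, sum=" + direct);
        System.out.println("reflect: " + (t2 - t1) / 1_000_000 + " ms, sum=" + reflect);
    }
}
```

Both methods compute the same sum, so any difference in elapsed time comes from how the field is read.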
Reflect is also a bit tricky to use, since you need to define classes
whose fields Avro knows how to serialize: the reflect API cannot infer
an Avro schema for every Java class, but rather only for a stylized
subset of classes (which needs to be better documented, AVRO-35).
I've found that generating classes with the specific API is both simpler
and faster. In particular, if you have a set of related classes, use a
method-free protocol file (.avpr) to define them. The Java classes are
generated by an Ant task. For example, see the patch I attached to the
following issue:
https://issues.apache.org/jira/browse/MAPREDUCE-157
The "schemata" Ant target generates a file under build/src named
Events.java that contains nested classes for each type defined in
Events.avpr. (That target would better be named "generate-avro-classes".)
Note that specific's generated code does not currently have constructors
or accessor methods. Instead, all fields are public, so to build an
instance you create it with something like 'Foo foo = new Foo();' and then
set its fields with statements like 'foo.a = ...;'. If this proves too
cumbersome, we could generate a constructor that includes all fields. I
don't see a big need for accessor methods: a public setter and getter is
equivalent to a public field. The only advantage accessors would add is
if you might someday wish to replace the class with a non-Avro-generated
implementation, change the fields, keep the accessor methods and
serialize it manually or with reflection. This does not seem like a
likely scenario to me, and it's nice to keep the generated code small.
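As a sketch of that usage pattern: the Foo class below is a hand-written stand-in, not actual generated output (real generated classes also carry schema and serialization plumbing), and the field b and the FooDemo helper are made up for illustration.

```java
// Hand-written stand-in for a generated class: public fields only,
// no constructor arguments and no accessor methods.
class Foo {
    public int a;
    public String b;
}

class FooDemo {
    // Build an instance as described above: construct it, then assign fields.
    static Foo build(int a, String b) {
        Foo foo = new Foo();
        foo.a = a;
        foo.b = b;
        return foo;
    }

    public static void main(String[] args) {
        Foo foo = build(1, "one");
        System.out.println(foo.a + " " + foo.b);
    }
}
```

A generated all-fields constructor would amount to the build() helper here, moved into the class itself.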
The primary downside of using the specific API is that you can't add
extra methods, etc. to the generated classes. You need to treat them
just as dumb structs, and keep all application logic external to them.
In practice I don't think this is a big limitation, however.
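One way to live with that, sketched under the assumption of a hypothetical TaskEvent record (the class and its fields are invented here, not taken from Events.avpr), is to put the logic in a separate class of static methods:

```java
// Hypothetical stand-in for a generated class: data only, no behavior.
class TaskEvent {
    public String taskId;
    public long startMillis;
    public long finishMillis;
}

// Application logic lives outside the struct, in a class of static helpers.
final class TaskEvents {
    private TaskEvents() {}

    // Elapsed time between the start and finish timestamps.
    static long durationMillis(TaskEvent e) {
        return e.finishMillis - e.startMillis;
    }

    // Human-readable one-line summary of the event.
    static String describe(TaskEvent e) {
        return e.taskId + " took " + durationMillis(e) + " ms";
    }

    public static void main(String[] args) {
        TaskEvent e = new TaskEvent();
        e.taskId = "t1";
        e.startMillis = 1000L;
        e.finishMillis = 1250L;
        System.out.println(describe(e));
    }
}
```

Regenerating the struct then never clobbers hand-written code, since none of it lives in the generated file.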
Doug