Re: user experience

Doug Cutting Wed, 02 Sep 2009 10:06:10 -0700

Scott Carey wrote:

Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
objects in SequenceFile to Eelco's test in Avro).


This would depend a lot on the objects themselves, the schema, and
generic vs. specific, etc.

FWIW, in microbenchmarks, accessing fields via reflection is around 100x slower than normal field access! That makes the reflect API generally much slower than generic and specific.
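
To make the comparison concrete, here is a minimal sketch of the two access paths using plain java.lang.reflect, not Avro's reflect API itself. The Point class and the FieldAccessDemo helper are hypothetical names made up for illustration, and a naive timing loop like this is not a rigorous benchmark (JIT warm-up and the JVM version will change the ratio), so treat any numbers it prints as indicative only.

```java
import java.lang.reflect.Field;

// Hypothetical record-like class, not part of Avro.
class Point {
    public int x;
    public int y;
}

class FieldAccessDemo {
    // Direct access: the compiler emits a plain field read.
    static long sumDirect(Point p, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += p.x;
        }
        return sum;
    }

    // Reflective access: every read goes through Field.getInt.
    static long sumReflective(Point p, int iterations) {
        try {
            Field f = Point.class.getField("x");
            long sum = 0;
            for (int i = 0; i < iterations; i++) {
                sum += f.getInt(p);
            }
            return sum;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point();
        p.x = 3;
        int n = 1_000_000;

        long t0 = System.nanoTime();
        long direct = sumDirect(p, n);
        long t1 = System.nanoTime();
        long reflective = sumReflective(p, n);
        long t2 = System.nanoTime();

        System.out.println("direct:     " + (t1 - t0) / 1_000 + " us, sum=" + direct);
        System.out.println("reflective: " + (t2 - t1) / 1_000 + " us, sum=" + reflective);
    }
}
```

Both methods compute the same sum; only the way the field is read differs.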

Reflect is also a bit tricky to use, since you need to define classes whose fields Avro knows how to serialize: the reflect API cannot infer an Avro schema for every Java class, but rather only for a stylized subset of classes (which needs to be better documented, AVRO-35).

I've found that generating classes with the specific API is both simpler and faster. In particular, if you have a set of related classes, use a method-free protocol file (.avpr) to define them. The Java classes are generated by an Ant task. For example, see the patch I attached to the following issue:


https://issues.apache.org/jira/browse/MAPREDUCE-157

The "schemata" Ant target generates a file under build/src named Events.java that contains nested classes for each type defined in Events.avpr. (That target would better be named "generate-avro-classes".)

Note that specific's generated code does not currently have constructors or accessor methods. Instead all fields are public, so, to build an instance you create it with something like 'Foo foo = new Foo();' then set all its fields with things like 'foo.a = ...;'. If this proves too cumbersome, we could generate a constructor that includes all fields. I don't see a big need for accessor methods: a public setter and getter is equivalent to a public field. The only advantage accessors would add is if you might someday wish to replace the class with a non-Avro-generated implementation, change the fields, keep the accessor methods and serialize it manually or with reflection. This does not seem like a likely scenario to me, and it's nice to keep the generated code small.
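
A minimal sketch of that usage pattern follows. The Foo class here is a hand-written, hypothetical stand-in with made-up fields a and b; the code the specific API actually generates will look different, and only the "no-arg constructor plus public fields" shape is taken from the description above.

```java
// Hypothetical stand-in for a generated class: public fields,
// no explicit constructor, no accessor methods.
class Foo {
    public int a;
    public String b;
}

class FooDemo {
    // Build an instance the way the text describes:
    // construct it empty, then assign each public field.
    static Foo build() {
        Foo foo = new Foo();
        foo.a = 42;
        foo.b = "hello";
        return foo;
    }

    public static void main(String[] args) {
        Foo foo = build();
        // Fields are read directly as well; there are no getters.
        System.out.println(foo.a + " " + foo.b);
    }
}
```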

The primary downside of using the specific API is that you can't add extra methods, etc. to the generated classes. You need to treat them just as dumb structs, and keep all application logic external to them. In practice I don't think this is a big limitation, however.
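
One way to keep the logic external is a small class of static helpers that take the struct as an argument. The sketch below assumes a hypothetical LogEvent type standing in for a generated class; LogEvents and its methods are likewise invented for illustration and are not part of Avro.

```java
// Hypothetical stand-in for a generated class; treated as a dumb struct.
class LogEvent {
    public long timestampMillis;
    public String message;
}

// Application logic lives outside the struct, in static helpers.
final class LogEvents {
    private LogEvents() {}

    // Factory helper that plays the role of an all-fields constructor.
    static LogEvent of(long timestampMillis, String message) {
        LogEvent e = new LogEvent();
        e.timestampMillis = timestampMillis;
        e.message = message;
        return e;
    }

    // True if the event happened strictly before the cutoff.
    static boolean isOlderThan(LogEvent e, long cutoffMillis) {
        return e.timestampMillis < cutoffMillis;
    }

    // Human-readable rendering, kept out of the struct itself.
    static String describe(LogEvent e) {
        return e.timestampMillis + ": " + e.message;
    }

    public static void main(String[] args) {
        LogEvent e = of(1000L, "started");
        System.out.println(describe(e));
        System.out.println(isOlderThan(e, 2000L));
    }
}
```

Because the helpers never modify the struct's definition, regenerating the classes from the protocol file does not touch them.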


Doug
