Hi, I'd like to share some results I'm getting with Avro. Just FYI :-)

We are using Avro to log 'audit events'. Audit events are basically simple Java objects with a few properties that describe the event. An example is the class SiteNodeDeletedEvent, with properties timeStamp, userId and siteNodeId. Most event classes have between 3 and 8 properties. What I like about doing audit logging this way, rather than just logging string messages, is that it forces us to use data structures that will be easier to analyze later, and that it makes it much easier to go through our code and find what kinds of audit events we have (all events must extend the AuditEvent base class).

We basically just use Avro to serialize these objects to rolling log files locally, which a separate daemon then puts into HDFS. We use Avro's reflection API so that we don't have to deal with code generation and can keep our development model as simple as possible.

Currently we write only eight different events to a database, and so far this has resulted in a bit over 12 million records. However, I hope to ramp up what we log, so I expect we will soon have trillions of records. I'd much rather buy more disk space than have to worry about scaling our database, and I think audit logging is a natural fit for HDFS/MR. And while I'm at it, why not make the logging itself efficient, which is where Avro comes into play.

I wrote a little framework for logging these events and tested it with our current records. In that test, I roll over each file after a million records, so I end up with 13 files (the last one holding only a quarter million records), totaling 121 MB unpacked / 36 MB gzipped (the framework typically gzips a file right after rolling over). So that's roughly 10 MB unpacked / 3 MB packed per million records. Writing those files, including reading the records from a local MySQL database and instantiating the event objects, takes 4.5 minutes on my MBP. Reading the log files back in and instantiating the events again takes 1.3 minutes.

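The rollover behaviour described above (start a new file after N records, gzip the finished one) can be sketched roughly as follows. This is only an illustration and not the actual framework: the class name RollingEventLog, the file naming, and the plain-text line format are all assumptions, and to stay dependency-free it writes strings where the real thing would hand the event object to Avro.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: rolls to a new file every maxRecords records and gzips the finished file. */
class RollingEventLog implements Closeable {
    private final Path dir;
    private final long maxRecords;
    private int fileIndex = 0;       // number of files started so far
    private long recordsInFile = 0;  // records written to the currently open file
    private BufferedWriter out;      // null when no file is open
    private Path current;

    RollingEventLog(Path dir, long maxRecords) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.maxRecords = maxRecords;
    }

    /** Appends one already-serialized record (placeholder for an Avro-encoded event). */
    void append(String record) throws IOException {
        if (out == null) {
            openNext();
        }
        out.write(record);
        out.newLine();
        if (++recordsInFile >= maxRecords) {
            roll();
        }
    }

    /** Opens the next numbered log file (assumed naming scheme: audit-N.log). */
    private void openNext() throws IOException {
        current = dir.resolve("audit-" + fileIndex++ + ".log");
        out = Files.newBufferedWriter(current, StandardCharsets.UTF_8);
        recordsInFile = 0;
    }

    /** Closes the current file, writes a gzipped copy next to it, and deletes the original. */
    private void roll() throws IOException {
        out.close();
        out = null;
        Path gz = current.resolveSibling(current.getFileName() + ".gz");
        try (OutputStream zip = new GZIPOutputStream(Files.newOutputStream(gz))) {
            Files.copy(current, zip);
        }
        Files.delete(current);
    }

    int filesStarted() {
        return fileIndex;
    }

    /** Rolls (and gzips) a partially filled last file, if there is one. */
    @Override
    public void close() throws IOException {
        if (out != null) {
            roll();
        }
    }
}
```

With a limit of 1,000 records per file, appending 12,250 records leaves 13 gzipped files behind, the last one only a quarter full, which mirrors the shape of the test run above at a smaller scale. In an Avro-based version, append would presumably pass the event to a DataFileWriter wrapped around a ReflectDatumWriter rather than writing a line of text.
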
In my book, those are pretty good figures for my humble laptop! And keep in mind that I am using the reflection API; using specific records could probably shave quite a bit off the processing time, at least when it comes to writing.

Anyway, I'm sure I won't have any trouble selling Avro to my colleagues, and I just wanted to share my experiences in case anyone is interested. It'd be awesome to read others' experiences as well. Now on to playing with MR, Pig, etc. :-)

Cheers,

Eelco
