Hi folks, Will McQueen and I have been doing some Flume NG stress and performance testing, and we wanted to share some of our recent findings. The focus of the most recent tests has been on the syslog TCP source, memory channel, and HDFS sink.
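For readers who have not set up this combination of components, the pipeline under test (syslog TCP source, memory channel, HDFS sink) can be sketched as a Flume NG agent configuration along these lines. This is an illustration only: the agent and component names, the port, the channel capacity, and the HDFS path are placeholder assumptions, not the settings used in the tests.

```properties
# Hypothetical agent named "agent"; all names and values are placeholders.
agent.sources = syslog-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Syslog TCP source: listens for syslog-formatted lines over TCP.
agent.sources.syslog-src.type = syslogtcp
agent.sources.syslog-src.host = 0.0.0.0
agent.sources.syslog-src.port = 5140
agent.sources.syslog-src.channels = mem-ch

# Memory channel: buffers events in RAM between source and sink.
agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 100000

# HDFS sink: writes events out to files under the given path.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events
```

The actual configurations used for the measurements are the ones documented with the results, not this sketch.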
I wrote some software to generate load in syslog format over TCP and to automate some of the analysis. The first thing we wanted to verify was that no data was lost during these tests (correctness), with throughput (performance) a close second priority.

I used Pig and AvroStorage from piggybank in the data integrity analysis, and committed the compiled (0.11 trunk) piggybank jar so the load analysis scripts would be relatively easy to use. It seems to be compatible with Pig 0.8.1. I am a little wary of having to maintain that type of thing at the Apache org level, so for now I have checked all the code in on GitHub under an ASL 2.0 license:

https://github.com/mpercy/flume-load-gen

I have created a wiki page with the performance metrics we have come up with so far. The executive summary is that, at the time of this writing, we have observed Flume NG on a single machine processing events at a rate of 70,000+ events/sec with no data loss.

https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements

I have put more details on the wiki page itself. Please let me know if you want me to add more detail.

I'll be looking into improving the performance of these components going forward; however, we wanted to post these results to set a public performance baseline for Flume NG. If others have done performance testing, we would love to see your results if you can post the details.

Regards,
Mike
