Hi folks,
Will McQueen and I have been doing some Flume NG stress and performance 
testing, and we wanted to share some of our recent findings. The focus of the 
most recent tests has been on the syslog TCP source, memory channel, and HDFS 
sink.

I wrote some software to generate load in syslog format over TCP and to 
automate some of the analysis. Our first priority was to verify that no data 
was lost during these tests (correctness), with throughput (performance) a 
close second. I used Pig and AvroStorage from piggybank for the data integrity 
analysis, and committed the compiled (0.11 trunk) piggybank jar so the load 
analysis scripts would be relatively easy to use. It seems to be compatible 
with Pig 0.8.1. I am a little wary of having to maintain that type of thing at 
the Apache org level, so for now I have checked all the code in on GitHub under 
an ASL 2.0 license:

https://github.com/mpercy/flume-load-gen
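
For anyone who just wants a feel for what syslog-over-TCP load generation 
involves, here is a minimal sketch. It is not the code from the repository 
above. The function names, the fixed timestamp, the host and tag values, and 
the seq= payload are all assumptions made for illustration. It writes 
newline-terminated, RFC 3164-style lines over a single TCP connection, each 
carrying a sequence number that a later integrity check could use.

```python
import socket


def format_syslog_event(seq, host="loadgen", tag="flume-test", pri=13):
    """Build one RFC 3164-style syslog line carrying a sequence number.

    pri=13 encodes facility "user" (1) and severity "notice" (5): 1 * 8 + 5.
    The timestamp is fixed to keep the sketch deterministic; a real
    generator would use the current time. All defaults are hypothetical.
    """
    return "<%d>Jan  1 00:00:00 %s %s: seq=%d\n" % (pri, host, tag, seq)


def send_events(host, port, count):
    """Send `count` newline-terminated events over one TCP connection."""
    with socket.create_connection((host, port)) as sock:
        for seq in range(count):
            sock.sendall(format_syslog_event(seq).encode("ascii"))
    return count
```

Calling send_events("localhost", 5140, 100000) would push 100,000 events at 
a syslog TCP source listening on port 5140 (the port is also just an example).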

I have created a Wiki page with the performance metrics we have come up with so 
far. The executive summary is that at the time of this writing, we have 
observed Flume NG on a single machine processing events at a throughput rate of 
70,000+ events/sec with no data loss.

https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements
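
To make "events/sec with no data loss" concrete, here is a rough sketch of 
the two calculations involved. The real analysis was done with Pig and 
AvroStorage, so this is only an illustration. It assumes every generated 
event carries a unique sequence number from 0 to N-1, which is an assumption 
of this sketch, and the function names are hypothetical.

```python
from collections import Counter


def check_integrity(sent_count, received_seqs):
    """Return (missing, duplicates) given the sequence numbers that arrived.

    Assumes events were numbered 0 .. sent_count - 1 when generated.
    """
    counts = Counter(received_seqs)
    missing = [s for s in range(sent_count) if s not in counts]
    duplicates = sorted(s for s, n in counts.items() if n > 1)
    return missing, duplicates


def throughput(event_count, elapsed_seconds):
    """Average events per second over the whole run."""
    return event_count / elapsed_seconds
```

A run counts as lossless only when both lists come back empty. For example, 
700,000 events delivered in 10 seconds works out to 70,000 events/sec.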

I have put more details on the wiki page itself. Please let me know if you want 
me to add more detail. I'll be looking into improving the performance of these 
components going forward; in the meantime, we wanted to post these results to 
set a public performance baseline for Flume NG.

If others have done performance testing, we would love to see your results and 
any details you can post.

Regards,
Mike
