Sample Flume v1.4 measurements, for reference. The measurements below were taken with a single agent and 500-byte events.
Cluster config: 20-node Hadoop cluster (1 name node, 19 data nodes).
Machine config: 24 cores, Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.

1. File channel with HDFS sink (sequence file)

Source: 4 x exec source, batchSize 100,000
Sink: HDFS sink, batch size 500,000
Channel: file, number of data dirs varied from 1 to 10 (see table)

Events/sec by sink count and number of data dirs (blank cells were not measured):

Sinks | 1 dir  | 2 dirs | 4 dirs | 6 dirs | 8 dirs | 10 dirs
  1   | 14.3k  |        |        |        |        |
  2   | 21.9k  |        |        |        |        |
  4   | 35.8k  |        |        |        |        |
  8   | 24.8k  | 43.8k  | 72.5k  | 77k    | 78.6k  | 76.6k
 10   |        |        |        |        | 58k    |
 12   |        |        |        |        | 49.3k  | 49k

The goal was to find the performance sweet spot, so measurements were not taken for every point on the grid, only for the ones that made sense. For example, once performance dropped as more sinks were added, no further measurements were taken for those rows.

2. HDFS sink

Channel: memory

Events/sec by sink count and output format / batch size:

Sinks | Snappy, batch 1.2M | Snappy, batch 1.4M | Sequence file, batch 1.2M
  1   | 34.3k              | 33k                | 33k
  2   | 71k                | 75k                | 69k
  4   | 141k               | 145k               | 141k
  8   | 271k               | 273k               | 251k
 12   | 382k               | 380k               | 370k
 16   | 478k               | 538k               | 486k

Some simple observations:
* Increasing the number of dataDirs improves file channel performance, even on single-disk systems.
* Increasing the number of sinks improves throughput.
* The maximum throughput observed was about 538k events/sec for the HDFS sink, which at 500 bytes/event is roughly 270 MB/s.
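The setup in section 1 (4 exec sources, a file channel with multiple dataDirs, and an HDFS sink writing sequence files) can be sketched as a Flume agent config. Agent name, commands, and paths below are hypothetical; the batchSize, dataDirs count, and fileType values match the measurements above.

```properties
# Hypothetical agent "agent1"; only one of the four exec sources is shown.
agent1.sources = src1
agent1.channels = fc1
agent1.sinks = hdfsSink1

# Exec source with the 100k batch size used in the measurements.
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/events.log
agent1.sources.src1.batchSize = 100000
agent1.sources.src1.channels = fc1

# File channel; dataDirs is a comma-separated list. More dirs helped
# throughput even on single-disk systems (see observations).
agent1.channels.fc1.type = file
agent1.channels.fc1.checkpointDir = /flume/checkpoint
agent1.channels.fc1.dataDirs = /flume/data1,/flume/data2,/flume/data3,/flume/data4,/flume/data5,/flume/data6,/flume/data7,/flume/data8

# HDFS sink writing sequence files with the 500k batch size.
agent1.sinks.hdfsSink1.type = hdfs
agent1.sinks.hdfsSink1.channel = fc1
agent1.sinks.hdfsSink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink1.hdfs.fileType = SequenceFile
agent1.sinks.hdfsSink1.hdfs.batchSize = 500000
```

For the multi-sink rows of the table, additional sinks draining the same channel would be declared the same way with distinct names.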
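For the section 2 runs (memory channel, Snappy-compressed output), the sink-side differences come down to a few properties. A sketch, with hypothetical names and a capacity chosen for illustration:

```properties
# Memory channel instead of the file channel.
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 10000000
agent1.channels.mc1.transactionCapacity = 1200000

# HDFS sink writing Snappy-compressed output with the 1.2M batch size.
agent1.sinks.hdfsSink1.type = hdfs
agent1.sinks.hdfsSink1.channel = mc1
agent1.sinks.hdfsSink1.hdfs.fileType = CompressedStream
agent1.sinks.hdfsSink1.hdfs.codeC = snappy
agent1.sinks.hdfsSink1.hdfs.batchSize = 1200000
```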
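The MB/s figure in the last observation follows directly from events/sec times event size, which a few lines of Python can sanity-check:

```python
# Convert an events/sec measurement into decimal MB/s for a fixed event size.
def throughput_mb_per_sec(events_per_sec: int, event_size_bytes: int) -> float:
    return events_per_sec * event_size_bytes / 1_000_000

# Peak from the HDFS sink table: 538k events/sec of 500-byte events.
peak = throughput_mb_per_sec(538_000, 500)
print(f"{peak:.0f} MB/s")  # 269 MB/s
```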