Sample Flume v1.4 Measurements for reference:

Here are some sample measurements taken with a single agent and 500-byte events.

Cluster Config: 20-node Hadoop cluster (1 name node and 19 data nodes).

Machine Config: 24 cores - Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.


1.     File channel with HDFS Sink (Sequence File):

Source: 4 x Exec Source, 100k batchSize

HDFS Sink Batch size: 500,000

Channel: File

Number of data dirs: 8
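The original post does not include the agent config used, so here is a minimal sketch of the described setup (one of the 4 exec sources shown; component names a1/r1/c1/k1, the tail command, and all paths are hypothetical), using Flume's standard file-channel and HDFS-sink properties:

```properties
# Hypothetical agent config for setup 1: exec source -> file channel -> HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Exec source with the 100k batchSize from the test
# (the actual test ran 4 such sources; r2..r4 would be configured the same way)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.batchSize = 100000
a1.sources.r1.channels = c1

# File channel spread across 8 data dirs (comma-separated)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /flume/checkpoint
a1.channels.c1.dataDirs = /flume/d1,/flume/d2,/flume/d3,/flume/d4,/flume/d5,/flume/d6,/flume/d7,/flume/d8

# HDFS sink writing sequence files with the 500k batch size from the test
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.batchSize = 500000
```

The multi-sink rows in the table below would add k2, k3, ... against the same channel.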






Events/sec:

Sink Count | 1 data dir | 2 data dirs | 4 data dirs | 6 data dirs | 8 data dirs | 10 data dirs
---------- | ---------- | ----------- | ----------- | ----------- | ----------- | ------------
1          | 14.3 k     |             |             |             |             |
2          | 21.9 k     |             |             |             |             |
4          |            | 35.8 k      |             |             |             |
8          | 24.8 k     | 43.8 k      | 72.5 k      | 77 k        | 78.6 k      | 76.6 k
10         |            |             | 58 k        |             |             |
12         |            |             | 49.3 k      | 49 k        |             |





I was looking for the sweet spot in perf, so I did not take measurements for all
data points on the grid, only for the ones that made sense. For example, when perf
dropped after adding more sinks, I did not take more measurements for those rows.


2.     HDFS Sink:

Channel: Memory
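Again the actual config is not in the post; a sketch of the Snappy variant of this setup (hypothetical names and capacities, using Flume's standard memory-channel and HDFS-sink properties) might look like:

```properties
# Hypothetical channel/sink config for setup 2: memory channel -> HDFS sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000000
# transactionCapacity must be at least the sink batch size
a1.channels.c1.transactionCapacity = 1400000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# CompressedStream + snappy codec (requires native Snappy libraries on the agent);
# the "Sequence File" column would use hdfs.fileType = SequenceFile instead
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.batchSize = 1200000
```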



Events/sec:

# of HDFS Sinks | Snappy, batchSize 1.2M | Snappy, batchSize 1.4M | Sequence File, batchSize 1.2M
--------------- | ---------------------- | ---------------------- | -----------------------------
1               | 34.3 k                 | 33 k                   | 33 k
2               | 71 k                   | 75 k                   | 69 k
4               | 141 k                  | 145 k                  | 141 k
8               | 271 k                  | 273 k                  | 251 k
12              | 382 k                  | 380 k                  | 370 k
16              | 478 k                  | 538 k                  | 486 k



Some simple observations:

  *   Increasing the number of dataDirs helps file channel perf, even on single-disk systems
  *   Increasing the number of sinks helps
  *   Max throughput observed was about 538k events/sec for the HDFS sink, which is
approx 269 MB/s at 500 bytes/event
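As a quick back-of-envelope check, the peak event rate converts to bytes like this:

```python
# Convert the observed peak event rate to throughput in MB/s,
# using the 500-byte event size stated at the top of the post.
events_per_sec = 538_000
event_size_bytes = 500

mb_per_sec = events_per_sec * event_size_bytes / 1_000_000
print(f"{mb_per_sec:.0f} MB/s")  # 269 MB/s
```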
