Re: Flume performance measurements

2015-04-08 Thread Arvind Prabhakar
Done. Please let me know if you run into any issues.

Regards,
Arvind

On Wed, Apr 8, 2015 at 3:58 PM, Roshan Naik ros...@hortonworks.com wrote:

 roshan_naik is my login to cwiki.apache.org




 On 4/8/15 3:55 PM, Arvind Prabhakar arv...@apache.org wrote:

 Added Hari to the wiki.
 
 Roshan, I could not look you up on the wiki users, can you please tell me
 your username? If you don't have one yet, please register and let me know.
 
 Regards,
 Arvind Prabhakar
 
 On Wed, Apr 8, 2015 at 3:26 PM, Roshan Naik ros...@hortonworks.com wrote:
 
  Arvind,
  Please do let me know once you have granted me permission to the wiki.
  -roshan
 
  From: Hari Shreedharan hshreedha...@cloudera.com
  Date: Thursday, April 2, 2015 3:06 PM
  To: Roshan Naik ros...@hortonworks.com
  Cc: dev@flume.apache.org
  Subject: Re: Flume performance measurements
 
  Arvind - please could you grant Roshan access to the wiki.
 
  Thanks,
  Hari
 
 
 
  On Thu, Apr 2, 2015 at 3:04 PM, Roshan Naik ros...@hortonworks.com wrote:
 
  Could you grant me write access to the wiki?
  username: roshannaik
 
 
 
  On 4/2/15 2:53 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:
 
  Roshan,
 
  Could you update the performance measurements page on our wiki with this
  info? That would be more useful to reference.
 
  Thanks, Hari
  
 
 
 




Re: Flume performance measurements

2015-04-08 Thread Arvind Prabhakar
Added Hari to the wiki.

Roshan, I could not look you up on the wiki users, can you please tell me
your username? If you don't have one yet, please register and let me know.

Regards,
Arvind Prabhakar

On Wed, Apr 8, 2015 at 3:26 PM, Roshan Naik ros...@hortonworks.com wrote:

 Arvind,
 Please do let me know once you have granted me permission to the wiki.
 -roshan

 From: Hari Shreedharan hshreedha...@cloudera.com
 Date: Thursday, April 2, 2015 3:06 PM
 To: Roshan Naik ros...@hortonworks.com
 Cc: dev@flume.apache.org
 Subject: Re: Flume performance measurements

 Arvind - please could you grant Roshan access to the wiki.

 Thanks,
 Hari



 On Thu, Apr 2, 2015 at 3:04 PM, Roshan Naik ros...@hortonworks.com wrote:

 Could you grant me write access to the wiki?
 username: roshannaik



 On 4/2/15 2:53 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:

 Roshan,

 Could you update the performance measurements page on our wiki with this
 info? That would be more useful to reference.

 Thanks, Hari
 





Re: Flume performance measurements

2015-04-08 Thread Roshan Naik
roshan_naik is my login to cwiki.apache.org




On 4/8/15 3:55 PM, Arvind Prabhakar arv...@apache.org wrote:

Added Hari to the wiki.

Roshan, I could not look you up on the wiki users, can you please tell me
your username? If you don't have one yet, please register and let me know.

Regards,
Arvind Prabhakar

On Wed, Apr 8, 2015 at 3:26 PM, Roshan Naik ros...@hortonworks.com wrote:

 Arvind,
 Please do let me know once you have granted me permission to the wiki.
 -roshan

 From: Hari Shreedharan hshreedha...@cloudera.com
 Date: Thursday, April 2, 2015 3:06 PM
 To: Roshan Naik ros...@hortonworks.com
 Cc: dev@flume.apache.org
 Subject: Re: Flume performance measurements

 Arvind - please could you grant Roshan access to the wiki.

 Thanks,
 Hari



 On Thu, Apr 2, 2015 at 3:04 PM, Roshan Naik ros...@hortonworks.com wrote:

 Could you grant me write access to the wiki?
 username: roshannaik



 On 4/2/15 2:53 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:

 Roshan,

 Could you update the performance measurements page on our wiki with this
 info? That would be more useful to reference.

 Thanks, Hari
 






Flume performance measurements

2015-04-02 Thread Roshan Naik
Sample Flume v1.4 Measurements for reference:

Here are some sample measurements taken with a single agent and 500 byte events.

Cluster Config: 20-node Hadoop cluster (1 name node and 19 data nodes).

Machine Config: 24 cores - Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.


1. File channel with HDFS Sink (Sequence File):

Source: 4 x Exec Source, 100k batchSize

HDFS Sink Batch size: 500,000

Channel: File

Number of data dirs: 8
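
For anyone trying to reproduce a setup like this, here is a rough sketch that renders a Flume properties file with 4 exec sources, one file channel with several dataDirs, and several HDFS sinks. It is not the configuration used for these measurements. The agent and component names, the directory paths, the tail command, the default of 8 sinks, and the transactionCapacity value are all assumptions made for illustration.

```python
def render_flume_config(agent="a1", num_sources=4, num_sinks=8,
                        num_data_dirs=8, source_batch=100_000,
                        sink_batch=500_000):
    """Render a Flume properties file: exec sources -> file channel -> HDFS sinks."""
    sources = [f"r{i}" for i in range(1, num_sources + 1)]
    sinks = [f"k{i}" for i in range(1, num_sinks + 1)]
    # Hypothetical paths: one directory per dataDir, comma separated.
    data_dirs = ",".join(f"/flume/data{i}" for i in range(1, num_data_dirs + 1))

    lines = [
        f"{agent}.sources = {' '.join(sources)}",
        f"{agent}.channels = c1",
        f"{agent}.sinks = {' '.join(sinks)}",
        f"{agent}.channels.c1.type = file",
        f"{agent}.channels.c1.checkpointDir = /flume/checkpoint",
        f"{agent}.channels.c1.dataDirs = {data_dirs}",
        # Assumption: sized so a single channel transaction can hold one full batch.
        f"{agent}.channels.c1.transactionCapacity = {max(source_batch, sink_batch)}",
    ]
    for s in sources:
        lines += [
            f"{agent}.sources.{s}.type = exec",
            # Hypothetical command; the original message does not say what was run.
            f"{agent}.sources.{s}.command = tail -F /var/log/app.log",
            f"{agent}.sources.{s}.batchSize = {source_batch}",
            f"{agent}.sources.{s}.channels = c1",
        ]
    for k in sinks:
        lines += [
            f"{agent}.sinks.{k}.type = hdfs",
            f"{agent}.sinks.{k}.channel = c1",
            f"{agent}.sinks.{k}.hdfs.path = /flume/events",
            # Distinct prefix per sink so the sinks do not write to the same file name.
            f"{agent}.sinks.{k}.hdfs.filePrefix = {k}",
            f"{agent}.sinks.{k}.hdfs.fileType = SequenceFile",
            f"{agent}.sinks.{k}.hdfs.batchSize = {sink_batch}",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_flume_config())
```

Changing num_sinks and num_data_dirs gives one config per cell of the grid below.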






Events/Sec

Sink Count   1 data dirs   2 data dirs   4 data dirs   6 data dirs   8 data dirs   10 data dirs
1            14.3 k        -             -             -             -             -
2            21.9 k        -             -             -             -             -
4            -             35.8 k        -             -             -             -
8            24.8 k        43.8 k        72.5 k        77 k          78.6 k        76.6 k
10           -             -             58 k          -             -             -
12           -             -             49.3 k        49 k          -             -



I was looking for the sweet spot in perf, so I did not take measurements for
all data points on the grid, only for the ones that made sense. For example,
when perf dropped after adding more sinks, I did not take more measurements
for those rows.


2. HDFS Sink:

Channel: Memory



# of HDFS Sinks   Snappy            Snappy            Sequence File
                  BatchSz:1.2mill   BatchSz:1.4mill   BatchSz:1.2mill
1                 34.3 k            33 k              33 k
2                 71 k              75 k              69 k
4                 141 k             145 k             141 k
8                 271 k             273 k             251 k
12                382 k             380 k             370 k
16                478 k             538 k             486 k
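
To see how close the memory channel numbers are to linear scaling, the figures above can be divided by what perfect scaling from the 1-sink result would give. This is only a sketch over the table values (events/sec in thousands). The function name and the dictionary labels are made up for illustration.

```python
# Events/sec in thousands, copied from the memory channel table above.
SINKS = [1, 2, 4, 8, 12, 16]
THROUGHPUT_K = {
    "Snappy, 1.2M batch": [34.3, 71, 141, 271, 382, 478],
    "Snappy, 1.4M batch": [33, 75, 145, 273, 380, 538],
    "Sequence File, 1.2M batch": [33, 69, 141, 251, 370, 486],
}


def scaling_efficiency(sinks, throughput):
    """Throughput at each sink count relative to perfect linear scaling
    extrapolated from the first (1-sink) measurement."""
    per_sink_base = throughput[0] / sinks[0]
    return [t / (n * per_sink_base) for n, t in zip(sinks, throughput)]


if __name__ == "__main__":
    for name, series in THROUGHPUT_K.items():
        eff = scaling_efficiency(SINKS, series)
        print(name, " ".join(f"{n}:{e:.2f}" for n, e in zip(SINKS, eff)))
```

A value of 1.00 means the configuration scaled exactly linearly from the single-sink run. Values below 1.00 mean each added sink contributed less than the first one did.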



Some simple observations:

  *   Increasing the number of dataDirs helps FC perf even on single-disk systems
  *   Increasing the number of sinks helps
  *   Max throughput observed was about 538k events/sec for the HDFS sink, which
      is approx 240MB/s
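
As a sanity check on the last observation, the events/sec figure can be converted to a payload byte rate using the 500 byte event size stated at the top. This is a sketch only and the function name is made up. Note that the raw product comes out somewhat higher than the approx 240MB/s quoted above, so that number presumably was measured on a different basis (for example bytes actually written, or a slightly smaller average event size). That is a guess, not something stated in this message.

```python
EVENT_SIZE_BYTES = 500         # event size stated at the top of this message
PEAK_EVENTS_PER_SEC = 538_000  # best figure in the memory channel table


def payload_rate(events_per_sec, event_size_bytes):
    """Return the raw payload rate as (MB/s, MiB/s), using 10^6 and 2^20 bytes."""
    bytes_per_sec = events_per_sec * event_size_bytes
    return bytes_per_sec / 1e6, bytes_per_sec / 2**20


if __name__ == "__main__":
    mb, mib = payload_rate(PEAK_EVENTS_PER_SEC, EVENT_SIZE_BYTES)
    print(f"{mb:.0f} MB/s ({mib:.1f} MiB/s)")
```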


Re: Flume performance measurements

2015-04-02 Thread Hari Shreedharan
Roshan,

Could you update the performance measurements page on our wiki with this info?
That would be more useful to reference.

Thanks, Hari


Re: Flume performance measurements

2015-04-02 Thread Hari Shreedharan
Arvind - please could you grant Roshan access to the wiki.

Thanks, Hari

On Thu, Apr 2, 2015 at 3:04 PM, Roshan Naik ros...@hortonworks.com wrote:

 Could you grant me write access to the wiki?
 username: roshannaik

 On 4/2/15 2:53 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:

 Roshan,

 Could you update the performance measurements page on our wiki with this
 info? That would be more useful to reference.

 Thanks, Hari
