Re: Flume performance measurements
Done. Please let me know if you run into any issues.

Regards,
Arvind

On Wed, Apr 8, 2015 at 3:58 PM, Roshan Naik ros...@hortonworks.com wrote:
roshan_naik is my login to cwiki.apache.org
Re: Flume performance measurements
Added Hari to the wiki. Roshan, I could not look you up on the wiki users, can you please tell me your username? If you don't have one yet, please register and let me know.

Regards,
Arvind Prabhakar

On Wed, Apr 8, 2015 at 3:26 PM, Roshan Naik ros...@hortonworks.com wrote:
Arvind, Please do let me know once you have granted me permission to the wiki.
-roshan

From: Hari Shreedharan hshreedha...@cloudera.com
Date: Thursday, April 2, 2015 3:06 PM
To: Roshan Naik ros...@hortonworks.com
Cc: dev@flume.apache.org
Subject: Re: Flume performance measurements

Arvind - please could you grant Roshan access to the wiki.

Thanks,
Hari
Re: Flume performance measurements
roshan_naik is my login to cwiki.apache.org

On 4/8/15 3:55 PM, Arvind Prabhakar arv...@apache.org wrote:
Added Hari to the wiki. Roshan, I could not look you up on the wiki users, can you please tell me your username? If you don't have one yet, please register and let me know.

Regards,
Arvind Prabhakar
Flume performance measurements
Sample Flume v1.4 measurements for reference:

Here are some sample measurements taken with a single agent and 500-byte events.

Cluster config: 20-node Hadoop cluster (1 name node and 19 data nodes).
Machine config: 24 cores - Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.

1. File channel with HDFS Sink (Sequence File):

Source: 4 x Exec Source, 100k batchSize
HDFS Sink batch size: 500,000
Channel: File
Number of data dirs: 8

Events/Sec

Sink Count   1 data dirs   2 data dirs   4 data dirs   6 data dirs   8 data dirs   10 data dirs
8            24.8 k        43.8 k        72.5 k        77 k          78.6 k        76.6 k

The other rows of the grid are sparse. Their values are listed in the order given, since the data-dirs column each one belongs to is not preserved here:

Sink Count   Events/Sec
1            14.3 k
2            21.9 k
4            35.8 k
10           58 k
12           49.3 k, 49 k

I was looking for the sweet spot in perf, so I did not take measurements for every data point on the grid, only for the ones that made sense. For example, when perf dropped on adding more sinks, I did not take more measurements for those rows.

2. HDFS Sink:

Channel: Memory

Events/Sec

# of HDFS Sinks   Snappy BatchSz:1.2mill   Snappy BatchSz:1.4mill   Sequence File BatchSz:1.2mill
1                 34.3 k                   33 k                     33 k
2                 71 k                     75 k                     69 k
4                 141 k                    145 k                    141 k
8                 271 k                    273 k                    251 k
12                382 k                    380 k                    370 k
16                478 k                    538 k                    486 k

Some simple observations:

* Increasing the number of dataDirs helps FC perf even on single-disk systems
* Increasing the number of sinks helps
* Max throughput observed was about 538k events/sec for HDFS sink, which is approx 240 MB/s
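The figures above can be sanity-checked with a little arithmetic. The sketch below is not part of the original measurements; it only reuses the numbers from the memory-channel table (the Snappy, 1.4 million batch size column) and the stated 500-byte event size. The function names are made up for illustration. It assumes decimal megabytes and counts only raw event payload, so it ignores headers, compression and any framing overhead.

```python
# Illustrative only: numbers are copied from the memory-channel table above.
EVENT_SIZE_BYTES = 500  # event size stated in the message

# sink count -> events/sec for the "Snappy BatchSz:1.4mill" column
SNAPPY_1_4M = {1: 33_000, 2: 75_000, 4: 145_000, 8: 273_000, 12: 380_000, 16: 538_000}


def payload_mb_per_sec(events_per_sec, event_size_bytes=EVENT_SIZE_BYTES):
    """Raw event payload rate in decimal megabytes/sec (1 MB = 10**6 bytes)."""
    return events_per_sec * event_size_bytes / 1_000_000


def per_sink_rate(table):
    """Events/sec divided by sink count, to show how close scaling is to linear."""
    return {sinks: rate / sinks for sinks, rate in table.items()}


if __name__ == "__main__":
    peak = max(SNAPPY_1_4M.values())
    print(f"peak: {peak} events/sec = {payload_mb_per_sec(peak):.0f} MB/s of payload")
    for sinks, rate in per_sink_rate(SNAPPY_1_4M).items():
        print(f"{sinks:>2} sinks: {rate / 1000:.1f} k events/sec per sink")
```

With exactly 500 bytes per event, 538k events/sec works out to 269 MB/s of payload, somewhat above the approx 240 MB/s quoted; the gap may come down to how the MB/s figure was measured or to the effective event size, but that is a guess. The per-sink rate stays in the range of roughly 32k to 38k events/sec from 1 to 16 sinks, which is consistent with the observation that adding sinks helps.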
Re: Flume performance measurements
Roshan,

Could you update the performance measurements page on our wiki with this info? That would be more useful to reference.

Thanks,
Hari

On Thu, Apr 2, 2015 at 2:34 PM, Roshan Naik ros...@hortonworks.com wrote:
Sample Flume v1.4 Measurements for reference: Here are some sample measurements taken with a single agent and 500 byte events. [...]
Re: Flume performance measurements
Arvind - please could you grant Roshan access to the wiki.

Thanks,
Hari

On Thu, Apr 2, 2015 at 3:04 PM, Roshan Naik ros...@hortonworks.com wrote:
Could you grant me write access to the wiki? username: roshannaik

On 4/2/15 2:53 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:
Roshan, Could you update the performance measurements page on our wiki with this info? That would be more useful to reference.

Thanks,
Hari