Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2014-02-10 Thread Guang Yang
Thanks all for the help.

We finally identified the root cause of the issue was due to a lock contention 
happening at folder splitting and here is a tracking ticket (thanks Inktank for 
the fix!): http://tracker.ceph.com/issues/7207
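
For anyone hitting the same issue: the FileStore directory split/merge behaviour
is tunable. A minimal ceph.conf sketch - the option names are the standard
FileStore tunables of this Ceph generation, but the values below are purely
illustrative, not our production settings:

[osd]
# allow more objects per directory before a split is triggered
filestore split multiple = 8
# a negative threshold is commonly used to avoid merge/split churn
filestore merge threshold = -10

Splitting still has to happen eventually, so this only defers the cost; raising
the values (or pre-splitting) before a pool grows moves the work to a quieter time.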

Thanks,
Guang


On Tuesday, December 31, 2013 8:22 AM, Guang Yang  wrote:
 
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> The performance of the cluster (write traffic only) was stable until Dec
> 25th, when our monitoring of the radosgw access log showed a significant
> increase in both average latency and 99th-percentile latency.
>
> I then picked one OSD host, grepped its logs for slow requests, and found
> that most of the slow requests were waiting for subops. Take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598 and 629; I
> ran the same procedure on another host, osd22, and got the same pattern.
>
> Then I turned to the host that has osd598 and dumped the perf counters for
> comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
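
[A guess at what ~/do_calc_op_latency.pl computes: the numbers above look like
sum/avgcount for each latency counter. A rough equivalent, assuming jq is
available and that op_latency/subop_latency are {avgcount, sum} pairs under the
"osd" section of perf dump:

for i in {594..604}; do
  echo -n "osd.$i "
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump \
    | jq -r '[.osd.op_latency, .osd.subop_latency] | map(.sum / .avgcount) | @csv'
done

avgcount can be 0 on an idle OSD, so guard against division by zero before
relying on this.]
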
>
> This further confirmed that osd 598 was having performance issues
> (it has around *30 seconds* average op latency!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
> latency difference is not as significant as what we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Mark, my comments inline...

Date: Mon, 30 Dec 2013 07:36:56 -0600
From: Mark Nelson 
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 05:45 AM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> The performance of the cluster (write traffic only) was stable until Dec
> 25th, when our monitoring of the radosgw access log showed a significant
> increase in both average latency and 99th-percentile latency.
>
> I then picked one OSD host, grepped its logs for slow requests, and found
> that most of the slow requests were waiting for subops. Take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598 and 629; I
> ran the same procedure on another host, osd22, and got the same pattern.
>
> Then I turned to the host that has osd598 and dumped the perf counters for
> comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This further confirmed that osd 598 was having performance issues
> (it has around *30 seconds* average op latency!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
> latency difference is not as significant as what we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that they impact the performance
> of the entire cluster?

You may want to use the dump_historic_ops command in the admin socket 
for the slow OSDs.  That will give you some clues regarding where the ops are 
hanging up in the OSD.

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-31 Thread Guang Yang
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> The performance of the cluster (write traffic only) was stable until Dec
> 25th, when our monitoring of the radosgw access log showed a significant
> increase in both average latency and 99th-percentile latency.
>
> I then picked one OSD host, grepped its logs for slow requests, and found
> that most of the slow requests were waiting for subops. Take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598 and 629; I
> ran the same procedure on another host, osd22, and got the same pattern.
>
> Then I turned to the host that has osd598 and dumped the perf counters for
> comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This further confirmed that osd 598 was having performance issues
> (it has around *30 seconds* average op latency!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
> latency difference is not as significant as what we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that they impact the performance
> of the entire cluster?
>

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you run 
btrfs, that might be the problem.

Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Guang Yang
Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander 
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> The performance of the cluster (write traffic only) was stable until Dec
> 25th, when our monitoring of the radosgw access log showed a significant
> increase in both average latency and 99th-percentile latency.
>
> I then picked one OSD host, grepped its logs for slow requests, and found
> that most of the slow requests were waiting for subops. Take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out most of the slow requests were waiting for osd 598 and 629; I
> ran the same procedure on another host, osd22, and got the same pattern.
>
> Then I turned to the host that has osd598 and dumped the perf counters for
> comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.246379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This further confirmed that osd 598 was having performance issues
> (it has around *30 seconds* average op latency!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
> latency difference is not as significant as what we saw from osd perf.
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4      12.1    4.8
>
> Any idea why a couple of OSDs are so slow that they impact the performance
> of the entire cluster?
>

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with 
different partitions.
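
[yguang] For reference, a sketch of how that layout maps to ceph.conf - the
option names are standard for this release, but the paths and values here are
illustrative rather than our exact production settings:

[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,inode64

[osd.598]
osd data = /var/lib/ceph/osd/ceph-598
# journal on a second partition of the same spindle as the data partition
osd journal = /dev/sdf2

Since journal and data share the disk, every client write costs that spindle at
least two writes, so one slow disk shows up twice in the latency numbers.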

Wido

> Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Mark Nelson

On 12/30/2013 05:45 AM, Guang wrote:

Hi ceph-users and ceph-devel,
Merry Christmas and Happy New Year!

We have a ceph cluster with radosgw, our customer is using S3 API to
access the cluster.

The basic information of the cluster is:
bash-4.1$ ceph -s
   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 3 mons at
{osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
election epoch 40, quorum 0,1,2 osd151,osd152,osd153
osdmap e129885: 787 osds: 758 up, 758 in
 pgmap v1884502: 22203 pgs: 22125 active+clean, 1
active+clean+scrubbing, 1 active+clean+inconsistent, 76
active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
TB avail
mdsmap e1: 0/0/1 up

#When the latency peak happened, there was no scrubbing, recovering or
backfilling at the moment.#

The performance of the cluster (write traffic only) was stable until Dec
25th, when our monitoring of the radosgw access log showed a significant
increase in both average latency and 99th-percentile latency.

I then picked one OSD host, grepped its logs for slow requests, and found
that most of the slow requests were waiting for subops. Take osd22 for
example.

osd[561-571] are hosted by osd22.
-bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
~/slow_osd.txt
-bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
3586 656,598
 289 467,629
 284 598,763
 279 584,598
 203 172,598
 182 598,6
 155 629,646
  83 631,598
  65 631,593
  21 616,629
  20 609,671
  20 609,390
  13 609,254
  12 702,629
  12 629,641
  11 665,613
  11 593,724
  11 361,591
  10 591,709
   9 681,609
   9 609,595
   9 591,772
   8 613,662
   8 575,591
   7 674,722
   7 609,603
   6 585,605
   5 613,691
   5 293,629
   4 774,591
   4 717,591
   4 613,776
   4 538,629
   4 485,629
   3 702,641
   3 608,629
   3 593,580
   3 591,676

It turns out most of the slow requests were waiting for osd 598 and 629; I
ran the same procedure on another host, osd22, and got the same pattern.

Then I turned to the host that has osd598 and dumped the perf counters for
comparison.

-bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
/var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
op_latency,subop_latency,total_ops
0.192097526753471,0.0344513450167198,7549045
1.99137797628122,1.42198426157216,9184472
0.198062399664129,0.0387090378926376,6305973
0.621697271315762,0.396549768986993,9726679
29.5222496247375,18.246379615, 10860858
0.229250239525916,0.0557482067611005,8149691
0.208981698303654,0.0375553180438224,6623842
0.47474766302086,0.292583928601509,9838777
0.339477790083925,0.101288409388438,9340212
0.186448840141895,0.0327296517417626,7081410
0.807598201207144,0.0139762289702332,6093531
(osd 598 is op hotspot as well)

This further confirmed that osd 598 was having performance issues
(it has around *30 seconds* average op latency!).
sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
latency difference is not as significant as what we saw from osd perf.
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
37.3    459.9    89.8    4106.9    61.8    1.6    12.2    4.9
42.3    545.8    91.8    4296.3    69.7    2.4    17.6    5.2
42.0    483.8    93.1    4263.6    68.8    1.8    13.3    5.1
39.7    425.5    89.4    4327.0    68.5    1.8    14.0    5.3

Another disk at the same time for comparison (/dev/sdb).
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
34.2    502.6    80.1    3524.3    53.4    1.3    11.8    4.7
35.3    560.9    83.7    3742.0    56.0    1.2    9.8     4.7
30.4    371.5    78.8    3631.4    52.2    1.7    15.8    4.8
33.0    389.4    78.8    3597.6    54.2    1.4    12.1    4.8

Any idea why a couple of OSDs are so slow that they impact the performance
of the entire cluster?


You may want to use the dump_historic_ops command in the admin socket 
for the slow OSDs.  That will give you some clues regarding where the 
ops are hanging up in the OSD.  You can also crank the osd debugging way 
up on that node and search through the logs to see if there are any 
patterns or trends (consistent slowness, pauses, etc).  It may also be 
useful to look and see if that OSD is pegging CPU and if so attach 
strace or perf to it and see what it's doing.
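
For example (OSD id and socket path are placeholders; revert the debug levels
when you are done, since they are very chatty):

# slowest recent ops on the suspect OSD, with per-step timestamps
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops

# temporarily crank up debugging on just that OSD
ceph tell osd.598 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'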


Normally in this situation I'd say to be wary of the disk going bad, but 
in this case it may be something else.




Thanks,
Guang


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time

2013-12-30 Thread Wido den Hollander

On 12/30/2013 12:45 PM, Guang wrote:

Hi ceph-users and ceph-devel,
Merry Christmas and Happy New Year!

We have a ceph cluster with radosgw, our customer is using S3 API to
access the cluster.

The basic information of the cluster is:
bash-4.1$ ceph -s
   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 3 mons at
{osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
election epoch 40, quorum 0,1,2 osd151,osd152,osd153
osdmap e129885: 787 osds: 758 up, 758 in
 pgmap v1884502: 22203 pgs: 22125 active+clean, 1
active+clean+scrubbing, 1 active+clean+inconsistent, 76
active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
TB avail
mdsmap e1: 0/0/1 up

#When the latency peak happened, there was no scrubbing, recovering or
backfilling at the moment.#

The performance of the cluster (write traffic only) was stable until Dec
25th, when our monitoring of the radosgw access log showed a significant
increase in both average latency and 99th-percentile latency.

I then picked one OSD host, grepped its logs for slow requests, and found
that most of the slow requests were waiting for subops. Take osd22 for
example.

osd[561-571] are hosted by osd22.
-bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
~/slow_osd.txt
-bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort -nr
3586 656,598
 289 467,629
 284 598,763
 279 584,598
 203 172,598
 182 598,6
 155 629,646
  83 631,598
  65 631,593
  21 616,629
  20 609,671
  20 609,390
  13 609,254
  12 702,629
  12 629,641
  11 665,613
  11 593,724
  11 361,591
  10 591,709
   9 681,609
   9 609,595
   9 591,772
   8 613,662
   8 575,591
   7 674,722
   7 609,603
   6 585,605
   5 613,691
   5 293,629
   4 774,591
   4 717,591
   4 613,776
   4 538,629
   4 485,629
   3 702,641
   3 608,629
   3 593,580
   3 591,676

It turns out most of the slow requests were waiting for osd 598 and 629; I
ran the same procedure on another host, osd22, and got the same pattern.

Then I turned to the host that has osd598 and dumped the perf counters for
comparison.

-bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
/var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
op_latency,subop_latency,total_ops
0.192097526753471,0.0344513450167198,7549045
1.99137797628122,1.42198426157216,9184472
0.198062399664129,0.0387090378926376,6305973
0.621697271315762,0.396549768986993,9726679
29.5222496247375,18.246379615, 10860858
0.229250239525916,0.0557482067611005,8149691
0.208981698303654,0.0375553180438224,6623842
0.47474766302086,0.292583928601509,9838777
0.339477790083925,0.101288409388438,9340212
0.186448840141895,0.0327296517417626,7081410
0.807598201207144,0.0139762289702332,6093531
(osd 598 is op hotspot as well)

This further confirmed that osd 598 was having performance issues
(it has around *30 seconds* average op latency!).
sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the
latency difference is not as significant as what we saw from osd perf.
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
37.3    459.9    89.8    4106.9    61.8    1.6    12.2    4.9
42.3    545.8    91.8    4296.3    69.7    2.4    17.6    5.2
42.0    483.8    93.1    4263.6    68.8    1.8    13.3    5.1
39.7    425.5    89.4    4327.0    68.5    1.8    14.0    5.3

Another disk at the same time for comparison (/dev/sdb).
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
34.2    502.6    80.1    3524.3    53.4    1.3    11.8    4.7
35.3    560.9    83.7    3742.0    56.0    1.2    9.8     4.7
30.4    371.5    78.8    3631.4    52.2    1.7    15.8    4.8
33.0    389.4    78.8    3597.6    54.2    1.4    12.1    4.8

Any idea why a couple of OSDs are so slow that they impact the performance
of the entire cluster?



What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.


Wido


Thanks,
Guang


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-08 Thread Mark Nelson

On 11/08/2013 12:59 PM, Gruher, Joseph R wrote:

-Original Message-
From: Dinu Vlad [mailto:dinuvla...@gmail.com]
Sent: Thursday, November 07, 2013 10:37 AM
To: ja...@peacon.co.uk; Gruher, Joseph R; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster performance

I was under the same impression - using a small portion of the SSD via
partitioning (in my case - 30 gigs out of 240) would have the same effect as
activating the HPA explicitly.

Am I wrong?


I pinged a guy on the SSD team here at Intel and he confirmed - if you have a 
new drive (or a freshly secure-erased drive) and you only use a subset of the 
capacity (such as by creating a small partition), you effectively get the same 
benefits as overprovisioning the hidden area of the drive (or underprovisioning 
the available capacity, if you prefer to look at it that way).  It's really all 
about maintaining a larger area of cells where the SSD knows it does not have 
to preserve the data, one way or the other.


That was my understanding as well, but it's great to have confirmation 
from Intel!  Thanks Joseph!


Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-08 Thread Gruher, Joseph R
>-Original Message-
>From: Dinu Vlad [mailto:dinuvla...@gmail.com]
>Sent: Thursday, November 07, 2013 10:37 AM
>To: ja...@peacon.co.uk; Gruher, Joseph R; ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] ceph cluster performance
>
>I was under the same impression - using a small portion of the SSD via
>partitioning (in my case - 30 gigs out of 240) would have the same effect as
>activating the HPA explicitly.
>
>Am I wrong?

I pinged a guy on the SSD team here at Intel and he confirmed - if you have a 
new drive (or a freshly secure-erased drive) and you only use a subset of the 
capacity (such as by creating a small partition), you effectively get the same 
benefits as overprovisioning the hidden area of the drive (or underprovisioning 
the available capacity, if you prefer to look at it that way).  It's really all 
about maintaining a larger area of cells where the SSD knows it does not have 
to preserve the data, one way or the other.
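
A sketch of both approaches on a fresh or secure-erased drive - the device name
and sector count are placeholders, and hdparm -N behaviour varies by drive
firmware, so check your vendor's documentation before using it:

# option 1: only partition a small slice and leave the rest of the drive untouched
sudo parted -s /dev/sdX mklabel gpt mkpart journal 0% 10%

# option 2: shrink the advertised capacity itself (HPA); the leading 'p' makes it persistent
sudo hdparm -N p351651888 /dev/sdX
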
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Dinu Vlad
I have 2 SSDs (same model, smaller capacity) for / connected on the mainboard. 
Their sync write performance is also poor - less than 600 iops, 4k blocks. 
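
For reference, the kind of sync-write test I mean - destructive if pointed at a
raw device, so adjust --filename carefully; a minimal fio sketch:

fio --name=journal-sync --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based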

On Nov 7, 2013, at 9:44 PM, Kyle Bader  wrote:

>> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.
> 
> The problem might be SATA transport protocol overhead at the expander.
> Have you tried directly connecting the SSDs to SATA2/3 ports on the
> mainboard?
> 
> -- 
> 
> Kyle
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Kyle Bader
> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.

The problem might be SATA transport protocol overhead at the expander.
Have you tried directly connecting the SSDs to SATA2/3 ports on the
mainboard?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Dinu Vlad
I was under the same impression - using a small portion of the SSD via 
partitioning (in my case - 30 gigs out of 240) would have the same effect as 
activating the HPA explicitly. 

Am I wrong? 


On Nov 7, 2013, at 8:16 PM, ja...@peacon.co.uk wrote:

> On 2013-11-07 17:47, Gruher, Joseph R wrote:
> 
>> I wonder how effective trim would be on a Ceph journal area.
>> If the journal empties and is then trimmed the next write cycle should
>> be faster, but if the journal is active all the time the benefits
>> would be lost almost immediately, as those cells are going to receive
>> data again almost immediately and go back to an "untrimmed" state
>> until the next trim occurs.
> 
> If it's under-provisioned (so the device knows there are unused cells), the 
> device would simply write to an empty cell and flag the old cell for erasing, 
> so there should be no change.  Latency would rise when sustained write rate 
> exceeded the devices' ability to clear cells, so eventually the stock of 
> ready cells would be depleted.
> 
> FWIW, I think there is considerable mileage in the larger consumer-grade 
> argument.  Assuming drives will be half the price in a year's time, selecting 
> devices that can last only a year is preferable to spending 3x the price on 
> one that can survive three.  That does, though, open the tin of worms that is 
> SMART reporting and moving journals at some future point.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread james

On 2013-11-07 17:47, Gruher, Joseph R wrote:


I wonder how effective trim would be on a Ceph journal area.
If the journal empties and is then trimmed the next write cycle should
be faster, but if the journal is active all the time the benefits
would be lost almost immediately, as those cells are going to receive
data again almost immediately and go back to an "untrimmed" state
until the next trim occurs.


If it's under-provisioned (so the device knows there are unused cells), 
the device would simply write to an empty cell and flag the old cell for 
erasing, so there should be no change.  Latency would rise when 
sustained write rate exceeded the devices' ability to clear cells, so 
eventually the stock of ready cells would be depleted.


FWIW, I think there is considerable mileage in the larger consumer-grade 
argument.  Assuming drives will be half the price in a year's time, selecting 
devices that can last only a year is preferable to spending 3x the price on one 
that can survive three.  That does, though, open the tin of worms that is SMART 
reporting and moving journals at some future point.
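
On the SMART side, wear is usually visible without vendor tooling, though the
attribute names differ per vendor (Media_Wearout_Indicator, for instance, is
Intel-specific); a quick sketch:

sudo smartctl -A /dev/sdX | egrep -i 'wear|percent'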

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Mark Nelson

On 11/07/2013 11:47 AM, Gruher, Joseph R wrote:

-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
boun...@lists.ceph.com] On Behalf Of Dinu Vlad
Sent: Thursday, November 07, 2013 3:30 AM
To: ja...@peacon.co.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster performance



In this case however, the SSDs were only used for journals and I don't know if
ceph-osd sends TRIM to the drive in the process of journaling over a block
device. They were also under-subscribed, with just 3 x 10G partitions out of
240 GB raw capacity. I did a manual trim, but it hasn't changed anything.


If your SSD capacity is well in excess of your journal capacity requirements you could 
consider overprovisioning the SSD.  Overprovisioning should increase SSD performance and 
lifetime.  This achieves the same effect as trim to some degree (lets the SSD better 
understand what cells have real data and which can be treated as free).  I wonder how 
effective trim would be on a Ceph journal area.  If the journal empties and is then 
trimmed the next write cycle should be faster, but if the journal is active all the time 
the benefits would be lost almost immediately, as those cells are going to receive data 
again almost immediately and go back to an "untrimmed" state until the next 
trim occurs.


over-provisioning is definitely something to consider, especially if you 
aren't buying SSDs with high write endurance.  The more cells you can 
spread the load out over the better.  We've had some interesting 
conversations on here in the past about whether or not it's more cost 
effective to buy large capacity consumer grade SSDs with more cells or 
shell out for smaller capacity enterprise grade drives.  My personal 
opinion is that it's worth paying a bit extra for a drive that employs 
something like MLC-HET, but there's a lot of "enterprise" grade drives 
out there with low write endurance that you really have to watch out 
for.  If you are going to pay extra, at least get something with high 
write endurance and reasonable write speeds.


Mark



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Gruher, Joseph R
>-Original Message-
>From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>boun...@lists.ceph.com] On Behalf Of Dinu Vlad
>Sent: Thursday, November 07, 2013 3:30 AM
>To: ja...@peacon.co.uk; ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] ceph cluster performance

>In this case however, the SSDs were only used for journals and I don't know if
>ceph-osd sends TRIM to the drive in the process of journaling over a block
>device. They were also under-subscribed, with just 3 x 10G partitions out of
>240 GB raw capacity. I did a manual trim, but it hasn't changed anything.

If your SSD capacity is well in excess of your journal capacity requirements 
you could consider overprovisioning the SSD.  Overprovisioning should increase 
SSD performance and lifetime.  This achieves the same effect as trim to some 
degree (lets the SSD better understand what cells have real data and which can 
be treated as free).  I wonder how effective trim would be on a Ceph journal 
area.  If the journal empties and is then trimmed the next write cycle should 
be faster, but if the journal is active all the time the benefits would be lost 
almost immediately, as those cells are going to receive data again almost 
immediately and go back to an "untrimmed" state until the next trim occurs.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Dinu Vlad
I had great results from the older 530 series too. 

In this case however, the SSDs were only used for journals and I don't know if 
ceph-osd sends TRIM to the drive in the process of journaling over a block 
device. They were also under-subscribed, with just 3 x 10G partitions out of 
240 GB raw capacity. I did a manual trim, but it hasn't changed anything.  
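
For context, a manual trim is typically one of the following - fstrim needs a
mounted filesystem and blkdiscard destroys data on the range it is given, so
treat the paths as placeholders:

sudo fstrim -v /var/lib/ceph/osd/ceph-0    # per mounted OSD filesystem
sudo blkdiscard /dev/sdX9                  # an unused/spare partition only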

I'm still having fun with the configuration so I'll be able to use Mike 
Dawson's suggested tools to check for latencies. 

On Nov 6, 2013, at 11:35 PM, ja...@peacon.co.uk wrote:

> On 2013-11-06 20:25, Mike Dawson wrote:
> 
>>   We just fixed a performance issue on our cluster related to spikes of high 
>> latency on some of our SSDs used for osd journals. In our case, the slow 
>> SSDs showed spikes of 100x higher latency than expected.
> 
> 
> Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd, 
> since the pool of ready erased cells is quickly depleted under steady write 
> workload, so it has to wait for cells to charge to accommodate the write.
> 
> The Intel 3700 SSDs look to have some of the best consistency ratings of any 
> of the more reasonably priced drives at the moment, and good IOPS too:
> 
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
> 
> Obviously the quoted IOPS numbers are dependent on quite a deep queue mind.
> 
> There is a big range of performance in the market currently; some Enterprise 
> SSDs are quoted at just 4,000 IOPS yet cost as many pounds!
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mark Nelson

On 11/06/2013 03:35 PM, ja...@peacon.co.uk wrote:

On 2013-11-06 20:25, Mike Dawson wrote:


   We just fixed a performance issue on our cluster related to spikes
of high latency on some of our SSDs used for osd journals. In our
case, the slow SSDs showed spikes of 100x higher latency than expected.



Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd,
since the pool of ready erased cells is quickly depleted under steady
write workload, so it has to wait for cells to charge to accommodate the
write.

The Intel 3700 SSDs look to have some of the best consistency ratings of
any of the more reasonably priced drives at the moment, and good IOPS too:

http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html


Obviously the quoted IOPS numbers are dependent on quite a deep queue mind.

There is a big range of performance in the market currently; some
Enterprise SSDs are quoted at just 4,000 IOPS yet cost as many pounds!


Most vendors won't give you DC S3700s by default, but if you put your 
foot down most of them seem to have SKUs for them lurking around 
somewhere.  Right now they are the first drive I recommend for journals, 
though I believe some of the other vendors may have some interesting 
options in the future too.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-06 Thread james

On 2013-11-06 20:25, Mike Dawson wrote:

   We just fixed a performance issue on our cluster related to spikes 
of high latency on some of our SSDs used for osd journals. In our case, 
the slow SSDs showed spikes of 100x higher latency than expected.



Many SSDs show this behaviour when 100% provisioned and/or never 
TRIM'd, since the pool of ready erased cells is quickly depleted under 
steady write workload, so it has to wait for cells to charge to 
accommodate the write.


The Intel 3700 SSDs look to have some of the best consistency ratings 
of any of the more reasonably priced drives at the moment, and good IOPS 
too:


http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html

Obviously the quoted IOPS numbers are dependent on quite a deep queue 
mind.


There is a big range of performance in the market currently; some 
Enterprise SSDs are quoted at just 4,000 IOPS yet cost as many pounds!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mike Dawson
No, in our case flashing the firmware to the latest release cured the 
problem.


If you build a new cluster with the slow SSDs, I'd be interested in the 
results of ioping[0] or fsync-tester[1]. I theorize that you may see 
spikes of high latency.


[0] https://code.google.com/p/ioping/
[1] https://github.com/gregsfortytwo/fsync-tester
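
Typical invocations look something like this (paths are placeholders; check
each tool's README, as exact flags may differ by version):

ioping -c 20 /var/lib/ceph/osd/ceph-0      # request latency on the OSD filesystem
ioping -c 20 -D /var/lib/ceph/osd/ceph-0   # same, but with O_DIRECT

fsync-tester works similarly; see its README for invocation details.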

Thanks,
Mike Dawson


On 11/6/2013 4:18 PM, Dinu Vlad wrote:

ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.

By "fixed" - you mean replaced the SSDs?

Thanks,
Dinu

On Nov 6, 2013, at 10:25 PM, Mike Dawson  wrote:


We just fixed a performance issue on our cluster related to spikes of high 
latency on some of our SSDs used for osd journals. In our case, the slow SSDs 
showed spikes of 100x higher latency than expected.

What SSDs were you using that were so slow?

Cheers,
Mike

On 11/6/2013 12:39 PM, Dinu Vlad wrote:

I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended?

Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander) is doing the same test with ~8k iops. I guess I'm good for replacing 
them.

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).
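
The shape of that test, roughly (pool name, runtime and thread count are
placeholders; rados bench object names include the hostname and pid, so the
eight processes do not collide):

for i in $(seq 1 8); do
  rados -p testpool bench 60 write -t 16 &
done
wait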


On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:


Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is and in what circumstances, but 
the Nexenta folks have seen problems with ZFS on solaris and it's not 
impossible linux may suffer too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you are doing tests and look at disk throughput with something like "collectl 
-sD -oT"  do the writes look balanced across the spinning disks?  Do any devices 
have much really high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once you identify a slow 
duration, you can go back and in an editor search for the slow duration and look at where 
in the OSD it hung up.  That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 
36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a 
SAS9207-8i controller and were pushing significantly faster throughput than you 
are seeing (even given the greater number of drives).  So it's very interesting 
to me that you are pushing so much less.  The 36 drive supermicro chassis I 
have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a 
bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestion, where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterpris

Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Dinu Vlad
ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. 

By "fixed" - you mean replaced the SSDs?  

Thanks,
Dinu

On Nov 6, 2013, at 10:25 PM, Mike Dawson  wrote:

> We just fixed a performance issue on our cluster related to spikes of high 
> latency on some of our SSDs used for osd journals. In our case, the slow SSDs 
> showed spikes of 100x higher latency than expected.
> 
> What SSDs were you using that were so slow?
> 
> Cheers,
> Mike
> 
> On 11/6/2013 12:39 PM, Dinu Vlad wrote:
>> I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
>> kernel recommended?
>> 
>> Meanwhile, I think I might have identified the culprit - my SSD drives are 
>> extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By 
>> comparison, an Intel 530 in another server (also installed behind a SAS 
>> expander) is doing the same test with ~8k iops. I guess I'm good for 
>> replacing them.
>> 
>> Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
>> throughput under the same conditions (only mechanical drives, journal on a 
>> separate partition on each one, 8 rados bench processes, 16 threads each).
>> 
>> 
>> On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:
>> 
>>> Ok, some more thoughts:
>>> 
>>> 1) What kernel are you using?
>>> 
>>> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
>>> effects.  We don't really know how bad this is and in what circumstances, 
>>> but the Nexenta folks have seen problems with ZFS on solaris and it's not 
>>> impossible linux may suffer too:
>>> 
>>> http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
>>> 
>>> 3) If you are doing tests and look at disk throughput with something like 
>>> "collectl -sD -oT"  do the writes look balanced across the spinning disks?  
>>> Do any devices have much really high service times or queue times?
>>> 
>>> 4) Also, after the test is done, you can try:
>>> 
>>> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
>>> dump_historic_ops \; > foo
>>> 
>>> and then grep for "duration" in foo.  You'll get a list of the slowest 
>>> operations over the last 10 minutes from every osd on the node.  Once you 
>>> identify a slow duration, you can go back and in an editor search for the 
>>> slow duration and look at where in the OSD it hung up.  That might tell us 
>>> more about slow/latent operations.
>>> 
>>> 5) Something interesting here is that I've heard from another party that in 
>>> a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 
>>> SSDs on a SAS9207-8i controller and were pushing significantly faster 
>>> throughput than you are seeing (even given the greater number of drives).  
>>> So it's very interesting to me that you are pushing so much less.  The 36 
>>> drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs 
>>> can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no 
>>> replication).
>>> 
>>> Mark
>>> 
>>> On 11/05/2013 05:15 AM, Dinu Vlad wrote:
 Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* 
 ceph settings I was able to get 440 MB/s from 8 rados bench instances, 
 over a single osd node (pool pg_num = 1800, size = 1)
 
 This still looks awfully slow to me - fio throughput across all disks 
 reaches 2.8 GB/s!!
 
 I'd appreciate any suggestion, where to look for the issue. Thanks!
 
 
 On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:
 
> 
> I tested the osd performance from a single node. For this purpose I 
> deployed a new cluster (using ceph-deploy, as before) and on 
> fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the 
> rados bench both on the osd server and on a remote one. Cluster 
> configuration stayed "default", with the same additions about xfs mount & 
> mkfs.xfs as before.
> 
> With a single host, the pgs were "stuck unclean" (active only, not 
> active+clean):
> 
> # ceph -s
>  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>   health HEALTH_WARN 1800 pgs stuck unclean
>   monmap e1: 3 mons at 
> {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
>  election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>   osdmap e101: 18 osds: 18 up, 18 in
>pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 
> GB / 16759 GB avail
>   mdsmap e1: 0/0/1 up
> 
> 
> Test results:
> Local test, 1 process, 16 threads: 241.7 MB/s
> Local test, 8 processes, 128 threads: 374.8 MB/s
> Remote test, 1 process, 16 threads: 231.8 MB/s
> Remote test, 8 processes, 128 threads: 366.1 MB/s
> 
> Maybe it's just me, but it seems on the low side too.
> 
> Thanks,
> Dinu
> 
> 
> On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:
> 
>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>

Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mike Dawson
We just fixed a performance issue on our cluster related to spikes of 
high latency on some of our SSDs used for osd journals. In our case, the 
slow SSDs showed spikes of 100x higher latency than expected.


What SSDs were you using that were so slow?

Cheers,
Mike

On 11/6/2013 12:39 PM, Dinu Vlad wrote:

I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended?

Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander) is doing the same test with ~8k iops. I guess I'm good for replacing 
them.

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).


On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:


Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is and in what circumstances, but 
the Nexenta folks have seen problems with ZFS on solaris and it's not 
impossible linux may suffer too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you are doing tests and look at disk throughput with something like "collectl 
-sD -oT"  do the writes look balanced across the spinning disks?  Do any devices 
have much really high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once you identify a slow 
duration, you can go back and in an editor search for the slow duration and look at where 
in the OSD it hung up.  That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 
36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a 
SAS9207-8i controller and were pushing significantly faster throughput than you 
are seeing (even given the greater number of drives).  So it's very interesting 
to me that you are pushing so much less.  The 36 drive supermicro chassis I 
have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a 
bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestion, where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots in the front and 2 in the back, and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to 
what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, m

Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Mark Nelson

On 11/06/2013 11:39 AM, Dinu Vlad wrote:

I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended?


I've been using the 3.8 kernel in the precise repo effectively, so I 
suspect it should be ok.




Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 500-600 IOPS max with a 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander) does the same test at ~8k IOPS. I guess I'll be replacing them.


Very interesting!



Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).


Ok, so you went from like 300MB/s to ~600MB/s by removing the SSDs and 
just using spinners?  That's pretty crazy!  In any event, 600MB/s from 
18 disks with the journals on the same spindles works out to roughly 
66MB/s of writes per disk once you count the journal.  That's not 
particularly great, but if it's on the 9207-8i with no cache it might be 
about right, since journal and filesystem writes will be contending with 
each other.  I'd be curious what you'd see with DC S3700s for journals.





On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:


Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is or in what circumstances, but 
the Nexenta folks have seen problems with ZFS on Solaris, and it's not 
impossible that Linux suffers too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you watch disk throughput during the tests with something like "collectl 
-sD -oT", do the writes look balanced across the spinning disks?  Do any devices 
have particularly high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once you identify a slow 
duration, you can go back and in an editor search for the slow duration and look at where 
in the OSD it hung up.  That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 
36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a 
SAS9207-8i controller and were pushing significantly faster throughput than you 
are seeing (even given the greater number of drives).  So it's very interesting 
to me that you are pushing so much less.  The 36 drive supermicro chassis I 
have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a 
bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestions on where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots in the front and 2 in the back

Re: [ceph-users] ceph cluster performance

2013-11-06 Thread Dinu Vlad
I'm using the latest 3.8.0 branch from raring. Is there a more recent/better 
kernel recommended? 

Meanwhile, I think I might have identified the culprit - my SSD drives are 
extremely slow on sync writes, doing 500-600 IOPS max with a 4k blocksize. By 
comparison, an Intel 530 in another server (also installed behind a SAS 
expander) does the same test at ~8k IOPS. I guess I'll be replacing them.
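For anyone wanting to reproduce this, a fio run along these lines is enough to 
show the gap between plain direct writes and sync writes on a journal device 
(the device name is a placeholder, and the test will overwrite data on it):

# single-threaded 4k O_DIRECT + O_SYNC writes, similar to journal traffic
fio --name=journal-sync-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

The --sync=1/--iodepth=1 combination roughly mimics the flush-heavy pattern the 
OSD journal puts on these drives, which is why the numbers are so much lower 
than the plain direct-write results.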

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s 
throughput under the same conditions (only mechanical drives, journal on a 
separate partition on each one, 8 rados bench processes, 16 threads each).  


On Nov 5, 2013, at 4:38 PM, Mark Nelson  wrote:

> Ok, some more thoughts:
> 
> 1) What kernel are you using?
> 
> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
> effects.  We don't really know how bad this is or in what circumstances, but 
> the Nexenta folks have seen problems with ZFS on Solaris, and it's not 
> impossible that Linux suffers too:
> 
> http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
> 
> 3) If you watch disk throughput during the tests with something like 
> "collectl -sD -oT", do the writes look balanced across the spinning disks?  
> Do any devices have particularly high service times or queue times?
> 
> 4) Also, after the test is done, you can try:
> 
> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
> dump_historic_ops \; > foo
> 
> and then grep for "duration" in foo.  You'll get a list of the slowest 
> operations over the last 10 minutes from every osd on the node.  Once you 
> identify a slow duration, you can go back and in an editor search for the 
> slow duration and look at where in the OSD it hung up.  That might tell us 
> more about slow/latent operations.
> 
> 5) Something interesting here is that I've heard from another party that in a 
> 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on 
> a SAS9207-8i controller and were pushing significantly faster throughput than 
> you are seeing (even given the greater number of drives).  So it's very 
> interesting to me that you are pushing so much less.  The 36 drive supermicro 
> chassis I have with no expanders and 30 drives with 6 SSDs can push about 
> 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication).
> 
> Mark
> 
> On 11/05/2013 05:15 AM, Dinu Vlad wrote:
>> Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* 
>> ceph settings I was able to get 440 MB/s from 8 rados bench instances, over 
>> a single osd node (pool pg_num = 1800, size = 1)
>> 
>> This still looks awfully slow to me - fio throughput across all disks 
>> reaches 2.8 GB/s!!
>> 
>> I'd appreciate any suggestions on where to look for the issue. Thanks!
>> 
>> 
>> On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:
>> 
>>> 
>>> I tested the osd performance from a single node. For this purpose I 
>>> deployed a new cluster (using ceph-deploy, as before) and on 
>>> fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the 
>>> rados bench both on the osd server and on a remote one. Cluster 
>>> configuration stayed "default", with the same additions about xfs mount & 
>>> mkfs.xfs as before.
>>> 
>>> With a single host, the pgs were "stuck unclean" (active only, not 
>>> active+clean):
>>> 
>>> # ceph -s
>>>  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>>>   health HEALTH_WARN 1800 pgs stuck unclean
>>>   monmap e1: 3 mons at 
>>> {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
>>>  election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>>>   osdmap e101: 18 osds: 18 up, 18 in
>>>pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB 
>>> / 16759 GB avail
>>>   mdsmap e1: 0/0/1 up
>>> 
>>> 
>>> Test results:
>>> Local test, 1 process, 16 threads: 241.7 MB/s
>>> Local test, 8 processes, 128 threads: 374.8 MB/s
>>> Remote test, 1 process, 16 threads: 231.8 MB/s
>>> Remote test, 8 processes, 128 threads: 366.1 MB/s
>>> 
>>> Maybe it's just me, but it seems on the low side too.
>>> 
>>> Thanks,
>>> Dinu
>>> 
>>> 
>>> On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:
>>> 
 On 10/30/2013 01:51 PM, Dinu Vlad wrote:
> Mark,
> 
> The SSDs are 
> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
>  and the HDDs are 
> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
> 
> The chassis is a "SiliconMechanics C602" - but I don't have the exact 
> model. It's based on Supermicro, has 24 slots in the front and 2 in the back and 
> a SAS expander.
> 
> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out 
> according to what the driver reports in dmesg). Here are the results 
> (filtered):
> 
> Sequential:
> Run status group 0 (all jobs):
>

Re: [ceph-users] ceph cluster performance

2013-11-05 Thread Mark Nelson

Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad 
effects.  We don't really know how bad this is or in what 
circumstances, but the Nexenta folks have seen problems with ZFS on 
Solaris, and it's not impossible that Linux suffers too:


http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you watch disk throughput during the tests with something 
like "collectl -sD -oT", do the writes look balanced across the spinning 
disks?  Do any devices have particularly high service times or queue times?
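Something as simple as running it on the OSD node while the bench is going, e.g.

collectl -sD -oT -i 2

and eyeballing the per-disk write throughput and wait/service-time columns 
should make any outlier drive obvious.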


4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} 
dump_historic_ops \; > foo


and then grep for "duration" in foo.  You'll get a list of the slowest 
operations over the last 10 minutes from every osd on the node.  Once 
you identify a slow duration, you can go back and in an editor search 
for the slow duration and look at where in the OSD it hung up.  That 
might tell us more about slow/latent operations.
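For example, something like this (just a sketch - the exact JSON layout of 
dump_historic_ops can vary between releases) will pull out the twenty worst 
durations:

# collect historic ops from every OSD on the node, then rank by duration
find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo
grep '"duration"' foo | awk -F: '{gsub(/[ ,]/, "", $2); print $2}' | sort -rn | head -20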


5) Something interesting here is that I've heard from another party that 
in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 
6 SSDs on a SAS9207-8i controller and were pushing significantly faster 
throughput than you are seeing (even given the greater number of 
drives).  So it's very interesting to me that you are pushing so much 
less.  The 36 drive supermicro chassis I have with no expanders and 30 
drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i 
controllers and XFS (no replication).


Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestions on where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:



I tested the osd performance from a single node. For this purpose I deployed a new cluster 
(using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 
1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster 
configuration stayed "default", with the same additions about xfs mount & 
mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:


On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots in the front and 2 in the back, and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to 
what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
mint=60444msec, maxt=61463msec

Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 
MB/s


Ok, that looks like what I'd expect to see given the controller being used.  
SSDs are probably limited by total aggregate throughput.



Random:
Run status group 0 (all jobs):
  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
mint=60404msec, maxt=61875msec

Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 
6 doing 101)

This is on just one of the osd servers.


Were the ceph tests to one OSD server or across all servers?  It might be 
worth trying tests against a single server with no replication using multiple 
rados bench instances and just seeing what happens.



Thanks,
Dinu


On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:


On 10/30/2013 09:05 AM, Dinu Vlad wrote:

Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd pool create bench1 2048 2048
# ceph osd 

Re: [ceph-users] ceph cluster performance

2013-11-05 Thread Dinu Vlad
Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph 
settings I was able to get 440 MB/s from 8 rados bench instances, over a single 
osd node (pool pg_num = 1800, size = 1) 
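To give an idea of the kind of knobs involved, the changes were along these 
lines (values shown are only an example of the type of settings, not 
necessarily the exact ones used):

# per OSD data/journal disk
echo deadline > /sys/block/sdb/queue/scheduler

# ceph.conf, [osd] section
filestore wbthrottle xfs ios start flusher = 10000
filestore wbthrottle xfs ios hard limit = 100000
filestore wbthrottle xfs bytes start flusher = 419430400
filestore wbthrottle xfs bytes hard limit = 4194304000
filestore wbthrottle xfs inodes start flusher = 10000
filestore wbthrottle xfs inodes hard limit = 100000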

This still looks awfully slow to me - fio throughput across all disks reaches 
2.8 GB/s!!

I'd appreciate any suggestions on where to look for the issue. Thanks!


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:

> 
> I tested the osd performance from a single node. For this purpose I deployed 
> a new cluster (using ceph-deploy, as before) and on fresh/repartitioned 
> drives. I created a single pool, 1800 pgs. I ran the rados bench both on the 
> osd server and on a remote one. Cluster configuration stayed "default", with 
> the same additions about xfs mount & mkfs.xfs as before. 
> 
> With a single host, the pgs were "stuck unclean" (active only, not 
> active+clean):
> 
> # ceph -s
>  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>   health HEALTH_WARN 1800 pgs stuck unclean
>   monmap e1: 3 mons at 
> {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
>  election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>   osdmap e101: 18 osds: 18 up, 18 in
>pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
> 16759 GB avail
>   mdsmap e1: 0/0/1 up
> 
> 
> Test results: 
> Local test, 1 process, 16 threads: 241.7 MB/s
> Local test, 8 processes, 128 threads: 374.8 MB/s
> Remote test, 1 process, 16 threads: 231.8 MB/s
> Remote test, 8 processes, 128 threads: 366.1 MB/s
> 
> Maybe it's just me, but it seems on the low side too. 
> 
> Thanks,
> Dinu
> 
> 
> On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:
> 
>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>>> Mark,
>>> 
>>> The SSDs are 
>>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
>>>  and the HDDs are 
>>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>>> 
>>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
>>> It's based on Supermicro, has 24 slots in the front and 2 in the back and a SAS 
>>> expander.
>>> 
>>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according 
>>> to what the driver reports in dmesg). Here are the results (filtered):
>>> 
>>> Sequential:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
>>> mint=60444msec, maxt=61463msec
>>> 
>>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 
>>> 153:189 MB/s
>> 
>> Ok, that looks like what I'd expect to see given the controller being used.  
>> SSDs are probably limited by total aggregate throughput.
>> 
>>> 
>>> Random:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
>>> mint=60404msec, maxt=61875msec
>>> 
>>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one 
>>> out of 6 doing 101)
>>> 
>>> This is on just one of the osd servers.
>> 
>> Were the ceph tests to one OSD server or across all servers?  It might be 
>> worth trying tests against a single server with no replication using 
>> multiple rados bench instances and just seeing what happens.
>> 
>>> 
>>> Thanks,
>>> Dinu
>>> 
>>> 
>>> On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:
>>> 
 On 10/30/2013 09:05 AM, Dinu Vlad wrote:
> Hello,
> 
> I've been doing some tests on a newly installed ceph cluster:
> 
> # ceph osd pool create bench1 2048 2048
> # ceph osd pool create bench2 2048 2048
> # rbd -p bench1 create test
> # rbd -p bench1 bench-write test --io-pattern rand
> elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36
> 
> # rados -p bench2 bench 300 write --show-time
> # (run 1)
> Total writes made:  20665
> Write size: 4194304
> Bandwidth (MB/sec): 274.923
> 
> Stddev Bandwidth:   96.3316
> Max bandwidth (MB/sec): 748
> Min bandwidth (MB/sec): 0
> Average Latency:0.23273
> Stddev Latency: 0.262043
> Max latency:1.69475
> Min latency:0.057293
> 
> These results seem to be quite poor for the configuration:
> 
> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
> controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
> journal, attached to a LSI 9207-8i controller.
> All servers have dual 10GE network cards, connected to a pair of 
> dedicated switches. Each SSD has 3 10 GB partitions for journals.
 
 Agreed, you should see much higher throughput with that kind of storage 
 setup.  What brand/model SSDs are these?  Also, what brand and model of 
 chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication 
 though) with

Re: [ceph-users] ceph cluster performance

2013-11-02 Thread Dinu Vlad
Any other options or ideas? 

Thanks,
Dinu 


On Oct 31, 2013, at 6:35 PM, Dinu Vlad  wrote:

> 
> I tested the osd performance from a single node. For this purpose I deployed 
> a new cluster (using ceph-deploy, as before) and on fresh/repartitioned 
> drives. I created a single pool, 1800 pgs. I ran the rados bench both on the 
> osd server and on a remote one. Cluster configuration stayed "default", with 
> the same additions about xfs mount & mkfs.xfs as before. 
> 
> With a single host, the pgs were "stuck unclean" (active only, not 
> active+clean):
> 
> # ceph -s
>  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>   health HEALTH_WARN 1800 pgs stuck unclean
>   monmap e1: 3 mons at 
> {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
>  election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>   osdmap e101: 18 osds: 18 up, 18 in
>pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
> 16759 GB avail
>   mdsmap e1: 0/0/1 up
> 
> 
> Test results: 
> Local test, 1 process, 16 threads: 241.7 MB/s
> Local test, 8 processes, 128 threads: 374.8 MB/s
> Remote test, 1 process, 16 threads: 231.8 MB/s
> Remote test, 8 processes, 128 threads: 366.1 MB/s
> 
> Maybe it's just me, but it seems on the low side too. 
> 
> Thanks,
> Dinu
> 
> 
> On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:
> 
>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>>> Mark,
>>> 
>>> The SSDs are 
>>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
>>>  and the HDDs are 
>>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>>> 
>>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
>>> It's based on Supermicro, has 24 slots in the front and 2 in the back and a SAS 
>>> expander.
>>> 
>>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according 
>>> to what the driver reports in dmesg). Here are the results (filtered):
>>> 
>>> Sequential:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
>>> mint=60444msec, maxt=61463msec
>>> 
>>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 
>>> 153:189 MB/s
>> 
>> Ok, that looks like what I'd expect to see given the controller being used.  
>> SSDs are probably limited by total aggregate throughput.
>> 
>>> 
>>> Random:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
>>> mint=60404msec, maxt=61875msec
>>> 
>>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one 
>>> out of 6 doing 101)
>>> 
>>> This is on just one of the osd servers.
>> 
>> Were the ceph tests to one OSD server or across all servers?  It might be 
>> worth trying tests against a single server with no replication using 
>> multiple rados bench instances and just seeing what happens.
>> 
>>> 
>>> Thanks,
>>> Dinu
>>> 
>>> 
>>> On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:
>>> 
 On 10/30/2013 09:05 AM, Dinu Vlad wrote:
> Hello,
> 
> I've been doing some tests on a newly installed ceph cluster:
> 
> # ceph osd pool create bench1 2048 2048
> # ceph osd pool create bench2 2048 2048
> # rbd -p bench1 create test
> # rbd -p bench1 bench-write test --io-pattern rand
> elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36
> 
> # rados -p bench2 bench 300 write --show-time
> # (run 1)
> Total writes made:  20665
> Write size: 4194304
> Bandwidth (MB/sec): 274.923
> 
> Stddev Bandwidth:   96.3316
> Max bandwidth (MB/sec): 748
> Min bandwidth (MB/sec): 0
> Average Latency:0.23273
> Stddev Latency: 0.262043
> Max latency:1.69475
> Min latency:0.057293
> 
> These results seem to be quite poor for the configuration:
> 
> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
> controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
> journal, attached to a LSI 9207-8i controller.
> All servers have dual 10GE network cards, connected to a pair of 
> dedicated switches. Each SSD has 3 10 GB partitions for journals.
 
 Agreed, you should see much higher throughput with that kind of storage 
 setup.  What brand/model SSDs are these?  Also, what brand and model of 
 chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication 
 though) with a couple of concurrent rados bench processes going on our 
 SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs 
 is definitely on the low side.
 
 I'm actually not too familiar with what the RBD benchmarking commands are 
 doing behind the scenes.  Typically I've tested fio 

Re: [ceph-users] ceph cluster performance

2013-10-31 Thread Dinu Vlad

I tested the osd performance from a single node. For this purpose I deployed a 
new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I 
created a single pool, 1800 pgs. I ran the rados bench both on the osd server 
and on a remote one. Cluster configuration stayed "default", with the same 
additions about xfs mount & mkfs.xfs as before. 

With a single host, the pgs were "stuck unclean" (active only, not 
active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at 
{cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0},
 election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 
16759 GB avail
   mdsmap e1: 0/0/1 up


Test results: 
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too. 

Thanks,
Dinu


On Oct 30, 2013, at 8:59 PM, Mark Nelson  wrote:

> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>> Mark,
>> 
>> The SSDs are 
>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
>>  and the HDDs are 
>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>> 
>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
>> It's based on Supermicro, has 24 slots in the front and 2 in the back and a SAS 
>> expander.
>> 
>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according 
>> to what the driver reports in dmesg). Here are the results (filtered):
>> 
>> Sequential:
>> Run status group 0 (all jobs):
>>   WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
>> mint=60444msec, maxt=61463msec
>> 
>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 
>> 153:189 MB/s
> 
> Ok, that looks like what I'd expect to see given the controller being used.  
> SSDs are probably limited by total aggregate throughput.
> 
>> 
>> Random:
>> Run status group 0 (all jobs):
>>   WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
>> mint=60404msec, maxt=61875msec
>> 
>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out 
>> of 6 doing 101)
>> 
>> This is on just one of the osd servers.
> 
> Were the ceph tests to one OSD server or across all servers?  It might be 
> worth trying tests against a single server with no replication using multiple 
> rados bench instances and just seeing what happens.
> 
>> 
>> Thanks,
>> Dinu
>> 
>> 
>> On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:
>> 
>>> On 10/30/2013 09:05 AM, Dinu Vlad wrote:
 Hello,
 
 I've been doing some tests on a newly installed ceph cluster:
 
 # ceph osd pool create bench1 2048 2048
 # ceph osd pool create bench2 2048 2048
 # rbd -p bench1 create test
 # rbd -p bench1 bench-write test --io-pattern rand
 elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36
 
 # rados -p bench2 bench 300 write --show-time
 # (run 1)
 Total writes made:  20665
 Write size: 4194304
 Bandwidth (MB/sec): 274.923
 
 Stddev Bandwidth:   96.3316
 Max bandwidth (MB/sec): 748
 Min bandwidth (MB/sec): 0
 Average Latency:0.23273
 Stddev Latency: 0.262043
 Max latency:1.69475
 Min latency:0.057293
 
 These results seem to be quite poor for the configuration:
 
 MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
 OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
 controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
 journal, attached to a LSI 9207-8i controller.
 All servers have dual 10GE network cards, connected to a pair of dedicated 
 switches. Each SSD has 3 10 GB partitions for journals.
>>> 
>>> Agreed, you should see much higher throughput with that kind of storage 
>>> setup.  What brand/model SSDs are these?  Also, what brand and model of 
>>> chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication 
>>> though) with a couple of concurrent rados bench processes going on our 
>>> SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs 
>>> is definitely on the low side.
>>> 
>>> I'm actually not too familiar with what the RBD benchmarking commands are 
>>> doing behind the scenes.  Typically I've tested fio on top of a filesystem 
>>> on RBD.
>>> 
 
 Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was 
 installed using ceph-deploy. ceph.conf pretty much out of the box (diff 
 from default follows)
 
 osd_journa

Re: [ceph-users] ceph cluster performance

2013-10-30 Thread Mark Nelson

On 10/30/2013 01:51 PM, Dinu Vlad wrote:

Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots in the front and 2 in the back, and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to 
what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
   WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
mint=60444msec, maxt=61463msec

Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 
MB/s


Ok, that looks like what I'd expect to see given the controller being 
used.  SSDs are probably limited by total aggregate throughput.




Random:
Run status group 0 (all jobs):
   WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
mint=60404msec, maxt=61875msec

Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 
6 doing 101)

This is on just one of the osd servers.


Were the ceph tests to one OSD server or across all servers?  It might 
be worth trying tests against a single server with no replication using 
multiple rados bench instances and just seeing what happens.
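As a sketch, assuming a throwaway pool that only maps to that one host (e.g. a 
single-node test cluster), something like this is what I mean - pool name and 
counts are arbitrary:

# size=1 pool so there's no replication traffic
ceph osd pool create bench-single 1800 1800
ceph osd pool set bench-single size 1

# 8 concurrent rados bench writers, 16 threads each
for i in $(seq 1 8); do
  rados -p bench-single bench 60 write -t 16 > /tmp/rados-bench.$i.log 2>&1 &
done
wait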




Thanks,
Dinu


On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:


On 10/30/2013 09:05 AM, Dinu Vlad wrote:

Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd pool create bench1 2048 2048
# ceph osd pool create bench2 2048 2048
# rbd -p bench1 create test
# rbd -p bench1 bench-write test --io-pattern rand
elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36

# rados -p bench2 bench 300 write --show-time
# (run 1)
Total writes made:  20665
Write size: 4194304
Bandwidth (MB/sec): 274.923

Stddev Bandwidth:   96.3316
Max bandwidth (MB/sec): 748
Min bandwidth (MB/sec): 0
Average Latency:0.23273
Stddev Latency: 0.262043
Max latency:1.69475
Min latency:0.057293

These results seem to be quite poor for the configuration:

MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
journal, attached to a LSI 9207-8i controller.
All servers have dual 10GE network cards, connected to a pair of dedicated 
switches. Each SSD has 3 10 GB partitions for journals.


Agreed, you should see much higher throughput with that kind of storage setup.  
What brand/model SSDs are these?  Also, what brand and model of chassis?  With 
24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple 
of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s 
aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.

I'm actually not too familiar with what the RBD benchmarking commands are doing 
behind the scenes.  Typically I've tested fio on top of a filesystem on RBD.



Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed 
using ceph-deploy. ceph.conf pretty much out of the box (diff from default 
follows)

osd_journal_size = 10240
osd mount options xfs = "rw,noatime,nobarrier,inode64"
osd mkfs options xfs = "-f -i size=2048"

[osd]
public network = 10.4.0.0/24
cluster network = 10.254.254.0/24

All tests were run from a server outside the cluster, connected to the storage 
network with 2x 10 GE nics.

I've done a few other tests of the individual components:
- network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
- md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
- fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS


What you might want to try is doing 4M direct IO writes using libaio with a high 
iodepth to all drives (spinning disks and SSDs) concurrently, and seeing what 
the per-drive and aggregate throughput look like.

With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph 
writes (1.5GB/s if you don't count journal writes), but perhaps there is 
something interesting about the way the hardware is setup on your system.



I'd appreciate any suggestion that might help improve the performance or 
identify a bottleneck.

Thanks
Dinu




Re: [ceph-users] ceph cluster performance

2013-10-30 Thread Dinu Vlad
Mark,

The SSDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021
 and the HDDs are 
http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
 

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. 
It's based on Supermicro, has 24 slots in the front and 2 in the back, and a SAS 
expander. 

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to 
what the driver reports in dmesg). Here are the results (filtered): 

Sequential: 
Run status group 0 (all jobs):
  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, 
mint=60444msec, maxt=61463msec

Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 
MB/s 

Random: 
Run status group 0 (all jobs):
  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, 
mint=60404msec, maxt=61875msec

Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 
6 doing 101)

This is on just one of the osd servers.

Thanks,
Dinu


On Oct 30, 2013, at 6:38 PM, Mark Nelson  wrote:

> On 10/30/2013 09:05 AM, Dinu Vlad wrote:
>> Hello,
>> 
>> I've been doing some tests on a newly installed ceph cluster:
>> 
>> # ceph osd pool create bench1 2048 2048
>> # ceph osd pool create bench2 2048 2048
>> # rbd -p bench1 create test
>> # rbd -p bench1 bench-write test --io-pattern rand
>> elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36
>> 
>> # rados -p bench2 bench 300 write --show-time
>> # (run 1)
>> Total writes made:  20665
>> Write size: 4194304
>> Bandwidth (MB/sec): 274.923
>> 
>> Stddev Bandwidth:   96.3316
>> Max bandwidth (MB/sec): 748
>> Min bandwidth (MB/sec): 0
>> Average Latency:0.23273
>> Stddev Latency: 0.262043
>> Max latency:1.69475
>> Min latency:0.057293
>> 
>> These results seem to be quite poor for the configuration:
>> 
>> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
>> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
>> controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
>> journal, attached to a LSI 9207-8i controller.
>> All servers have dual 10GE network cards, connected to a pair of dedicated 
>> switches. Each SSD has 3 10 GB partitions for journals.
> 
> Agreed, you should see much higher throughput with that kind of storage 
> setup.  What brand/model SSDs are these?  Also, what brand and model of 
> chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication 
> though) with a couple of concurrent rados bench processes going on our SC847A 
> chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is 
> definitely on the low side.
> 
> I'm actually not too familiar with what the RBD benchmarking commands are 
> doing behind the scenes.  Typically I've tested fio on top of a filesystem on 
> RBD.
> 
>> 
>> Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was 
>> installed using ceph-deploy. ceph.conf pretty much out of the box (diff from 
>> default follows)
>> 
>> osd_journal_size = 10240
>> osd mount options xfs = "rw,noatime,nobarrier,inode64"
>> osd mkfs options xfs = "-f -i size=2048"
>> 
>> [osd]
>> public network = 10.4.0.0/24
>> cluster network = 10.254.254.0/24
>> 
>> All tests were run from a server outside the cluster, connected to the 
>> storage network with 2x 10 GE nics.
>> 
>> I've done a few other tests of the individual components:
>> - network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
>> - md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
>> - fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS
> 
> What you might want to try is doing 4M direct IO writes using libaio with a 
> high iodepth to all drives (spinning disks and SSDs) concurrently, and seeing 
> what the per-drive and aggregate throughput look like.
> 
> With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with 
> Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is 
> something interesting about the way the hardware is setup on your system.
> 
>> 
>> I'd appreciate any suggestion that might help improve the performance or 
>> identify a bottleneck.
>> 
>> Thanks
>> Dinu
>> 
>> 
>> 


Re: [ceph-users] ceph cluster performance

2013-10-30 Thread Mark Nelson

On 10/30/2013 09:05 AM, Dinu Vlad wrote:

Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd pool create bench1 2048 2048
# ceph osd pool create bench2 2048 2048
# rbd -p bench1 create test
# rbd -p bench1 bench-write test --io-pattern rand
elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36

# rados -p bench2 bench 300 write --show-time
# (run 1)
Total writes made:  20665
Write size: 4194304
Bandwidth (MB/sec): 274.923

Stddev Bandwidth:   96.3316
Max bandwidth (MB/sec): 748
Min bandwidth (MB/sec): 0
Average Latency:0.23273
Stddev Latency: 0.262043
Max latency:1.69475
Min latency:0.057293

These results seem to be quite poor for the configuration:

MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board 
controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for 
journal, attached to a LSI 9207-8i controller.
All servers have dual 10GE network cards, connected to a pair of dedicated 
switches. Each SSD has 3 10 GB partitions for journals.


Agreed, you should see much higher throughput with that kind of storage 
setup.  What brand/model SSDs are these?  Also, what brand and model of 
chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication 
though) with a couple of concurrent rados bench processes going on our 
SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 
SSDs is definitely on the low side.


I'm actually not too familiar with what the RBD benchmarking commands 
are doing behind the scenes.  Typically I've tested fio on top of a 
filesystem on RBD.




Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed 
using ceph-deploy. ceph.conf pretty much out of the box (diff from default 
follows)

osd_journal_size = 10240
osd mount options xfs = "rw,noatime,nobarrier,inode64"
osd mkfs options xfs = "-f -i size=2048"

[osd]
public network = 10.4.0.0/24
cluster network = 10.254.254.0/24

All tests were run from a server outside the cluster, connected to the storage 
network with 2x 10 GE nics.

I've done a few other tests of the individual components:
- network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
- md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
- fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS


What you might want to try is doing 4M direct IO writes using libaio with 
a high iodepth to all drives (spinning disks and SSDs) concurrently, and 
seeing what the per-drive and aggregate throughput look like.
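A fio job file along these lines is one way to do it (device names are 
placeholders - add one section per spinner and per journal SSD, and note that 
writing to the raw devices is destructive):

# alldisks.fio - run with: fio alldisks.fio
[global]
ioengine=libaio
direct=1
rw=write
bs=4M
iodepth=16
runtime=60
time_based

[sdb]
filename=/dev/sdb

[sdc]
filename=/dev/sdc

# ...and so on for the rest of the drives

Each job section reports its own throughput, and the "Run status group" summary 
at the end gives the aggregate.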


With just SSDs, I've been able to push the 9207-8i up to around 3GB/s 
with Ceph writes (1.5GB/s if you don't count journal writes), but 
perhaps there is something interesting about the way the hardware is 
setup on your system.




I'd appreciate any suggestion that might help improve the performance or 
identify a bottleneck.

Thanks
Dinu






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com