Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks all for the help. We finally identified the root cause of the issue was due to a lock contention happening at folder splitting and here is a tracking ticket (thanks Inktank for the fix!): http://tracker.ceph.com/issues/7207 Thanks, Guang On Tuesday, December 31, 2013 8:22 AM, Guang Yang wrote: Thanks Wido, my comments inline... >Date: Mon, 30 Dec 2013 14:04:35 +0100 >From: Wido den Hollander >To: ceph-users@lists.ceph.com >Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) > after running some time >On 12/30/2013 12:45 PM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. 
> > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615,10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. > reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 > 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 > > Any idea why a couple of OSDs are so slow that impact the performance of > the entire cluster?
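The ~/do_calc_op_latency.pl helper used above is not included in the thread. A minimal stand-in, assuming the admin-socket perf dump exposes op_latency and subop_latency as {avgcount, sum} pairs under the "osd" section (field names may differ between Ceph releases), could look like this:

for i in {594..604}; do
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | python -c '
import json, sys
osd = json.load(sys.stdin)["osd"]
def avg(c):
    # average latency = accumulated seconds / number of samples
    return c["sum"] / c["avgcount"] if c["avgcount"] else 0.0
op, sub = osd["op_latency"], osd["subop_latency"]
print("%s,%s,%d" % (avg(op), avg(sub), op["avgcount"]))'
done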
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Mark, my comments inline... Date: Mon, 30 Dec 2013 07:36:56 -0600 From: Mark Nelson To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time On 12/30/2013 05:45 AM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. > > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615, 10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. 
> reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 > 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 > > Any idea why a couple of OSDs are so slow that impact the performance of > the entire cluster? You may want to use the dump_historic_ops command in the admin socket for the slow OSDs. That will give you some clues regarding where the ops are hanging up in the OSD. You can also crank the osd debugging way up on that node and search through the logs to see if there are any patterns or trends (consistent slowness, pauses, etc). It may also be useful to look and see if that OSD is pegging CPU and if so attach strace or perf to it and see what it's doing. Normally in this situation I'd say to be wary of the disk going bad, but in this case it may be something else.
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Wido, my comments inline... >Date: Mon, 30 Dec 2013 14:04:35 +0100 >From: Wido den Hollander >To: ceph-users@lists.ceph.com >Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) > after running some time >On 12/30/2013 12:45 PM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. > > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615, 10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. 
> reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 > 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 > > Any idea why a couple of OSDs are so slow that impact the performance of > the entire cluster? > What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a performance degradation over time. So if you run btrfs, that might be the problem. [yguang] We are running on xfs, journal and data share the same disk with different partitions. Wido > Thanks,___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
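For reference, one quick way to confirm that kind of layout on an OSD host (the paths below are the typical ceph-deploy defaults and the osd id is just an example):

df -T /var/lib/ceph/osd/ceph-598             # shows xfs vs. btrfs for the data partition
ls -l /var/lib/ceph/osd/ceph-598/journal     # journal symlink, pointing at its own partition on the same disk here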
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
On 12/30/2013 05:45 AM, Guang wrote: Hi ceph-users and ceph-devel, Merry Christmas and Happy New Year! We have a ceph cluster with radosgw, our customer is using S3 API to access the cluster. The basic information of the cluster is: bash-4.1$ ceph -s cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 40, quorum 0,1,2 osd151,osd152,osd153 osdmap e129885: 787 osds: 758 up, 758 in pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 TB avail mdsmap e1: 0/0/1 up #When the latency peak happened, there was no scrubbing, recovering or backfilling at the moment.# While the performance of the cluster (only with WRITE traffic) is stable until Dec 25th, our monitoring (for radosgw access log) shows a significant increase of average latency and 99% latency. And then I chose one OSD and try to grep slow requests logs and find that most of the slow requests were waiting for subop, I take osd22 for example. osd[561-571] are hosted by osd22. -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > ~/slow_osd.txt -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort –nr 3586 656,598 289 467,629 284 598,763 279 584,598 203 172,598 182 598,6 155 629,646 83 631,598 65 631,593 21 616,629 20 609,671 20 609,390 13 609,254 12 702,629 12 629,641 11 665,613 11 593,724 11 361,591 10 591,709 9 681,609 9 609,595 9 591,772 8 613,662 8 575,591 7 674,722 7 609,603 6 585,605 5 613,691 5 293,629 4 774,591 4 717,591 4 613,776 4 538,629 4 485,629 3 702,641 3 608,629 3 593,580 3 591,676 It turns out most of the slow requests were waiting for osd 598, 629, I ran the procedure on another host osd22 and got the same pattern. Then I turned to the host having osd598 and dump the perf counter to do comparision. -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done op_latency,subop_latency,total_ops 0.192097526753471,0.0344513450167198,7549045 1.99137797628122,1.42198426157216,9184472 0.198062399664129,0.0387090378926376,6305973 0.621697271315762,0.396549768986993,9726679 29.5222496247375,18.246379615, 10860858 0.229250239525916,0.0557482067611005,8149691 0.208981698303654,0.0375553180438224,6623842 0.47474766302086,0.292583928601509,9838777 0.339477790083925,0.101288409388438,9340212 0.186448840141895,0.0327296517417626,7081410 0.807598201207144,0.0139762289702332,6093531 (osd 598 is op hotspot as well) This double confirmed that osd 598 was having some performance issues (it has around *30 seconds average op latency*!). sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the latency difference is not as significant as we saw from osd perf. reads kbread writes kbwrite %busy avgqu await svctm 37.3459.989.8 4106.9 61.8 1.6 12.24.9 42.3545.891.8 4296.3 69.7 2.4 17.65.2 42.0483.893.1 4263.6 68.8 1.8 13.35.1 39.7425.589.4 4327.0 68.5 1.8 14.05.3 Another disk at the same time for comparison (/dev/sdb). 
reads kbread writes kbwrite %busy avgqu await svctm 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 Any idea why a couple of OSDs are so slow that impact the performance of the entire cluster? You may want to use the dump_historic_ops command in the admin socket for the slow OSDs. That will give you some clues regarding where the ops are hanging up in the OSD. You can also crank the osd debugging way up on that node and search through the logs to see if there are any patterns or trends (consistent slowness, pauses, etc). It may also be useful to look and see if that OSD is pegging CPU and if so attach strace or perf to it and see what it's doing. Normally in this situation I'd say to be wary of the disk going bad, but in this case it may be something else. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
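As a concrete illustration of the suggestions above (a sketch only; the osd id and socket path are the ones from this thread, and the debug levels should be turned back down afterwards):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops > /tmp/osd598.ops
grep duration /tmp/osd598.ops                # look for the largest values, then inspect those ops' event timestamps
ceph tell osd.598 injectargs '--debug-osd 20 --debug-filestore 20'   # crank logging temporarily
ceph tell osd.598 injectargs '--debug-osd 0/5 --debug-filestore 0/5' # restore defaults when done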
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
On 12/30/2013 12:45 PM, Guang wrote: Hi ceph-users and ceph-devel, Merry Christmas and Happy New Year! We have a ceph cluster with radosgw, our customer is using S3 API to access the cluster. The basic information of the cluster is: bash-4.1$ ceph -s cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 40, quorum 0,1,2 osd151,osd152,osd153 osdmap e129885: 787 osds: 758 up, 758 in pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 TB avail mdsmap e1: 0/0/1 up #When the latency peak happened, there was no scrubbing, recovering or backfilling at the moment.# While the performance of the cluster (only with WRITE traffic) is stable until Dec 25th, our monitoring (for radosgw access log) shows a significant increase of average latency and 99% latency. And then I chose one OSD and try to grep slow requests logs and find that most of the slow requests were waiting for subop, I take osd22 for example. osd[561-571] are hosted by osd22. -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > ~/slow_osd.txt -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort –nr 3586 656,598 289 467,629 284 598,763 279 584,598 203 172,598 182 598,6 155 629,646 83 631,598 65 631,593 21 616,629 20 609,671 20 609,390 13 609,254 12 702,629 12 629,641 11 665,613 11 593,724 11 361,591 10 591,709 9 681,609 9 609,595 9 591,772 8 613,662 8 575,591 7 674,722 7 609,603 6 585,605 5 613,691 5 293,629 4 774,591 4 717,591 4 613,776 4 538,629 4 485,629 3 702,641 3 608,629 3 593,580 3 591,676 It turns out most of the slow requests were waiting for osd 598, 629, I ran the procedure on another host osd22 and got the same pattern. Then I turned to the host having osd598 and dump the perf counter to do comparision. -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done op_latency,subop_latency,total_ops 0.192097526753471,0.0344513450167198,7549045 1.99137797628122,1.42198426157216,9184472 0.198062399664129,0.0387090378926376,6305973 0.621697271315762,0.396549768986993,9726679 29.5222496247375,18.246379615, 10860858 0.229250239525916,0.0557482067611005,8149691 0.208981698303654,0.0375553180438224,6623842 0.47474766302086,0.292583928601509,9838777 0.339477790083925,0.101288409388438,9340212 0.186448840141895,0.0327296517417626,7081410 0.807598201207144,0.0139762289702332,6093531 (osd 598 is op hotspot as well) This double confirmed that osd 598 was having some performance issues (it has around *30 seconds average op latency*!). sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the latency difference is not as significant as we saw from osd perf. reads kbread writes kbwrite %busy avgqu await svctm 37.3459.989.8 4106.9 61.8 1.6 12.24.9 42.3545.891.8 4296.3 69.7 2.4 17.65.2 42.0483.893.1 4263.6 68.8 1.8 13.35.1 39.7425.589.4 4327.0 68.5 1.8 14.05.3 Another disk at the same time for comparison (/dev/sdb). 
reads kbread writes kbwrite %busy avgqu await svctm 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 Any idea why a couple of OSDs are so slow that impact the performance of the entire cluster? What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a performance degradation over time. So if you run btrfs, that might be the problem. Wido Thanks, Guang -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
On 11/08/2013 12:59 PM, Gruher, Joseph R wrote: -Original Message- From: Dinu Vlad [mailto:dinuvla...@gmail.com] Sent: Thursday, November 07, 2013 10:37 AM To: ja...@peacon.co.uk; Gruher, Joseph R; ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph cluster performance I was under the same impression - using a small portion of the SSD via partitioning (in my case - 30 gigs out of 240) would have the same effect as activating the HPA explicitly. Am I wrong? I pinged a guy on the SSD team here at Intel and he confirmed - if you have a new drive (or freshly secure erased drive) and you only use the subset of the capacity (such as by creating a small partition) you effectively get the same benefits as overprovisioning the hidden area of the drive (or underprovisioning the available capacity if you prefer to look at it that way). It's really all about maintaining a larger area of cells where the SSDs knows it does not have to preserve the data, one way or the other. That was my understanding as well, but it's great to have confirmation from Intel! Thanks Joseph! Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
>-----Original Message----- >From: Dinu Vlad [mailto:dinuvla...@gmail.com] >Sent: Thursday, November 07, 2013 10:37 AM >To: ja...@peacon.co.uk; Gruher, Joseph R; ceph-users@lists.ceph.com >Subject: Re: [ceph-users] ceph cluster performance > >I was under the same impression - using a small portion of the SSD via >partitioning (in my case - 30 gigs out of 240) would have the same effect as >activating the HPA explicitly. > >Am I wrong? I pinged a guy on the SSD team here at Intel and he confirmed - if you have a new drive (or a freshly secure-erased drive) and you only use a subset of the capacity (such as by creating a small partition) you effectively get the same benefits as overprovisioning the hidden area of the drive (or underprovisioning the available capacity, if you prefer to look at it that way). It's really all about maintaining a larger area of cells where the SSD knows it does not have to preserve the data, one way or the other. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
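A rough sketch of preparing a drive that way (the device name is an example, both commands destroy data, and a fresh or secure-erased drive that supports discard is assumed):

sudo blkdiscard /dev/sdf                          # discard every LBA so the controller knows all cells are free
sudo sgdisk -n 1:0:+30G -c 1:journal /dev/sdf     # carve out only 30G; the unallocated remainder acts as spare area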
Re: [ceph-users] ceph cluster performance
I have 2 SSDs (same model, smaller capacity) for / connected on the mainboard. Their sync write performance is also poor - less than 600 iops, 4k blocks. On Nov 7, 2013, at 9:44 PM, Kyle Bader wrote: >> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. > > The problem might be SATA transport protocol overhead at the expander. > Have you tried directly connecting the SSDs to SATA2/3 ports on the > mainboard? > > -- > > Kyle > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
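For comparison, the 4k sync-write figure above can be reproduced with an fio job along these lines (illustrative only; point --filename at a scratch file on the SSD's filesystem rather than the raw device):

fio --name=ssd-sync-4k --filename=/mnt/ssd/fio.test --size=1G \
    --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --numjobs=1 \
    --runtime=30 --time_based --group_reporting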
Re: [ceph-users] ceph cluster performance
> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. The problem might be SATA transport protocol overhead at the expander. Have you tried directly connecting the SSDs to SATA2/3 ports on the mainboard? -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
I was under the same impression - using a small portion of the SSD via partitioning (in my case - 30 gigs out of 240) would have the same effect as activating the HPA explicitly. Am I wrong? On Nov 7, 2013, at 8:16 PM, ja...@peacon.co.uk wrote: > On 2013-11-07 17:47, Gruher, Joseph R wrote: > >> I wonder how effective trim would be on a Ceph journal area. >> If the journal empties and is then trimmed the next write cycle should >> be faster, but if the journal is active all the time the benefits >> would be lost almost immediately, as those cells are going to receive >> data again almost immediately and go back to an "untrimmed" state >> until the next trim occurs. > > If it's under-provisioned (so the device knows there are unused cells), the > device would simply write to an empty cell and flag the old cell for erasing, > so there should be no change. Latency would rise when sustained write rate > exceeded the devices' ability to clear cells, so eventually the stock of > ready cells would be depleted. > > FWIW, I think there is considerable mileage in the larger-consumer grade > argument. Assuming drives will be half the price in a years time, so > selecting devices that can last only a year is preferable to spending 3x the > price on one that can survive three. That though opens the tin of worms that > is SMART reporting and moving journals at some future point mind. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
On 2013-11-07 17:47, Gruher, Joseph R wrote: I wonder how effective trim would be on a Ceph journal area. If the journal empties and is then trimmed the next write cycle should be faster, but if the journal is active all the time the benefits would be lost almost immediately, as those cells are going to receive data again almost immediately and go back to an "untrimmed" state until the next trim occurs. If it's under-provisioned (so the device knows there are unused cells), the device would simply write to an empty cell and flag the old cell for erasing, so there should be no change. Latency would only rise when the sustained write rate exceeded the device's ability to clear cells, since eventually the stock of ready cells would be depleted. FWIW, I think there is considerable mileage in the larger consumer-grade argument. Assuming drives will be half the price in a year's time, selecting devices that can last only a year is preferable to spending 3x the price on one that can survive three. That, though, opens the can of worms of SMART reporting and moving journals at some future point. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
On 11/07/2013 11:47 AM, Gruher, Joseph R wrote: -Original Message- From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users- boun...@lists.ceph.com] On Behalf Of Dinu Vlad Sent: Thursday, November 07, 2013 3:30 AM To: ja...@peacon.co.uk; ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph cluster performance In this case however, the SSDs were only used for journals and I don't know if ceph-osd sends TRIM to the drive in the process of journaling over a block device. They were also under-subscribed, with just 3 x 10G partitions out of 240 GB raw capacity. I did a manual trim, but it hasn't changed anything. If your SSD capacity is well in excess of your journal capacity requirements you could consider overprovisioning the SSD. Overprovisioning should increase SSD performance and lifetime. This achieves the same effect as trim to some degree (lets the SSD better understand what cells have real data and which can be treated as free). I wonder how effective trim would be on a Ceph journal area. If the journal empties and is then trimmed the next write cycle should be faster, but if the journal is active all the time the benefits would be lost almost immediately, as those cells are going to receive data again almost immediately and go back to an "untrimmed" state until the next trim occurs. over-provisioning is definitely something to consider, especially if you aren't buying SSDs with high write endurance. The more cells you can spread the load out over the better. We've had some interesting conversations on here in the past about whether or not it's more cost effective to buy large capacity consumer grade SSDs with more cells or shell out for smaller capacity enterprise grade drives. My personal opinion is that it's worth paying a bit extra for a drive that employs something like MLC-HET, but there's a lot of "enterprise" grade drives out there with low write endurance that you really have to watch out for. If you are going to pay extra, at least get something with high write endurance and reasonable write speeds. Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
>-Original Message- >From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users- >boun...@lists.ceph.com] On Behalf Of Dinu Vlad >Sent: Thursday, November 07, 2013 3:30 AM >To: ja...@peacon.co.uk; ceph-users@lists.ceph.com >Subject: Re: [ceph-users] ceph cluster performance >In this case however, the SSDs were only used for journals and I don't know if >ceph-osd sends TRIM to the drive in the process of journaling over a block >device. They were also under-subscribed, with just 3 x 10G partitions out of >240 GB raw capacity. I did a manual trim, but it hasn't changed anything. If your SSD capacity is well in excess of your journal capacity requirements you could consider overprovisioning the SSD. Overprovisioning should increase SSD performance and lifetime. This achieves the same effect as trim to some degree (lets the SSD better understand what cells have real data and which can be treated as free). I wonder how effective trim would be on a Ceph journal area. If the journal empties and is then trimmed the next write cycle should be faster, but if the journal is active all the time the benefits would be lost almost immediately, as those cells are going to receive data again almost immediately and go back to an "untrimmed" state until the next trim occurs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
I had great results from the older 530 series too. In this case however, the SSDs were only used for journals and I don't know if ceph-osd sends TRIM to the drive in the process of journaling over a block device. They were also under-subscribed, with just 3 x 10G partitions out of 240 GB raw capacity. I did a manual trim, but it hasn't changed anything. I'm still having fun with the configuration so I'll be able to use Mike Dawson's suggested tools to check for latencies. On Nov 6, 2013, at 11:35 PM, ja...@peacon.co.uk wrote: > On 2013-11-06 20:25, Mike Dawson wrote: > >> We just fixed a performance issue on our cluster related to spikes of high >> latency on some of our SSDs used for osd journals. In our case, the slow >> SSDs showed spikes of 100x higher latency than expected. > > > Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd, > since the pool of ready erased cells is quickly depleted under steady write > workload, so it has to wait for cells to charge to accommodate the write. > > The Intel 3700 SSDs look to have some of the best consistency ratings of any > of the more reasonably priced drives at the moment, and good IOPS too: > > http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html > > Obviously the quoted IOPS numbers are dependent on quite a deep queue mind. > > There is a big range of performance in the market currently; some Enterprise > SSDs are quoted at just 4,000 IOPS yet cost as many pounds! > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
On 11/06/2013 03:35 PM, ja...@peacon.co.uk wrote: On 2013-11-06 20:25, Mike Dawson wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd, since the pool of ready erased cells is quickly depleted under steady write workload, so it has to wait for cells to charge to accommodate the write. The Intel 3700 SSDs look to have some of the best consistency ratings of any of the more reasonably priced drives at the moment, and good IOPS too: http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html Obviously the quoted IOPS numbers are dependent on quite a deep queue mind. There is a big range of performance in the market currently; some Enterprise SSDs are quoted at just 4,000 IOPS yet cost as many pounds! Most vendors won't give you DC S3700s by default, but if you put your foot down most of them seem to have SKUs for them lurking around somewhere. Right now they are the first drive I recommend for journals, though I believe some of the other vendors may have some interesting options in the future too. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
On 2013-11-06 20:25, Mike Dawson wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd, since the pool of ready erased cells is quickly depleted under steady write workload, so it has to wait for cells to charge to accommodate the write. The Intel 3700 SSDs look to have some of the best consistency ratings of any of the more reasonably priced drives at the moment, and good IOPS too: http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html Obviously the quoted IOPS numbers are dependent on quite a deep queue mind. There is a big range of performance in the market currently; some Enterprise SSDs are quoted at just 4,000 IOPS yet cost as many pounds! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
No, in our case flashing the firmware to the latest release cured the problem. If you build a new cluster with the slow SSDs, I'd be interested in the results of ioping[0] or fsync-tester[1]. I theorize that you may see spikes of high latency. [0] https://code.google.com/p/ioping/ [1] https://github.com/gregsfortytwo/fsync-tester Thanks, Mike Dawson On 11/6/2013 4:18 PM, Dinu Vlad wrote: ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. By "fixed" - you mean replaced the SSDs? Thanks, Dinu On Nov 6, 2013, at 10:25 PM, Mike Dawson wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can some times have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT" do the writes look balanced across the spinning disks? Do any devices have much really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! 
On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before. With a single host, the pgs were "stuck unclean" (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterpris
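The two tools referenced above can be pointed at the devices in question roughly like this (paths and sample counts are illustrative; fsync-tester is assumed to be the usual single-file C program from the linked repository):

ioping -c 20 /var/lib/ceph/osd/ceph-0/     # per-request latency on an OSD data directory
ioping -c 20 /srv/ssd-journal/             # same probe against a filesystem on the journal SSD
gcc -o fsync-tester fsync-tester.c         # build, then run from a directory on the device under test
./fsync-tester                             # prints the time taken by each fsync; watch for spikes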
Re: [ceph-users] ceph cluster performance
ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. By "fixed" - you mean replaced the SSDs? Thanks, Dinu On Nov 6, 2013, at 10:25 PM, Mike Dawson wrote: > We just fixed a performance issue on our cluster related to spikes of high > latency on some of our SSDs used for osd journals. In our case, the slow SSDs > showed spikes of 100x higher latency than expected. > > What SSDs were you using that were so slow? > > Cheers, > Mike > > On 11/6/2013 12:39 PM, Dinu Vlad wrote: >> I'm using the latest 3.8.0 branch from raring. Is there a more recent/better >> kernel recommended? >> >> Meanwhile, I think I might have identified the culprit - my SSD drives are >> extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By >> comparison, an Intel 530 in another server (also installed behind a SAS >> expander is doing the same test with ~ 8k iops. I guess I'm good for >> replacing them. >> >> Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s >> throughput under the same conditions (only mechanical drives, journal on a >> separate partition on each one, 8 rados bench processes, 16 threads each). >> >> >> On Nov 5, 2013, at 4:38 PM, Mark Nelson wrote: >> >>> Ok, some more thoughts: >>> >>> 1) What kernel are you using? >>> >>> 2) Mixing SATA and SAS on an expander backplane can some times have bad >>> effects. We don't really know how bad this is and in what circumstances, >>> but the Nexenta folks have seen problems with ZFS on solaris and it's not >>> impossible linux may suffer too: >>> >>> http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html >>> >>> 3) If you are doing tests and look at disk throughput with something like >>> "collectl -sD -oT" do the writes look balanced across the spinning disks? >>> Do any devices have much really high service times or queue times? >>> >>> 4) Also, after the test is done, you can try: >>> >>> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} >>> dump_historic_ops \; > foo >>> >>> and then grep for "duration" in foo. You'll get a list of the slowest >>> operations over the last 10 minutes from every osd on the node. Once you >>> identify a slow duration, you can go back and in an editor search for the >>> slow duration and look at where in the OSD it hung up. That might tell us >>> more about slow/latent operations. >>> >>> 5) Something interesting here is that I've heard from another party that in >>> a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 >>> SSDs on a SAS9207-8i controller and were pushing significantly faster >>> throughput than you are seeing (even given the greater number of drives). >>> So it's very interesting to me that you are pushing so much less. The 36 >>> drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs >>> can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no >>> replication). >>> >>> Mark >>> >>> On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: > > I tested the osd performance from a single node. 
For this purpose I > deployed a new cluster (using ceph-deploy, as before) and on > fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the > rados bench both on the osd server and on a remote one. Cluster > configuration stayed "default", with the same additions about xfs mount & > mkfs.xfs as before. > > With a single host, the pgs were "stuck unclean" (active only, not > active+clean): > > # ceph -s > cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 > health HEALTH_WARN 1800 pgs stuck unclean > monmap e1: 3 mons at > {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, > election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 > osdmap e101: 18 osds: 18 up, 18 in >pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 > GB / 16759 GB avail > mdsmap e1: 0/0/1 up > > > Test results: > Local test, 1 process, 16 threads: 241.7 MB/s > Local test, 8 processes, 128 threads: 374.8 MB/s > Remote test, 1 process, 16 threads: 231.8 MB/s > Remote test, 8 processes, 128 threads: 366.1 MB/s > > Maybe it's just me, but it seems on the low side too. > > Thanks, > Dinu > > > On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: > >> On 10/30/2013 01:51 PM, Dinu Vlad wrote: >
Re: [ceph-users] ceph cluster performance
We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can some times have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT" do the writes look balanced across the spinning disks? Do any devices have much really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before. 
With a single host, the pgs were "stuck unclean" (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, m
Re: [ceph-users] ceph cluster performance
On 11/06/2013 11:39 AM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? I've been using the 3.8 kernel in the precise repo effectively, so I suspect it should be ok. Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Very interesting! Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). Ok, so you went from like 300MB/s to ~600MB/s by removing the SSDs and just using spinners? That's pretty crazy! In any event, 600MB/s from 18 disks with journal writes is like 66MB/s per disk. That's not particularly great, but if it's on the 9207-8i with no cache might be about right since journal and fs writes will be in more contention. I'd be curious what you'd see with DC S3700s for journals. On Nov 5, 2013, at 4:38 PM, Mark Nelson wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can some times have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT" do the writes look balanced across the spinning disks? Do any devices have much really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. 
Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before. With a single host, the pgs were "stuck unclean" (active only, not active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
   pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
   mdsmap e1: 0/0/1 up

Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu

On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote:
On 10/30/2013 01:51 PM, Dinu Vlad wrote:
Mark,

The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.
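For anyone wanting to reproduce the sync-write check Dinu describes at the top of this message, here is a minimal sketch with fio (it assumes fio is installed; /dev/sdX is a placeholder for the journal SSD, and the run overwrites whatever is on it, so only point it at an unused partition):

# 4k synchronous writes, queue depth 1, for 60 seconds
fio --name=journal-sync-test --filename=/dev/sdX --rw=write --bs=4k \
    --direct=1 --sync=1 --iodepth=1 --numjobs=1 --runtime=60 --time_based

The iops figure in the write line of the output is the number to compare: a journal-grade SSD (the DC S3700 class mentioned above) should sustain thousands to tens of thousands of 4k sync writes per second, while the ~500-600 reported here is the failure mode being described.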
Re: [ceph-users] ceph cluster performance
I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson wrote: > Ok, some more thoughts: > > 1) What kernel are you using? > > 2) Mixing SATA and SAS on an expander backplane can some times have bad > effects. We don't really know how bad this is and in what circumstances, but > the Nexenta folks have seen problems with ZFS on solaris and it's not > impossible linux may suffer too: > > http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html > > 3) If you are doing tests and look at disk throughput with something like > "collectl -sD -oT" do the writes look balanced across the spinning disks? > Do any devices have much really high service times or queue times? > > 4) Also, after the test is done, you can try: > > find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} > dump_historic_ops \; > foo > > and then grep for "duration" in foo. You'll get a list of the slowest > operations over the last 10 minutes from every osd on the node. Once you > identify a slow duration, you can go back and in an editor search for the > slow duration and look at where in the OSD it hung up. That might tell us > more about slow/latent operations. > > 5) Something interesting here is that I've heard from another party that in a > 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on > a SAS9207-8i controller and were pushing significantly faster throughput than > you are seeing (even given the greater number of drives). So it's very > interesting to me that you are pushing so much less. The 36 drive supermicro > chassis I have with no expanders and 30 drives with 6 SSDs can push about > 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). > > Mark > > On 11/05/2013 05:15 AM, Dinu Vlad wrote: >> Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* >> ceph settings I was able to get 440 MB/s from 8 rados bench instances, over >> a single osd node (pool pg_num = 1800, size = 1) >> >> This still looks awfully slow to me - fio throughput across all disks >> reaches 2.8 GB/s!! >> >> I'd appreciate any suggestion, where to look for the issue. Thanks! >> >> >> On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: >> >>> >>> I tested the osd performance from a single node. For this purpose I >>> deployed a new cluster (using ceph-deploy, as before) and on >>> fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the >>> rados bench both on the osd server and on a remote one. Cluster >>> configuration stayed "default", with the same additions about xfs mount & >>> mkfs.xfs as before. 
>>> >>> With a single host, the pgs were "stuck unclean" (active only, not >>> active+clean): >>> >>> # ceph -s >>> cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 >>> health HEALTH_WARN 1800 pgs stuck unclean >>> monmap e1: 3 mons at >>> {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, >>> election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 >>> osdmap e101: 18 osds: 18 up, 18 in >>>pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB >>> / 16759 GB avail >>> mdsmap e1: 0/0/1 up >>> >>> >>> Test results: >>> Local test, 1 process, 16 threads: 241.7 MB/s >>> Local test, 8 processes, 128 threads: 374.8 MB/s >>> Remote test, 1 process, 16 threads: 231.8 MB/s >>> Remote test, 8 processes, 128 threads: 366.1 MB/s >>> >>> Maybe it's just me, but it seems on the low side too. >>> >>> Thanks, >>> Dinu >>> >>> >>> On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: >>> On 10/30/2013 01:51 PM, Dinu Vlad wrote: > Mark, > > The SSDs are > http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 > and the HDDs are > http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. > > The chasis is a "SiliconMechanics C602" - but I don't have the exact > model. It's based on Supermicro, has 24 slots front and 2 in the back and > a SAS expander. > > I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out > according to what the driver reports in dmesg). here are the results > (filtered): > > Sequential: > Run status group 0 (all jobs): >
Re: [ceph-users] ceph cluster performance
Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on Solaris and it's not impossible Linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT", do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo

and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for that duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations.

5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive Supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:
Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1). This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion on where to look for the issue. Thanks!

On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote:
I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
   pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
   mdsmap e1: 0/0/1 up

Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.

Thanks,
Dinu

On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote:
On 10/30/2013 01:51 PM, Dinu Vlad wrote:
Mark,

The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec
Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s

Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput.

Random:
Run status group 0 (all jobs):
WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec
Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101)

This is on just one of the osd servers.

Were the ceph tests run against one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.

Thanks,
Dinu

On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote:
On 10/30/2013 09:05 AM, Dinu Vlad wrote:
Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd create bench1 2048 2048
# ceph osd
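As a rough illustration of point 4 above, one way to pull the worst offenders out of the dump (the exact JSON layout of dump_historic_ops differs a bit between releases, so the field extraction may need adjusting):

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo
# strip quotes/commas and print the 20 largest durations (in seconds)
grep duration foo | tr -d '",' | awk '{print $2}' | sort -rn | head -20

Each large value can then be searched for in foo to see the per-step timestamps of that op.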
Re: [ceph-users] ceph cluster performance
Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: > > I tested the osd performance from a single node. For this purpose I deployed > a new cluster (using ceph-deploy, as before) and on fresh/repartitioned > drives. I created a single pool, 1800 pgs. I ran the rados bench both on the > osd server and on a remote one. Cluster configuration stayed "default", with > the same additions about xfs mount & mkfs.xfs as before. > > With a single host, the pgs were "stuck unclean" (active only, not > active+clean): > > # ceph -s > cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 > health HEALTH_WARN 1800 pgs stuck unclean > monmap e1: 3 mons at > {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, > election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 > osdmap e101: 18 osds: 18 up, 18 in >pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / > 16759 GB avail > mdsmap e1: 0/0/1 up > > > Test results: > Local test, 1 process, 16 threads: 241.7 MB/s > Local test, 8 processes, 128 threads: 374.8 MB/s > Remote test, 1 process, 16 threads: 231.8 MB/s > Remote test, 8 processes, 128 threads: 366.1 MB/s > > Maybe it's just me, but it seems on the low side too. > > Thanks, > Dinu > > > On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: > >> On 10/30/2013 01:51 PM, Dinu Vlad wrote: >>> Mark, >>> >>> The SSDs are >>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 >>> and the HDDs are >>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. >>> >>> The chasis is a "SiliconMechanics C602" - but I don't have the exact model. >>> It's based on Supermicro, has 24 slots front and 2 in the back and a SAS >>> expander. >>> >>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according >>> to what the driver reports in dmesg). here are the results (filtered): >>> >>> Sequential: >>> Run status group 0 (all jobs): >>> WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, >>> mint=60444msec, maxt=61463msec >>> >>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave >>> 153:189 MB/s >> >> Ok, that looks like what I'd expect to see given the controller being used. >> SSDs are probably limited by total aggregate throughput. >> >>> >>> Random: >>> Run status group 0 (all jobs): >>> WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, >>> mint=60404msec, maxt=61875msec >>> >>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one >>> out of 6 doing 101) >>> >>> This is on just one of the osd servers. >> >> Where the ceph tests to one OSD server or across all servers? It might be >> worth trying tests against a single server with no replication using >> multiple rados bench instances and just seeing what happens. 
>> >>> >>> Thanks, >>> Dinu >>> >>> >>> On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote: >>> On 10/30/2013 09:05 AM, Dinu Vlad wrote: > Hello, > > I've been doing some tests on a newly installed ceph cluster: > > # ceph osd create bench1 2048 2048 > # ceph osd create bench2 2048 2048 > # rbd -p bench1 create test > # rbd -p bench1 bench-write test --io-pattern rand > elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 > > # rados -p bench2 bench 300 write --show-time > # (run 1) > Total writes made: 20665 > Write size: 4194304 > Bandwidth (MB/sec): 274.923 > > Stddev Bandwidth: 96.3316 > Max bandwidth (MB/sec): 748 > Min bandwidth (MB/sec): 0 > Average Latency:0.23273 > Stddev Latency: 0.262043 > Max latency:1.69475 > Min latency:0.057293 > > These results seem to be quite poor for the configuration: > > MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS > OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board > controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for > journal, attached to a LSI 9207-8i controller. > All servers have dual 10GE network cards, connected to a pair of > dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with
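For readers wondering what the "deadline scheduler and filestore_wbthrottle*" tuning mentioned at the top of this message looks like in practice, a sketch follows. The option names are from the filestore wbthrottle family; the values shown are only placeholders, since the thread doesn't say which ones Dinu actually used.

# switch a drive's elevator to deadline (repeat per OSD data disk and journal SSD)
echo deadline > /sys/block/sdb/queue/scheduler

# inspect the current writeback-throttle settings on a running osd
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep wbthrottle

# example ceph.conf overrides in the [osd] section -- illustrative values only
filestore wbthrottle xfs bytes start flusher = 41943040
filestore wbthrottle xfs ios start flusher = 500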
Re: [ceph-users] ceph cluster performance
Any other options or ideas? Thanks, Dinu On Oct 31, 2013, at 6:35 PM, Dinu Vlad wrote: > > I tested the osd performance from a single node. For this purpose I deployed > a new cluster (using ceph-deploy, as before) and on fresh/repartitioned > drives. I created a single pool, 1800 pgs. I ran the rados bench both on the > osd server and on a remote one. Cluster configuration stayed "default", with > the same additions about xfs mount & mkfs.xfs as before. > > With a single host, the pgs were "stuck unclean" (active only, not > active+clean): > > # ceph -s > cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 > health HEALTH_WARN 1800 pgs stuck unclean > monmap e1: 3 mons at > {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, > election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 > osdmap e101: 18 osds: 18 up, 18 in >pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / > 16759 GB avail > mdsmap e1: 0/0/1 up > > > Test results: > Local test, 1 process, 16 threads: 241.7 MB/s > Local test, 8 processes, 128 threads: 374.8 MB/s > Remote test, 1 process, 16 threads: 231.8 MB/s > Remote test, 8 processes, 128 threads: 366.1 MB/s > > Maybe it's just me, but it seems on the low side too. > > Thanks, > Dinu > > > On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: > >> On 10/30/2013 01:51 PM, Dinu Vlad wrote: >>> Mark, >>> >>> The SSDs are >>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 >>> and the HDDs are >>> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. >>> >>> The chasis is a "SiliconMechanics C602" - but I don't have the exact model. >>> It's based on Supermicro, has 24 slots front and 2 in the back and a SAS >>> expander. >>> >>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according >>> to what the driver reports in dmesg). here are the results (filtered): >>> >>> Sequential: >>> Run status group 0 (all jobs): >>> WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, >>> mint=60444msec, maxt=61463msec >>> >>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave >>> 153:189 MB/s >> >> Ok, that looks like what I'd expect to see given the controller being used. >> SSDs are probably limited by total aggregate throughput. >> >>> >>> Random: >>> Run status group 0 (all jobs): >>> WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, >>> mint=60404msec, maxt=61875msec >>> >>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one >>> out of 6 doing 101) >>> >>> This is on just one of the osd servers. >> >> Where the ceph tests to one OSD server or across all servers? It might be >> worth trying tests against a single server with no replication using >> multiple rados bench instances and just seeing what happens. 
>> >>> >>> Thanks, >>> Dinu >>> >>> >>> On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote: >>> On 10/30/2013 09:05 AM, Dinu Vlad wrote: > Hello, > > I've been doing some tests on a newly installed ceph cluster: > > # ceph osd create bench1 2048 2048 > # ceph osd create bench2 2048 2048 > # rbd -p bench1 create test > # rbd -p bench1 bench-write test --io-pattern rand > elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 > > # rados -p bench2 bench 300 write --show-time > # (run 1) > Total writes made: 20665 > Write size: 4194304 > Bandwidth (MB/sec): 274.923 > > Stddev Bandwidth: 96.3316 > Max bandwidth (MB/sec): 748 > Min bandwidth (MB/sec): 0 > Average Latency:0.23273 > Stddev Latency: 0.262043 > Max latency:1.69475 > Min latency:0.057293 > > These results seem to be quite poor for the configuration: > > MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS > OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board > controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for > journal, attached to a LSI 9207-8i controller. > All servers have dual 10GE network cards, connected to a pair of > dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side. I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio
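The quoted remark about testing "fio on top of a filesystem on RBD" is cut off above; a rough sketch of that style of test, assuming the kernel rbd client is available (pool, image, and mountpoint names are just examples):

# create and map a test image, then put a filesystem on it
rbd -p bench1 create fiotest --size 102400
rbd -p bench1 map fiotest            # shows up as /dev/rbd0 (or similar)
mkfs.xfs /dev/rbd0
mkdir -p /mnt/rbdtest
mount /dev/rbd0 /mnt/rbdtest

# sequential 4M writes through the filesystem on RBD
fio --name=rbd-write --directory=/mnt/rbdtest --size=10G \
    --ioengine=libaio --direct=1 --rw=write --bs=4M --iodepth=16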
Re: [ceph-users] ceph cluster performance
I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before. With a single host, the pgs were "stuck unclean" (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson wrote: > On 10/30/2013 01:51 PM, Dinu Vlad wrote: >> Mark, >> >> The SSDs are >> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 >> and the HDDs are >> http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. >> >> The chasis is a "SiliconMechanics C602" - but I don't have the exact model. >> It's based on Supermicro, has 24 slots front and 2 in the back and a SAS >> expander. >> >> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according >> to what the driver reports in dmesg). here are the results (filtered): >> >> Sequential: >> Run status group 0 (all jobs): >> WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, >> mint=60444msec, maxt=61463msec >> >> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave >> 153:189 MB/s > > Ok, that looks like what I'd expect to see given the controller being used. > SSDs are probably limited by total aggregate throughput. > >> >> Random: >> Run status group 0 (all jobs): >> WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, >> mint=60404msec, maxt=61875msec >> >> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out >> of 6 doing 101) >> >> This is on just one of the osd servers. > > Where the ceph tests to one OSD server or across all servers? It might be > worth trying tests against a single server with no replication using multiple > rados bench instances and just seeing what happens. 
> >> >> Thanks, >> Dinu >> >> >> On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote: >> >>> On 10/30/2013 09:05 AM, Dinu Vlad wrote: Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd create bench1 2048 2048 # ceph osd create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals. >>> >>> Agreed, you should see much higher throughput with that kind of storage >>> setup. What brand/model SSDs are these? Also, what brand and model of >>> chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication >>> though) with a couple of concurrent rados bench processes going on our >>> SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs >>> is definitely on the low side. >>> >>> I'm actually not too familiar with what the RBD benchmarking commands are >>> doing behind the scenes. Typically I've tested fio on top of a filesystem >>> on RBD. >>> Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows) osd_journa
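As an aside on the "stuck unclean" state quoted above: on a single-host cluster the default CRUSH rule wants each replica on a different host, so the second copies can never be placed and the pgs stay active but never reach active+clean. For pure benchmarking the usual workarounds are to drop the pools to one replica (which is what the later "size = 1" run does), or to let replicas share a host before creating the cluster:

# per pool: keep a single copy
ceph osd pool set bench1 size 1
ceph osd pool set bench2 size 1

# or, in ceph.conf before the cluster is created:
# osd crush chooseleaf type = 0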
Re: [ceph-users] ceph cluster performance
On 10/30/2013 01:51 PM, Dinu Vlad wrote:
Mark,

The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec
Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s

Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput.

Random:
Run status group 0 (all jobs):
WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec
Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101)

This is on just one of the osd servers.

Were the ceph tests run against one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.

Thanks,
Dinu

On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote:
On 10/30/2013 09:05 AM, Dinu Vlad wrote:
Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd create bench1 2048 2048
# ceph osd create bench2 2048 2048
# rbd -p bench1 create test
# rbd -p bench1 bench-write test --io-pattern rand
elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36

# rados -p bench2 bench 300 write --show-time
# (run 1)
Total writes made: 20665
Write size: 4194304
Bandwidth (MB/sec): 274.923
Stddev Bandwidth: 96.3316
Max bandwidth (MB/sec): 748
Min bandwidth (MB/sec): 0
Average Latency: 0.23273
Stddev Latency: 0.262043
Max latency: 1.69475
Min latency: 0.057293

These results seem to be quite poor for the configuration:

MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller.
All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals.

Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.

I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD.

Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy.
ceph.conf pretty much out of the box (diff from default follows)

osd_journal_size = 10240
osd mount options xfs = "rw,noatime,nobarrier,inode64"
osd mkfs options xfs = "-f -i size=2048"

[osd]
public network = 10.4.0.0/24
cluster network = 10.254.254.0/24

All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics. I've done a few other tests of the individual components:
- network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
- md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
- fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS

What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently and see how both the per-drive and aggregate throughput is. With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is setup on your system.

I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.

Thanks
Dinu
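To make the "multiple rados bench instances" suggestion concrete, a small sketch (pool name and thread count mirror the 8-process/16-thread runs discussed elsewhere in the thread; each instance prefixes its objects with its own hostname and pid, so concurrent runs don't collide):

# 8 writers, 16 concurrent ops each, against the same pool
for i in $(seq 1 8); do
    rados -p bench2 bench 300 write -t 16 &
done
wait

Summing the per-instance "Bandwidth (MB/sec)" lines then gives the aggregate throughput for the node.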
Re: [ceph-users] ceph cluster performance
Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Random: Run status group 0 (all jobs): WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101) This is on just one of the osd servers. Thanks, Dinu On Oct 30, 2013, at 6:38 PM, Mark Nelson wrote: > On 10/30/2013 09:05 AM, Dinu Vlad wrote: >> Hello, >> >> I've been doing some tests on a newly installed ceph cluster: >> >> # ceph osd create bench1 2048 2048 >> # ceph osd create bench2 2048 2048 >> # rbd -p bench1 create test >> # rbd -p bench1 bench-write test --io-pattern rand >> elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 >> >> # rados -p bench2 bench 300 write --show-time >> # (run 1) >> Total writes made: 20665 >> Write size: 4194304 >> Bandwidth (MB/sec): 274.923 >> >> Stddev Bandwidth: 96.3316 >> Max bandwidth (MB/sec): 748 >> Min bandwidth (MB/sec): 0 >> Average Latency:0.23273 >> Stddev Latency: 0.262043 >> Max latency:1.69475 >> Min latency:0.057293 >> >> These results seem to be quite poor for the configuration: >> >> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS >> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board >> controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for >> journal, attached to a LSI 9207-8i controller. >> All servers have dual 10GE network cards, connected to a pair of dedicated >> switches. Each SSD has 3 10 GB partitions for journals. > > Agreed, you should see much higher throughput with that kind of storage > setup. What brand/model SSDs are these? Also, what brand and model of > chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication > though) with a couple of concurrent rados bench processes going on our SC847A > chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is > definitely on the low side. > > I'm actually not too familiar with what the RBD benchmarking commands are > doing behind the scenes. Typically I've tested fio on top of a filesystem on > RBD. > >> >> Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was >> installed using ceph-deploy. ceph.conf pretty much out of the box (diff from >> default follows) >> >> osd_journal_size = 10240 >> osd mount options xfs = "rw,noatime,nobarrier,inode64" >> osd mkfs options xfs = "-f -i size=2048" >> >> [osd] >> public network = 10.4.0.0/24 >> cluster network = 10.254.254.0/24 >> >> All tests were run from a server outside the cluster, connected to the >> storage network with 2x 10 GE nics. >> >> I've done a few other tests of the individual components: >> - network: avg. 
7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000) >> - md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput >> - fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS > > What you might want to try doing is 4M direct IO writes using libaio and a > high iodepth to all drives (spinning disks and SSDs) concurrently and see how > both the per-drive and aggregate throughput is. > > With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with > Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is > something interesting about the way the hardware is setup on your system. > >> >> I'd appreciate any suggestion that might help improve the performance or >> identify a bottleneck. >> >> Thanks >> Dinu
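For completeness, the kind of point-to-point check behind the iperf numbers at the top of this message (one side listens, the other sends; the target address is a placeholder, and a few parallel streams help saturate a 10GE link):

# on the receiving node
iperf -s
# on the sending node
iperf -c <target-ip> -P 4 -t 30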
Re: [ceph-users] ceph cluster performance
On 10/30/2013 09:05 AM, Dinu Vlad wrote:
Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd create bench1 2048 2048
# ceph osd create bench2 2048 2048
# rbd -p bench1 create test
# rbd -p bench1 bench-write test --io-pattern rand
elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36

# rados -p bench2 bench 300 write --show-time
# (run 1)
Total writes made: 20665
Write size: 4194304
Bandwidth (MB/sec): 274.923
Stddev Bandwidth: 96.3316
Max bandwidth (MB/sec): 748
Min bandwidth (MB/sec): 0
Average Latency: 0.23273
Stddev Latency: 0.262043
Max latency: 1.69475
Min latency: 0.057293

These results seem to be quite poor for the configuration:

MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller.
All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals.

Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.

I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD.

Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows)

osd_journal_size = 10240
osd mount options xfs = "rw,noatime,nobarrier,inode64"
osd mkfs options xfs = "-f -i size=2048"

[osd]
public network = 10.4.0.0/24
cluster network = 10.254.254.0/24

All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics. I've done a few other tests of the individual components:
- network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
- md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
- fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS

What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently and see how both the per-drive and aggregate throughput is. With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is setup on your system.

I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.

Thanks
Dinu
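To make the "4M direct IO writes using libaio and a high iodepth to all drives" suggestion concrete, a sketch as an fio job file (device names are examples only; this writes to the raw devices and destroys their contents, so it is only safe before the OSDs are deployed):

# alldisks.fio -- one section per data disk and journal SSD
[global]
ioengine=libaio
direct=1
rw=write
bs=4M
iodepth=16
runtime=60
time_based

[sdc]
filename=/dev/sdc
[sdd]
filename=/dev/sdd
[journal-ssd1]
filename=/dev/sdu
# ...add the remaining HDDs and SSDs the same way

Run it with "fio alldisks.fio"; the per-job results give per-drive throughput and the "Run status group 0 (all jobs)" summary at the end gives the aggregate, which are the two numbers Mark asks about.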