Thanks Wido, my comments inline...

> Date: Mon, 30 Dec 2013 14:04:35 +0100
> From: Wido den Hollander <w...@42on.com>
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
> On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a Ceph cluster with radosgw; our customer is using the S3 API to access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>    pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or backfilling in progress.#
>
> While the performance of the cluster (only with WRITE traffic) was stable until Dec 25th, our monitoring (of the radosgw access log) shows a significant increase in average latency and 99th percentile latency.
>
> I then picked one OSD host, grepped its logs for slow requests, and found that most of the slow requests were waiting for subops; take osd22 as an example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | grep "2013-12-25 16" | grep osd_op | grep -oP "\d+,\d+"; done > ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
>    3586 656,598
>     289 467,629
>     284 598,763
>     279 584,598
>     203 172,598
>     182 598,6
>     155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>       9 681,609
>       9 609,595
>       9 591,772
>       8 613,662
>       8 575,591
>       7 674,722
>       7 609,603
>       6 585,605
>       5 613,691
>       5 293,629
>       4 774,591
>       4 717,591
>       4 613,776
>       4 538,629
>       4 485,629
>       3 702,641
>       3 608,629
>       3 593,580
>       3 591,676
>
> It turns out that most of the slow requests were waiting for osd 598 and 629; I ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dumped the perf counters for comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.2888846379615,10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is the op hotspot as well)
>
> This double-confirmed that osd 598 was having performance issues (it has around *30 seconds average op latency*!).
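(Side note, since ~/do_calc_op_latency.pl itself is not quoted in the thread: below is a minimal sketch of the kind of calculation such a script might be doing. It assumes a dumpling-era perf dump layout where the "osd" section exposes "op_latency" and "subop_latency" as {"avgcount", "sum"} pairs; the file name calc_op_latency.py and the definition of total_ops are illustrative assumptions, not the original script.)

#!/usr/bin/env python
# Hypothetical stand-in for do_calc_op_latency.pl. Reads the JSON printed by
# "ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump" on stdin and
# prints average op latency, average subop latency and an op count.
# Assumption: the "osd" section exposes "op_latency" and "subop_latency" as
# {"avgcount": ..., "sum": ...}; key names can differ between Ceph releases.
import json
import sys

def avg(counter):
    # Average latency in seconds for an {avgcount, sum} perf counter.
    count = counter.get("avgcount", 0)
    return counter.get("sum", 0.0) / count if count else 0.0

perf = json.load(sys.stdin)
osd = perf["osd"]
# total_ops is taken here as op count + subop count; the original perl script
# may have counted this differently.
total_ops = osd["op_latency"]["avgcount"] + osd["subop_latency"]["avgcount"]
print("%s,%s,%s" % (avg(osd["op_latency"]), avg(osd["subop_latency"]), total_ops))

It would be used in place of the perl script in the loop above, e.g.:
-bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | python ~/calc_op_latency.py; done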
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf), but the latency difference is not as significant as what we saw from the OSD perf counters.
> reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
>  37.3   459.9    89.8   4106.9   61.8    1.6   12.2    4.9
>  42.3   545.8    91.8   4296.3   69.7    2.4   17.6    5.2
>  42.0   483.8    93.1   4263.6   68.8    1.8   13.3    5.1
>  39.7   425.5    89.4   4327.0   68.5    1.8   14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb):
> reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
>  34.2   502.6    80.1   3524.3   53.4    1.3   11.8    4.7
>  35.3   560.9    83.7   3742.0   56.0    1.2    9.8    4.7
>  30.4   371.5    78.8   3631.4   52.2    1.7   15.8    4.8
>  33.0   389.4    78.8   3597.6   54.2    1.4   12.1    4.8
>
> Any idea why a couple of OSDs are so slow that they impact the performance of the entire cluster?

> What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a performance degradation over time, so if you run btrfs, that might be the problem.

[yguang] We are running on XFS; the journal and data share the same disk, on different partitions.

> Wido

> Thanks,
> Guang
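(One more illustrative helper, not part of the original procedure: the "osd,osd" peer pairs collected in ~/slow_osd.txt can be collapsed into a per-OSD count, which makes the "waiting for osd 598 and 629" observation explicit. The script name count_slow_peers.py is made up; it assumes one pair per line, as produced by the grep loop quoted earlier.)

#!/usr/bin/env python
# Count how often each OSD id appears in the "<osd>,<osd>" pairs extracted
# from the slow request log lines, i.e. which peer OSDs the slow requests
# were waiting for most often.
import collections
import sys

waits = collections.Counter()
path = sys.argv[1] if len(sys.argv) > 1 else "slow_osd.txt"
with open(path) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        for osd_id in line.split(","):
            waits[osd_id] += 1

for osd_id, count in waits.most_common(10):
    print("osd.%s\t%d" % (osd_id, count))

Run as "python ~/count_slow_peers.py ~/slow_osd.txt"; on the pair counts quoted above it puts osd.598 and osd.629 at the top of the list.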