Re: [ceph-users] Cluster unusable after 50% full, even with index sharding
Hello,

On Fri, 13 Apr 2018 11:59:01 -0500 Robert Stanford wrote:

> I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host), SSD journals
> and spinning disks. Our performance before was acceptable for our purposes:
> 300+MB/s simultaneous transmit and receive. Now that we're up to about
> 50% of our total storage capacity (65/120TB, say), the write performance
> is still OK, but the read performance is unworkable (35MB/s!)

As always, full details, please: versions, HW, which SSDs, which HDDs and
how they are connected, what FS is on the OSDs, etc.

> I am using index sharding, with 256 shards. I don't see any CPUs
> saturated on any host (we are using radosgw, by the way, and the load is
> light there as well). The hard drives don't seem to be *too* busy (a
> random OSD shows ~10 wa in top). The network's fine, as we were doing
> much better in terms of speed before we filled up.

top is an abysmal tool for these things; use atop in a big terminal window
on all 3 hosts for full situational awareness. "iostat -x 3" might do in a
pinch for the IO-related bits, too.
Keep in mind that a single busy OSD will drag the performance of the whole
cluster down.

Other things to check and verify:

1. Are the OSDs reasonably balanced PG-wise?
2. How fragmented are the OSD filesystems?
3. Is a deep scrub running during the low-performance periods?
4. Have you run out of RAM for the page cache and, more importantly, the
   SLAB for dentries, due to the sheer number of objects (files)? If so,
   reads will require many more disk accesses than otherwise. This is a
   typical wall to run into, and it can be mitigated with more RAM and
   sysctl tuning.

(A sketch of commands for these checks follows below this message.)

Christian

> Is there anything we can do about this, short of replacing hardware? Is
> it really a limitation of Ceph that getting 50% full makes your cluster
> unusable? Index sharding has seemed not to help at all (I did some
> benchmarking, with 128 shards and then 256; same result each time.)
>
> Or are we out of luck?

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
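A minimal sketch of commands for the four checks above, assuming XFS-backed
filestore OSDs; the device path is a placeholder, and 'ceph osd df' needs
Hammer or later:

  # 1. PG count and utilisation per OSD -- look for outliers
  ceph osd df tree

  # 2. XFS fragmentation on one OSD data partition (example device)
  xfs_db -r -c frag /dev/sdb1

  # 3. any scrubs or deep scrubs in flight?
  ceph pg dump pgs_brief 2>/dev/null | grep -i scrub

  # 4. dentry/inode SLAB usage; a lower vfs_cache_pressure keeps them cached longer
  slabtop -o | head -20
  sysctl -w vm.vfs_cache_pressure=10   # a test value, not a blanket recommendation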
[ceph-users] Cluster unusable after 50% full, even with index sharding
I have 65TB stored on 24 OSDs on 3 hosts (8 OSDs per host), SSD journals
and spinning disks. Our performance before was acceptable for our purposes:
300+MB/s simultaneous transmit and receive. Now that we're up to about 50%
of our total storage capacity (65/120TB, say), the write performance is
still OK, but the read performance is unworkable (35MB/s!)

I am using index sharding, with 256 shards. I don't see any CPUs saturated
on any host (we are using radosgw, by the way, and the load is light there
as well). The hard drives don't seem to be *too* busy (a random OSD shows
~10 wa in top). The network's fine, as we were doing much better in terms
of speed before we filled up.

Is there anything we can do about this, short of replacing hardware? Is it
really a limitation of Ceph that getting 50% full makes your cluster
unusable? Index sharding has seemed not to help at all (I did some
benchmarking, with 128 shards and then 256; same result each time.)

Or are we out of luck?
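One way to sanity-check the sharding side is to compare object counts
against the number of shards — a sketch, with a placeholder bucket name;
'bucket limit check' needs a reasonably recent (Luminous+) radosgw-admin:

  radosgw-admin bucket stats --bucket=mybucket   # num_objects for the bucket
  radosgw-admin bucket limit check               # objects per shard vs. warning thresholds

The usual rule of thumb is on the order of 100k objects per index shard;
blowing past that tends to show up as slow listings and slow writes,
though, rather than slow reads of object data.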
[ceph-users] Cluster unusable
Hi,

We use Ceph 0.80.7 for our IceHouse PoC. 3 MONs, 3 OSD nodes (ids 10, 11,
12) with 2 OSDs each, 1.5TB of storage in total. 4 pools for RBD, size=2,
512 PGs per pool.

Everything was fine until the middle of last week, and here's what happened:
- OSD node #12 passed away
- AFAICR, ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2
  attached OSDs), and used ceph-deploy to make the node and its 2 OSDs
  join the cluster
- it was looking okay, except that the weight of the 2 OSDs (osd.0 and
  osd.4) was a solid -3.052e-05
- I applied the workaround from http://tracker.ceph.com/issues/9998:
  'ceph osd crush reweight' on both OSDs
- ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday
  evening
- on Monday morning (yesterday), ceph was still busy. Actually the two new
  OSDs were flapping ("map eX wrongly marked me down" messages every
  minute)
- I found the root cause was the firewall on node #12. I opened TCP ports
  6789-6900 and this solved the flapping issue
- ceph kept on reorganising PGs and reached this unhealthy state:
  --- 900 PGs stuck unclean
  --- some 'requests are blocked > 32 sec'
  --- the command 'rbd info images/<image_id>' hung
  --- all tested VMs hung
- so I tried this:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html,
  and removed the 2 new OSDs
- ceph again started rebalancing data, and things were looking better (VMs
  responding, although pretty slowly)
- but in the end, which is the current state, the cluster is back to an
  unhealthy state, and our PoC is stuck

Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm
UTC+1 and then back on Jan 5, so there are around 30 hours left for solving
this PoC sev1 issue. I hope that the community can help me find a solution
before Christmas.

Here are the details (actual host and DC names not shown in these outputs).

[root@MON ~]# date; for im in $(rbd ls images); do echo $im; time rbd info images/$im; done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9
no reply for 2 hours, still ongoing...

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file
no reply for 1 hour, still ongoing.
The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'.

[root@MON ~]# ceph -s
    cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
     health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck
            unclean; 103 requests are blocked > 32 sec;
            noscrub,nodeep-scrub flag(s) set
     monmap e6: 3 mons at {MON01=10.60.9.11:6789/0,MON06=10.60.9.16:6789/0,MON09=10.60.9.19:6789/0},
            election epoch 1338, quorum 0,1,2 MON01,MON06,MON09
     osdmap e42050: 6 osds: 6 up, 6 in
            flags noscrub,nodeep-scrub
      pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
            600 GB used, 1031 GB / 1632 GB avail
                   2 inactive
                2045 active+clean
                   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set

[root@MON]# ceph osd tree
# id    weight  type name               up/down reweight
-1      1.08    root default
-5      0.54            datacenter dc_TWO
-2      0.54                    host node10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0                       host node12
-6      0.54            datacenter dc_ONE
-3      0.54                    host node11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
0       0       osd.0   up      1
4       0       osd.4   up      1

(I'm concerned about the above two ghost OSDs, osd.0 and osd.4...)
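For the record, the two fixes described above amount to something like
this, run on node #12 and on a MON respectively (a sketch; the iptables
form depends on your distribution's firewall setup, and 0.27 is the weight
visible in the tree):

  # open the mon/osd TCP port range used by this firefly-era cluster
  iptables -A INPUT -p tcp --dport 6789:6900 -j ACCEPT

  # workaround from http://tracker.ceph.com/issues/9998:
  # give the two new OSDs a real crush weight
  ceph osd crush reweight osd.0 0.27
  ceph osd crush reweight osd.4 0.27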
Re: [ceph-users] Cluster unusable
Hi François,

Could you paste somewhere the output of 'ceph report' so we can check the
pg dump? (It's probably going to be a little too big for the mailing list.)

You can bring back osd.0 and osd.4 into the host to which they belong
(instead of being at the root of the crush map) with 'crush set':

http://ceph.com/docs/master/rados/operations/crush-map/#add-move-an-osd

They won't be used by ruleset 0 because they are not under the default
bucket. To make sure this happens automagically, you may consider using
osd_crush_update_on_start=true:

http://ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
http://workbench.dachary.org/ceph/ceph/blob/firefly/src/upstart/ceph-osd.conf#L18

Cheers

On 23/12/2014 09:56, Francois Petit wrote:
> [full message quoted in the post above, trimmed]
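A sketch of the 'crush set' move Loïc suggests, using the anonymised names
from the tree above — the weight and the bucket names must match your
actual crush map:

  ceph osd crush set osd.0 0.27 root=default datacenter=dc_TWO host=node12
  ceph osd crush set osd.4 0.27 root=default datacenter=dc_TWO host=node12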
Re: [ceph-users] Cluster unusable
Hi Loïc,

Thanks. I'm trying to find somewhere I can make the report available to you.

[root@qvitblhat06 ~]# ceph report > /tmp/ceph_report
report 3298035134
[root@qvitblhat06 ~]# ls -lh /tmp/ceph_report
-rw-r--r--. 1 root root 4.7M Dec 23 10:38 /tmp/ceph_report
[root@qvitblhat06 ~]#

(Sorry guys for the unwanted ad that was sent in my first email...)

Francois
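Since the report is plain JSON, one way to share only the relevant parts
is to slice it up — a sketch assuming jq is installed; check the key names
against your actual file first:

  jq 'keys' /tmp/ceph_report       # list the top-level sections
  jq '.health' /tmp/ceph_report    # then extract a single section of interest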
Re: [ceph-users] Cluster unusable
Here you go: http://www.filedropper.com/cephreport

Francois
Re: [ceph-users] Cluster unusable
Hi,

I got a recommendation from Stephan to restart the OSDs one by one, so I
did. It helped a bit (some IOs completed), but in the end the state was
the same as before, and new IOs still hung.

Loïc, thanks for the advice on moving osd.0 and osd.4 back into the game.
Actually this was done by simply restarting ceph on that node:

[root@qvitblhat12 ~]# date; service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
=== osd.4 ===
osd.4: running {"version":"0.80.7"}
[root@qvitblhat12 ~]# date; service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight  type name               up/down reweight
-1      1.62    root default
-5      1.08            datacenter dc_XAT
-2      0.54                    host qvitblhat10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0.54                    host qvitblhat12
0       0.27                            osd.0   up      1
4       0.27                            osd.4   up      1
-6      0.54            datacenter dc_QVI
-3      0.54                    host qvitblhat11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
[root@qvitblhat06 ~]#

This change made ceph rebalance data, and then came the miracle: all PGs
ended up active+clean.

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set

Well, apart from being happy that the cluster is now healthy, I find it a
little bit scary to have to shake it in one direction and another and hope
that it will eventually recover, while in the meantime my users' IOs are
stuck...

So is there a way to understand what happened?

Francois
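For what it's worth, the 'create-or-move' lines in the restart output are
the mechanism that put the OSDs back: on start, the init script updates
the OSD's crush position (only root= and host= by default; the OSDs ended
up under dc_XAT because the host bucket already lived there). The
firefly-era knobs controlling this look roughly like this ceph.conf sketch
— the datacenter value here is an assumption based on the tree above:

  [osd]
  # let the init script (re)place the OSD in the crush map on start
  osd crush update on start = true
  # optional per-node override so a fresh OSD lands under its datacenter too
  osd crush location = root=default datacenter=dc_XAT host=qvitblhat12

With an explicit 'osd crush location' (or a crush location hook), a
rebuilt node should rejoin in the right place without the manual
reweight dance.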