> On 1 September 2016 at 18:55, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
>
>
> Thanks Wido. Reed and I have been working together to try to restore this
> cluster for about 3 weeks now. I have been accumulating a number of failure
> modes that I am hoping to share with the Ceph group soon, but have been
> holding off a bit until we see the full picture clearly so that we can
> provide some succinct observations.
>
> We know that losing 6 of 8 OSDs was definitely going to result in data loss,
> so I think we are resigned to that. What has been difficult for us is that
> there have been many steps in the rebuild process that seem to get stuck and
> need our intervention. But it is not 100% obvious what interventions we
> should be applying.
>
> My very over-simplified hope was this:
>
> 1. We would remove the corrupted OSDs from the cluster
> 2. We would replace them with new OSDs
> 3. Ceph would figure out that a lot of PGs were lost
> 4. We would "agree and say okay -- lose the objects/files"
> 5. The cluster would use what remains and return to a working state
>
> I feel we have done something wrong along the way, and at this point we are
> trying to figure out how to do step #4 completely. We are about to follow
> the steps to "mark unfound lost", which makes sense to me... but I'm not sure
> what to do about all the other inconsistencies.
>
> What procedure do we need to follow to just tell Ceph "those PGs are lost,
> let's move on"?
>
> ===
>
> A very quick history of what we did to get here:
>
> 1. 8 OSDs lost power simultaneously.
> 2. 2 OSDs came back without issues.
> 3. 1 OSD wouldn't start (various assertion failures), but we were able to
>    copy its PGs to a new OSD as follows:
>    3.1. ceph-objectstore-tool "export"
>    3.2. ceph osd crush rm osd.N
>    3.3. ceph auth del osd.N
>    3.4. ceph osd rm osd.N
>    3.5. Create a new OSD from scratch (it got a new OSD ID)
>    3.6. ceph-objectstore-tool "import"
> 4. The remaining 5 OSDs were corrupt beyond repair (could not export, mostly
>    due to missing leveldb files after xfs_repair). We redeployed them as
>    follows:
>    4.1. ceph osd crush rm osd.N
>    4.2. ceph auth del osd.N
>    4.3. ceph osd rm osd.N
>    4.4. Create a new OSD from scratch (it got the same OSD ID as the old OSD)
>
> All the new OSDs from #4.4 ended up getting the same OSD ID as the original
> OSD. Don't know if that is part of the problem? It seems like doing the
> "crush rm" should have informed the cluster correctly, but perhaps not?
>
> Where did we go wrong in the recovery process?
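For reference, the "mark unfound lost" procedure Dan mentions maps roughly onto
the commands below. This is only a minimal sketch assuming the standard
Jewel-era CLI; the OSD id (6) and PG id (1.23) are placeholders, not values
taken from this cluster.

    # Tell the cluster that a dead OSD's data is permanently gone
    ceph osd lost 6 --yes-i-really-mean-it

    # Inspect the unfound objects in an affected PG, then give up on them:
    # 'revert' rolls each object back to a prior version if one exists,
    # 'delete' forgets the objects entirely
    ceph pg 1.23 list_missing
    ceph pg 1.23 mark_unfound_lost revert

Marking an OSD lost only lets peering move past the dead device; it does not
bring any data back.
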
You have to mark those OSDs as lost and also force create the incomplete PGs.

But I think you have lost so many objects that the cluster is beyond a point of
repair, honestly.

Wido

>
> Thank you!
>
> -- Dan
>
>
> On Sep 1, 2016, at 00:18, Wido den Hollander <w...@42on.com> wrote:
>
>
> >> On 31 August 2016 at 23:21, Reed Dier <reed.d...@focusvq.com> wrote:
> >>
> >>
> >> Multiple XFS corruptions, multiple leveldb issues. Looked to be the result
> >> of write cache settings, which have been adjusted now.
> >>
> >
> > That is bad news, really bad.
> >
> >> You'll see below that there are tons of PGs in bad states, and it was
> >> slowly but surely bringing the number of bad PGs down, but it seems to
> >> have hit a brick wall with this one slow request operation.
> >>
> >
> > No, you have more issues. You have 17 PGs which are incomplete, and a few
> > down+incomplete.
> >
> > Without those PGs functioning (active+X) your MDS will probably not work.
> >
> > Take a look at:
> > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> >
> > Make sure you get to HEALTH_WARN first; in HEALTH_ERR the MDS will never
> > come online.
> >
> > Wido
> >
> >>> ceph -s
> >>>     cluster []
> >>>      health HEALTH_ERR
> >>>             292 pgs are stuck inactive for more than 300 seconds
> >>>             142 pgs backfill_wait
> >>>             135 pgs degraded
> >>>             63 pgs down
> >>>             80 pgs incomplete
> >>>             199 pgs inconsistent
> >>>             2 pgs recovering
> >>>             5 pgs recovery_wait
> >>>             1 pgs repair
> >>>             132 pgs stale
> >>>             160 pgs stuck inactive
> >>>             132 pgs stuck stale
> >>>             71 pgs stuck unclean
> >>>             128 pgs undersized
> >>>             1 requests are blocked > 32 sec
> >>>             recovery 5301381/46255447 objects degraded (11.461%)
> >>>             recovery 6335505/46255447 objects misplaced (13.697%)
> >>>             recovery 131/20781800 unfound (0.001%)
> >>>             14943 scrub errors
> >>>             mds cluster is degraded
> >>>      monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
> >>>             election epoch 262, quorum 0,1,2 core,dev,db
> >>>       fsmap e3627: 1/1/1 up {0=core=up:replay}
> >>>      osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
> >>>             flags sortbitwise
> >>>       pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
> >>>             8998 GB used, 50598 GB / 59596 GB avail
> >>>             5301381/46255447 objects degraded (11.461%)
> >>>             6335505/46255447 objects misplaced (13.697%)
> >>>             131/20781800 unfound (0.001%)
> >>>                  209 active+clean
> >>>                  170 active+clean+inconsistent
> >>>                  112 stale+active+clean
> >>>                   74 undersized+degraded+remapped+wait_backfill+peered
> >>>                   63 down+incomplete
> >>>                   48 active+undersized+degraded+remapped+wait_backfill
> >>>                   19 stale+active+clean+inconsistent
> >>>                   17 incomplete
> >>>                   12 active+remapped+wait_backfill
> >>>                    5 active+recovery_wait+degraded
> >>>                    4 undersized+degraded+remapped+inconsistent+wait_backfill+peered
> >>>                    4 active+remapped+inconsistent+wait_backfill
> >>>                    2 active+recovering+degraded
> >>>                    2 undersized+degraded+remapped+peered
> >>>                    1 stale+active+clean+scrubbing+deep+inconsistent+repair
> >>>                    1 active+clean+scrubbing+deep
> >>>                    1 active+clean+scrubbing+inconsistent
> >>
> >>
> >> Thanks,
> >>
> >> Reed
> >>
> >>> On Aug 31, 2016, at 4:08 PM, Wido den Hollander <w...@42on.com> wrote:
> >>>
> >>>>
> >>>> On 31 August 2016 at 22:56, Reed Dier <reed.d...@focusvq.com> wrote:
> >>>>
> >>>>
> >>>> After a power failure left our jewel cluster crippled, I have hit a
> >>>> sticking point in attempted recovery.
> >>>>
> >>>> Out of 8 OSDs, we likely lost 5-6; trying to salvage what we can.
> >>>>
> >>>
> >>> That's probably too much. How do you mean lost? Is XFS crippled/corrupted?
> >>> That shouldn't happen.
> >>>
> >>>> In addition to rados pools, we were also using CephFS, and the
> >>>> cephfs.metadata and cephfs.data pools likely lost plenty of PGs.
> >>>>
> >>>
> >>> What is the status of all PGs? What does 'ceph -s' show?
> >>>
> >>> Are all PGs active? That is something which needs to be done first.
> >>>
> >>>> The mds has reported this ever since returning from the power loss:
> >>>>> # ceph mds stat
> >>>>> e3627: 1/1/1 up {0=core=up:replay}
> >>>>
> >>>>
> >>>> When looking at the slow request on the osd, it shows this task which I
> >>>> can't quite figure out. Any help appreciated.
> >>>>
> >>>
> >>> Are all clients (including MDS) and OSDs running the same version?
> >>>
> >>> Wido
> >>>
> >>>>> # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
> >>>>> {
> >>>>>     "ops": [
> >>>>>         {
> >>>>>             "description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded)
> >>>>>                 ack+retry+read+known_if_redirected+full_force e3668)",
> >>>>>             "initiated_at": "2016-08-31 10:37:18.833644",
> >>>>>             "age": 22212.235361,
> >>>>>             "duration": 22212.235379,
> >>>>>             "type_data": [
> >>>>>                 "no flag points reached",
> >>>>>                 [
> >>>>>                     {
> >>>>>                         "time": "2016-08-31 10:37:18.833644",
> >>>>>                         "event": "initiated"
> >>>>>                     }
> >>>>>                 ]
> >>>>>             ]
> >>>>>         }
> >>>>>     ],
> >>>>>     "num_ops": 1
> >>>>> }
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Reed

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
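
For completeness, Wido's suggestion at the top of this message ("mark those
OSDs as lost and also force create the incomplete PGs") corresponds roughly to
the sketch below, again assuming the Jewel-era CLI (force_create_pg was renamed
in later releases); the PG id 2.1f is a placeholder, not a value from this
cluster.

    # See which PGs are incomplete/down and which OSDs they are waiting for
    ceph health detail | grep -E 'incomplete|down'
    ceph pg dump_stuck inactive
    ceph pg 2.1f query    # check "recovery_state" / "down_osds_we_would_probe"

    # After the dead OSDs have been marked lost, recreate PGs whose data is
    # accepted as gone (the PG comes back empty; the data is not recovered)
    ceph pg force_create_pg 2.1f

Force-creating a PG recreates it empty, so it is generally treated as a last
resort once the data in that PG is accepted as lost.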