> On 1 September 2016 at 18:55, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
>
>
> Thanks Wido. Reed and I have been working together to try to restore this
> cluster for about 3 weeks now. I have been accumulating a number of failure
> modes that I am hoping to share with the Ceph group soon, but have been
> holding off a bit until we see the full picture clearly so that we can
> provide some succinct observations.
>
> We know that losing 6 of 8 OSDs was definitely going to result in data loss,
> so I think we are resigned to that. What has been difficult for us is that
> there have been many steps in the rebuild process that seem to get stuck and
> need our intervention. But it is not 100% obvious what interventions we
> should be applying.
>
> My very over-simplified hope was this:
>
> 1. We would remove the corrupted OSDs from the cluster
> 2. We would replace them with new OSDs
> 3. Ceph would figure out that a lot of PGs were lost
> 4. We would "agree and say okay -- lose the objects/files"
> 5. The cluster would use what remains and return to a working state
>
> I feel we have done something wrong along the way, and at this point we are
> trying to figure out how to do step #4 completely. We are about to follow
> the steps to "mark unfound lost", which makes sense to me... but I'm not sure
> what to do about all the other inconsistencies.
>
> What procedure do we need to follow to just tell Ceph "those PGs are lost,
> let's move on"?
>
> ===
>
> A very quick history of what we did to get here:
>
> 1. 8 OSDs lost power simultaneously.
> 2. 2 OSDs came back without issues.
> 3. 1 OSD wouldn't start (various assertion failures), but we were able to
>    copy its PGs to a new OSD as follows:
>    3.1. ceph-objectstore-tool "export"
>    3.2. ceph osd crush rm osd.N
>    3.3. ceph auth del osd.N
>    3.4. ceph osd rm osd.N
>    3.5. Create a new OSD from scratch (it got a new OSD ID)
>    3.6. ceph-objectstore-tool "import"
> 4. The remaining 5 OSDs were corrupt beyond repair (could not export, mostly
>    due to missing leveldb files after xfs_repair). We redeployed them as
>    follows:
>    4.1. ceph osd crush rm osd.N
>    4.2. ceph auth del osd.N
>    4.3. ceph osd rm osd.N
>    4.4. Create a new OSD from scratch (it got the same OSD ID as the old OSD)
>
> All the new OSDs from #4.4 ended up getting the same OSD ID as the original
> OSD. Don't know if that is part of the problem? It seems like doing the
> "crush rm" should have informed the cluster correctly, but perhaps not?
>
> Where did we go wrong in the recovery process?
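For reference, the "mark unfound lost" procedure Dan mentions maps roughly onto
the commands below. This is only a minimal sketch assuming the standard
Jewel-era CLI; the OSD id (6) and PG id (1.23) are placeholders, not values
taken from this cluster.

    # Tell the cluster that a dead OSD's data is permanently gone
    ceph osd lost 6 --yes-i-really-mean-it

    # Inspect the unfound objects in an affected PG, then give up on them:
    # 'revert' rolls each object back to a prior version if one exists,
    # 'delete' forgets the objects entirely
    ceph pg 1.23 list_missing
    ceph pg 1.23 mark_unfound_lost revert

Marking an OSD lost only lets peering move past the dead device; it does not
bring any data back.
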
You have to mark those OSDs as lost and also force create the incomplete PGs.

But I think you have lost so many objects that the cluster is beyond a point of
repair, honestly.

Wido

>
> Thank you!
>
> -- Dan
>
>
> On Sep 1, 2016, at 00:18, Wido den Hollander <w...@42on.com> wrote:
>
>
> >> On 31 August 2016 at 23:21, Reed Dier <reed.d...@focusvq.com> wrote:
> >>
> >>
> >> Multiple XFS corruptions, multiple leveldb issues. Looked to be the result
> >> of write cache settings, which have been adjusted now.
> >>
> >
> > That is bad news, really bad.
> >
> >> You'll see below that there are tons of PGs in bad states, and it was
> >> slowly but surely bringing the number of bad PGs down, but it seems to
> >> have hit a brick wall with this one slow request operation.
> >>
> >
> > No, you have more issues. You have 17 PGs which are incomplete, and a few
> > down+incomplete.
> >
> > Without those PGs functioning (active+X) your MDS will probably not work.
> >
> > Take a look at:
> > http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> >
> > Make sure you get to HEALTH_WARN first; in HEALTH_ERR the MDS will never
> > come online.
> >
> > Wido
> >
> >>> ceph -s
> >>>     cluster []
> >>>      health HEALTH_ERR
> >>>             292 pgs are stuck inactive for more than 300 seconds
> >>>             142 pgs backfill_wait
> >>>             135 pgs degraded
> >>>             63 pgs down
> >>>             80 pgs incomplete
> >>>             199 pgs inconsistent
> >>>             2 pgs recovering
> >>>             5 pgs recovery_wait
> >>>             1 pgs repair
> >>>             132 pgs stale
> >>>             160 pgs stuck inactive
> >>>             132 pgs stuck stale
> >>>             71 pgs stuck unclean
> >>>             128 pgs undersized
> >>>             1 requests are blocked > 32 sec
> >>>             recovery 5301381/46255447 objects degraded (11.461%)
> >>>             recovery 6335505/46255447 objects misplaced (13.697%)
> >>>             recovery 131/20781800 unfound (0.001%)
> >>>             14943 scrub errors
> >>>             mds cluster is degraded
> >>>      monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
> >>>             election epoch 262, quorum 0,1,2 core,dev,db
> >>>       fsmap e3627: 1/1/1 up {0=core=up:replay}
> >>>      osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
> >>>             flags sortbitwise
> >>>       pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
> >>>             8998 GB used, 50598 GB / 59596 GB avail
> >>>             5301381/46255447 objects degraded (11.461%)
> >>>             6335505/46255447 objects misplaced (13.697%)
> >>>             131/20781800 unfound (0.001%)
> >>>                  209 active+clean
> >>>                  170 active+clean+inconsistent
> >>>                  112 stale+active+clean
> >>>                   74 undersized+degraded+remapped+wait_backfill+peered
> >>>                   63 down+incomplete
> >>>                   48 active+undersized+degraded+remapped+wait_backfill
> >>>                   19 stale+active+clean+inconsistent
> >>>                   17 incomplete
> >>>                   12 active+remapped+wait_backfill
> >>>                    5 active+recovery_wait+degraded
> >>>                    4 undersized+degraded+remapped+inconsistent+wait_backfill+peered
> >>>                    4 active+remapped+inconsistent+wait_backfill
> >>>                    2 active+recovering+degraded
> >>>                    2 undersized+degraded+remapped+peered
> >>>                    1 stale+active+clean+scrubbing+deep+inconsistent+repair
> >>>                    1 active+clean+scrubbing+deep
> >>>                    1 active+clean+scrubbing+inconsistent
> >>
> >>
> >> Thanks,
> >>
> >> Reed
> >>
> >>> On Aug 31, 2016, at 4:08 PM, Wido den Hollander <w...@42on.com> wrote:
> >>>
> >>>>
> >>>> On 31 August 2016 at 22:56, Reed Dier <reed.d...@focusvq.com> wrote:
> >>>>
> >>>>
> >>>> After a power failure left our jewel cluster crippled, I have hit a
> >>>> sticking point in attempted recovery.
> >>>>
> >>>> Out of 8 OSDs, we likely lost 5-6; trying to salvage what we can.
> >>>>
> >>>
> >>> That's probably too much. How do you mean lost? Is XFS crippled/corrupted?
> >>> That shouldn't happen.
> >>>
> >>>> In addition to rados pools, we were also using CephFS, and the
> >>>> cephfs.metadata and cephfs.data pools likely lost plenty of PGs.
> >>>>
> >>>
> >>> What is the status of all PGs? What does 'ceph -s' show?
> >>>
> >>> Are all PGs active? That is something which needs to be done first.
> >>>
> >>>> The mds has reported this ever since returning from the power loss:
> >>>>> # ceph mds stat
> >>>>> e3627: 1/1/1 up {0=core=up:replay}
> >>>>
> >>>>
> >>>> When looking at the slow request on the osd, it shows this task which I
> >>>> can't quite figure out. Any help appreciated.
> >>>>
> >>>
> >>> Are all clients (including MDS) and OSDs running the same version?
> >>>
> >>> Wido
> >>>
> >>>>> # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
> >>>>> {
> >>>>>     "ops": [
> >>>>>         {
> >>>>>             "description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded)
> >>>>>                 ack+retry+read+known_if_redirected+full_force e3668)",
> >>>>>             "initiated_at": "2016-08-31 10:37:18.833644",
> >>>>>             "age": 22212.235361,
> >>>>>             "duration": 22212.235379,
> >>>>>             "type_data": [
> >>>>>                 "no flag points reached",
> >>>>>                 [
> >>>>>                     {
> >>>>>                         "time": "2016-08-31 10:37:18.833644",
> >>>>>                         "event": "initiated"
> >>>>>                     }
> >>>>>                 ]
> >>>>>             ]
> >>>>>         }
> >>>>>     ],
> >>>>>     "num_ops": 1
> >>>>> }
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Reed

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
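
For completeness, Wido's suggestion at the top of this message ("mark those
OSDs as lost and also force create the incomplete PGs") corresponds roughly to
the sketch below, again assuming the Jewel-era CLI (force_create_pg was renamed
in later releases); the PG id 2.1f is a placeholder, not a value from this
cluster.

    # See which PGs are incomplete/down and which OSDs they are waiting for
    ceph health detail | grep -E 'incomplete|down'
    ceph pg dump_stuck inactive
    ceph pg 2.1f query    # check "recovery_state" / "down_osds_we_would_probe"

    # After the dead OSDs have been marked lost, recreate PGs whose data is
    # accepted as gone (the PG comes back empty; the data is not recovered)
    ceph pg force_create_pg 2.1f

Force-creating a PG recreates it empty, so it is generally treated as a last
resort once the data in that PG is accepted as lost.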