Re: [ceph-users] PG stuck peering after host reboot

Wido den Hollander Thu, 16 Feb 2017 22:48:54 -0800

> On 16 February 2017 at 14:55, george.vasilaka...@stfc.ac.uk wrote:
> 
> 
> Hi folks,
> 
> I have just made a tracker for this issue: 
> http://tracker.ceph.com/issues/18960
> I used ceph-post-file to upload some logs from the primary OSD for the 
> troubled PG.
> 
> Any help would be appreciated.
> 
> If we can't get it to peer, we'd like to at least get it unstuck, even if it 
> means data loss.
> 
> What's the proper way to go about doing that?


Can you try this:

1. Log in to the host that carries osd.595
2. Stop OSD 595
3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323

What does osd.595 think about that PG?

You could even try the 'rm-past-intervals' op with ceph-objectstore-tool, but
that might be a bit dangerous, so I wouldn't do it right away.
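
Steps 1-3 above can be sketched as a small dry-run helper. This is only an
illustration, not an official procedure: the function name
build_pg_info_commands is hypothetical, and the systemd unit name
ceph-osd@<id> is an assumption about how the OSD is managed on that host. The
script only prints the commands so they can be reviewed before anything is
run by hand.

```python
import shlex


def build_pg_info_commands(osd_id, pgid, data_root="/var/lib/ceph/osd"):
    """Return the argv lists for stopping an OSD and inspecting one PG offline.

    Hypothetical helper: nothing is executed here, the lists are only built.
    """
    # Assumption: the OSD is managed by systemd under the unit ceph-osd@<id>.
    stop_cmd = ["systemctl", "stop", f"ceph-osd@{osd_id}"]
    # Same invocation as step 3, with the data path derived from the OSD id.
    info_cmd = [
        "ceph-objectstore-tool",
        "--data-path", f"{data_root}/ceph-{osd_id}",
        "--op", "info",
        "--pgid", pgid,
    ]
    return [stop_cmd, info_cmd]


if __name__ == "__main__":
    # Dry run: print each command for review instead of running it.
    for argv in build_pg_info_commands(595, "1.323"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The OSD has to stay stopped while ceph-objectstore-tool has its data path
open, which is why the stop command comes first.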

Wido

> 
> Best regards,
> 
> George
> ________________________________________
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
> Sent: 14 February 2017 10:27
> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] PG stuck peering after host reboot
> 
> Hi Brad,
> 
> I'll be doing so later in the day.
> 
> Thanks,
> 
> George
> ________________________________________
> From: Brad Hubbard [bhubb...@redhat.com]
> Sent: 13 February 2017 22:03
> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
> Subject: Re: [ceph-users] PG stuck peering after host reboot
> 
> I'd suggest creating a tracker and uploading a full debug log from the
> primary so we can look at this in more detail.
> 
> On Mon, Feb 13, 2017 at 9:11 PM,  <george.vasilaka...@stfc.ac.uk> wrote:
> > Hi Brad,
> >
> > I can't tell you that, because `ceph pg 1.323 query` never completes; it
> > just hangs.
> >
> > On 11/02/2017, 00:40, "Brad Hubbard" <bhubb...@redhat.com> wrote:
> >
> >     On Thu, Feb 9, 2017 at 3:36 AM,  <george.vasilaka...@stfc.ac.uk> wrote:
> >     > Hi Corentin,
> >     >
> >     > I've tried that. The primary hangs when I try to injectargs, so I
> >     > set the option in the config file and restarted all OSDs in the PG.
> >     > It came up with:
> >     >
> >     > pg 1.323 is remapped+peering, acting [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >     >
> >     > Still can't query the PG, no error messages in the logs of osd.240.
> >     > The logs on osd.595 and osd.7 still fill up with the same messages.
> >
> >     So what does "peering_blocked_by_detail" show in that case since it
> >     can no longer show "peering_blocked_by_history_les_bound"?
> >
> >     >
> >     > Regards,
> >     >
> >     > George
> >     > ________________________________
> >     > From: Corentin Bonneton [l...@titin.fr]
> >     > Sent: 08 February 2017 16:31
> >     > To: Vasilakakos, George (STFC,RAL,SC)
> >     > Cc: ceph-users@lists.ceph.com
> >     > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >     >
> >     > Hello,
> >     >
> >     > I have run into this before. I applied the parameter
> >     > (osd_find_best_info_ignore_history_les) to all the OSDs that
> >     > reported blocked queries.
> >     >
> >     > --
> >     > Regards,
> >     > CEO FEELB | Corentin BONNETON
> >     > cont...@feelb.io<mailto:cont...@feelb.io>
> >     >
> >     > On 8 Feb 2017 at 17:17, george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk> wrote:
> >     >
> >     > Hi Ceph folks,
> >     >
> >     > I have a cluster running Jewel 10.2.5 using a mix of EC and
> >     > replicated pools.
> >     >
> >     > After rebooting a host last night, one PG refuses to complete peering:
> >     >
> >     > pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >     >
> >     > Restarting OSDs or hosts does nothing to help, or sometimes results
> >     > in things like this:
> >     >
> >     > pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]
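
The 2147483647 in the acting set above is 0x7fffffff, the placeholder CRUSH
uses when no OSD is mapped to a slot (CRUSH_ITEM_NONE). A minimal sketch for
spotting which ranks of an acting set are unfilled, assuming a health line in
the shape quoted above; the helper names parse_acting and missing_shards are
hypothetical:

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, shown when no OSD fills a slot


def parse_acting(line):
    """Extract the bracketed OSD list from a 'pg ... acting [...]' line."""
    start = line.index("[")
    end = line.index("]", start)
    return [int(osd) for osd in line[start + 1:end].split(",")]


def missing_shards(acting):
    """Return the ranks (0-based positions) that have no OSD assigned."""
    return [rank for rank, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]


if __name__ == "__main__":
    line = ("pg 1.323 is remapped+peering, acting "
            "[2147483647,1391,240,127,937,362,267,320,7,634,716]")
    print(missing_shards(parse_acting(line)))
```

For an EC pool the position in the list matters, since each rank holds a
specific shard, so an empty rank means that shard currently has no home.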
> >     >
> >     >
> >     > The host that was rebooted is home to osd.7 (rank 8 in the PG). If I
> >     > go onto it to look at the logs for osd.7, this is what I see:
> >     >
> >     > $ tail -f /var/log/ceph/ceph-osd.7.log
> >     > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
> >     >
> >     > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> 
> > indicates the direction of communication. I've traced these to osd.7 (rank 
> > 8 in the stuck PG) reaching out to osd.595 (the primary in the stuck PG).
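
As far as I know, in messenger log lines of this shape the address before
`>>` is the local endpoint of the daemon writing the log and the one after it
is the peer, and the number after the slash is a nonce that is typically the
daemon's PID. Treat that reading as an assumption. A minimal sketch for
pulling both endpoints out of such a line; the regex and the name
parse_endpoints are hypothetical and only match the format quoted above:

```python
import re

# Matches "-- <addr>:<port>/<nonce> >> <addr>:<port>/<nonce>" as seen above.
ENDPOINTS = re.compile(
    r"-- (?P<local_addr>\S+?):(?P<local_port>\d+)/(?P<local_nonce>\d+)"
    r" >> (?P<peer_addr>\S+?):(?P<peer_port>\d+)/(?P<peer_nonce>\d+)"
)


def parse_endpoints(log_line):
    """Return a dict of local/peer address fields, or None if no match."""
    match = ENDPOINTS.search(log_line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = ("2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- "
            "XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 "
            "pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 "
            "c=0x7f6070086700).fault, initiating reconnect")
    print(parse_endpoints(line))
```

Matching the address and port pairs against the output of `ceph osd dump` is
one way to map them back to OSD ids.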
> >     >
> >     > Meanwhile, looking at the logs of osd.595 I see this:
> >     >
> >     > $ tail -f /var/log/ceph/ceph-osd.595.log
> >     > 2017-02-08 15:41:15.760708 7f1765673700  0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
> >     > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 != exp 3786596716
> >     >
> >     > which again shows osd.595 reaching out to osd.7. From what I could
> >     > gather, the CRC error concerns the messaging layer.
> >     >
> >     > Google searching has yielded nothing particularly useful on how to 
> > get this unstuck.
> >     >
> >     > ceph pg 1.323 query seems to hang forever but it completed once last 
> > night and I noticed this:
> >     >
> >     >            "peering_blocked_by_detail": [
> >     >                {
> >     >                    "detail": "peering_blocked_by_history_les_bound"
> >     >                }
> >     >
> >     > We have seen this before and it was cleared by setting
> >     > osd_find_best_info_ignore_history_les to true for the first two OSDs
> >     > on the stuck PGs (this was on a 3 replica pool). This hasn't worked
> >     > in this case, and I suspect the option needs to be set on either a
> >     > majority of OSDs or at least k OSDs, so that their data can be used
> >     > and the history ignored.
> >     >
> >     > We would really appreciate any guidance and/or help the community can 
> > offer!
> >     >
> >     > _______________________________________________
> >     > ceph-users mailing list
> >     > ceph-users@lists.ceph.com
> >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >     --
> >     Cheers,
> >     Brad
> >
> >
> 
> 
> 
> --
> Cheers,
> Brad
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
