Hi Wido,

In an effort to get the cluster to complete peering on that PG (we need to be
able to use our pool), we have removed osd.595 from the CRUSH map to allow a
new mapping to occur.
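(For reference, the removal was done along these lines; the command names are
standard, but the exact steps on a given cluster may differ:

    # stop the OSD daemon on its host, then drop it from the CRUSH hierarchy
    systemctl stop ceph-osd@595
    ceph osd crush remove osd.595

Once the OSD is out of the CRUSH map, CRUSH recomputes placements and the PG
gets a new up set.)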
When I left the office yesterday, osd.307 had replaced osd.595 in the up set,
but the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was in
a remapped+peering state and recovery was taking place for the other PGs that
lived on that OSD. Worth noting that osd.307 is on the same host as osd.595.

We'll have a look at osd.595 as you suggested.

On 17/02/2017, 06:48, "Wido den Hollander" <w...@42on.com> wrote:
>
>> On 16 February 2017 at 14:55, george.vasilaka...@stfc.ac.uk wrote:
>>
>> Hi folks,
>>
>> I have just made a tracker for this issue:
>> http://tracker.ceph.com/issues/18960
>> I used ceph-post-file to upload some logs from the primary OSD for the
>> troubled PG.
>>
>> Any help would be appreciated.
>>
>> If we can't get it to peer, we'd like to at least get it unstuck, even if
>> it means data loss.
>>
>> What's the proper way to go about doing that?
>
> Can you try this:
>
> 1. Go to the host
> 2. Stop OSD 595
> 3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323
>
> What does osd.595 think about that PG?
>
> You could even try 'rm-past-intervals' with the objectstore tool, but that
> might be a bit dangerous. I wouldn't do that immediately.
>
> Wido
>
>> Best regards,
>>
>> George
>> ________________________________________
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
>> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
>> Sent: 14 February 2017 10:27
>> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>>
>> Hi Brad,
>>
>> I'll be doing so later in the day.
>> Thanks,
>>
>> George
>> ________________________________________
>> From: Brad Hubbard [bhubb...@redhat.com]
>> Sent: 13 February 2017 22:03
>> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>>
>> I'd suggest creating a tracker and uploading a full debug log from the
>> primary so we can look at this in more detail.
>>
>> On Mon, Feb 13, 2017 at 9:11 PM, <george.vasilaka...@stfc.ac.uk> wrote:
>> > Hi Brad,
>> >
>> > I could not tell you that, as `ceph pg 1.323 query` never completes; it
>> > just hangs there.
>> >
>> > On 11/02/2017, 00:40, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>> >
>> > On Thu, Feb 9, 2017 at 3:36 AM, <george.vasilaka...@stfc.ac.uk> wrote:
>> > > Hi Corentin,
>> > >
>> > > I've tried that. The primary hangs when trying to injectargs, so I
>> > > set the option in the config file and restarted all OSDs in the PG; it
>> > > came up with:
>> > >
>> > > pg 1.323 is remapped+peering, acting
>> > > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>> > >
>> > > Still can't query the PG, and there are no error messages in the logs
>> > > of osd.240. The logs on osd.595 and osd.7 still fill up with the same
>> > > messages.
>> >
>> > So what does "peering_blocked_by_detail" show in that case, since it
>> > can no longer show "peering_blocked_by_history_les_bound"?
>> >
>> > > Regards,
>> > >
>> > > George
>> > > ________________________________
>> > > From: Corentin Bonneton [l...@titin.fr]
>> > > Sent: 08 February 2017 16:31
>> > > To: Vasilakakos, George (STFC,RAL,SC)
>> > > Cc: ceph-users@lists.ceph.com
>> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
>> > >
>> > > Hello,
>> > >
>> > > I have had this case before; I applied the parameter
>> > > (osd_find_best_info_ignore_history_les) to all the OSDs that reported
>> > > blocked queries.
>> > > --
>> > > Regards,
>> > > CEO FEELB | Corentin BONNETON
>> > > cont...@feelb.io<mailto:cont...@feelb.io>
>> > >
>> > > On 8 Feb 2017, at 17:17,
>> > > george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk>
>> > > wrote:
>> > >
>> > > Hi Ceph folks,
>> > >
>> > > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated
>> > > pools.
>> > >
>> > > After rebooting a host last night, one PG refuses to complete peering:
>> > >
>> > > pg 1.323 is stuck inactive for 73352.498493, current state peering,
>> > > last acting [595,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > > Restarting OSDs or hosts does nothing to help, or sometimes results
>> > > in things like this:
>> > >
>> > > pg 1.323 is remapped+peering, acting
>> > > [2147483647,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > > The host that was rebooted is home to osd.7 (rank 8 in the PG). If I go
>> > > onto it to look at the logs for osd.7, this is what I see:
>> > >
>> > > $ tail -f /var/log/ceph/ceph-osd.7.log
>> > > 2017-02-08 15:41:00.445247 7f5fcc2bd700 0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
>> > >
>> > > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates
>> > > the direction of communication. I've traced these to osd.7 (rank 8 in
>> > > the stuck PG) reaching out to osd.595 (the primary in the stuck PG).
>> > > Meanwhile, looking at the logs of osd.595, I see this:
>> > >
>> > > $ tail -f /var/log/ceph/ceph-osd.595.log
>> > > 2017-02-08 15:41:15.760708 7f1765673700 0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
>> > > 2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384 != exp 3786596716
>> > >
>> > > which again shows osd.595 reaching out to osd.7, and from what I could
>> > > gather the CRC problem is about messaging.
>> > >
>> > > Google searching has yielded nothing particularly useful on how to get
>> > > this unstuck.
>> > >
>> > > `ceph pg 1.323 query` seems to hang forever, but it completed once last
>> > > night and I noticed this:
>> > >
>> > > "peering_blocked_by_detail": [
>> > >     {
>> > >         "detail": "peering_blocked_by_history_les_bound"
>> > >     }
>> > > ]
>> > >
>> > > We have seen this before, and it was cleared by setting
>> > > osd_find_best_info_ignore_history_les to true for the first two OSDs
>> > > on the stuck PG (that was on a 3-replica pool). It hasn't worked in
>> > > this case, and I suspect the option needs to be set on either a
>> > > majority of the OSDs, or on at least k OSDs, to be able to use their
>> > > data and ignore history.
>> > >
>> > > We would really appreciate any guidance and/or help the community can
>> > > offer!
>> > > _______________________________________________
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> > --
>> > Cheers,
>> > Brad
>>
>> --
>> Cheers,
>> Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com