Nice catch.  That was a copy-paste error.  Sorry about that.

It should have read:

 3. Flush the journal and export the primary version of the PG.  This took
1 minute on a well-behaved PG and 4 hours on the misbehaving PG.  (The flush
command itself is sketched just below this list.)
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
--journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
--file /root/32.10c.b.export

  4. Import the PG into a new / temporary OSD that is also offline (a quick
check to confirm the import landed is sketched below).
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
--journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import
--file /root/32.10c.b.export
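
For anyone following along, the journal flush in step 3 is a separate command
run before the export, while the OSD daemon is still stopped.  On a filestore
OSD it should look roughly like this (adjust the id to match the data path):
   i.e.   ceph-osd -i 16 --flush-journal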

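Before starting anything back up, it is worth confirming the PG actually
landed on the temporary OSD.  Listing the PGs on it with the same tool should
show 32.10c (this is a read-only check):
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
--journal-path /var/lib/ceph/osd/ceph-100/journal --op list-pgs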

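A couple of the other commands from the procedure quoted below, in case they
save someone a search.  The flags in step 1 are set and later unset with:
   ceph osd set noout
   ceph osd set nobackfill
   ceph osd unset nobackfill
   ceph osd unset noout
The removal in step 5 uses the same tool with --op remove; only run it once
you have a verified export, and repeat it for each OSD listed in that step.
Something like:
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143
--journal-path /var/lib/ceph/osd/ceph-143/journal --pgid 32.10c --op remove
Backfill progress for step 7 can be watched with "ceph -w" or "ceph pg 32.10c
query".
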
On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard <bhubb...@redhat.com> wrote:

> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
> <brandon.morris....@gmail.com> wrote:
>
> > The only way that I was able to get back to Health_OK was to
> export/import.  ***** Please note, any time you use the
> ceph_objectstore_tool you risk data loss if not done carefully.   Never
> remove a PG until you have a known good export *****
> >
> > Here are the steps I used:
> >
> > 1. set NOOUT, NO BACKFILL
> > 2. Stop the OSD's that have the erroring PG
> > 3. Flush the journal and export the primary version of the PG.  This
> took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
> >
> > 4. Import the PG into a New / Temporary OSD that is also offline,
> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
> This should be an import op and presumably to a different data path
> and journal path more like the following?
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
> import --file /root/32.10c.b.export
>
> Just trying to clarify for anyone that comes across this thread in the
> future.
>
> Cheers,
> Brad
>
> >
> > 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your
> case it looks like)
> > 6. Start cluster OSD's
> > 7. Start the temporary OSD's and ensure 32.10c backfills correctly to
> the 3 OSD's it is supposed to be on.
> >
> > This is similar to the recovery process described in this post from
> 04/09/2015:
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>  Hopefully it works in your case too and you can get the cluster back to a
> state where you can make the CephFS directories smaller.
> >
> > - Brandon
>