We recently had a few inconsistent PGs crop up on one of our clusters, and
I wanted to describe the process I used to repair them, both for review and
perhaps to help someone in the future.

Our state roughly matched what David described in this comment:

http://tracker.ceph.com/issues/21388#note-1

However, we were missing the object entirely on the primary OSD. This may
have been due to previous manual repair attempts, but the exact cause of
the missing object is unclear.

To get the PG into a state consistent with David's comment, I exported
what appeared to be the "good" copy of the PG using ceph-objectstore-tool
and imported it into the primary OSD.
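
For anyone following along, here is a rough sketch of what that
export/import step can look like. It only builds and prints the command
lines rather than running them. The OSD ids, PG id, file path and data
directory are hypothetical placeholders, not the values from our cluster.
Note that ceph-objectstore-tool works on a stopped OSD, and FileStore OSDs
may additionally need --journal-path.

```python
import shlex

# Assumption: default OSD data directory layout; adjust for your deployment.
OSD_DATA_ROOT = "/var/lib/ceph/osd"


def export_pg_cmd(osd_id, pgid, out_file):
    """Build the argv that exports one PG from a (stopped) OSD to out_file."""
    return [
        "ceph-objectstore-tool",
        "--data-path", f"{OSD_DATA_ROOT}/ceph-{osd_id}",
        "--pgid", pgid,
        "--op", "export",
        "--file", out_file,
    ]


def import_pg_cmd(osd_id, in_file):
    """Build the argv that imports an exported PG into a (stopped) OSD."""
    return [
        "ceph-objectstore-tool",
        "--data-path", f"{OSD_DATA_ROOT}/ceph-{osd_id}",
        "--op", "import",
        "--file", in_file,
    ]


if __name__ == "__main__":
    # Hypothetical values: osd.7 holds the good copy, osd.3 is the primary.
    print(shlex.join(export_pg_cmd(7, "1.2f", "/tmp/pg.1.2f.export")))
    print(shlex.join(import_pg_cmd(3, "/tmp/pg.1.2f.export")))
```

Printing the commands first makes it easy to double-check the OSD ids and
PG id before running anything against a stopped OSD.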

At this point, a repair would consistently leave "rados
list-inconsistent-obj" returning an empty listing (although the PG was
still flagged inconsistent), and a deep-scrub would bring the
"list-inconsistent-obj" output back to the state David described. However,
"rados get" on the object resulted in I/O errors.
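
The commands involved in this cycle are "ceph pg repair <pgid>", "ceph pg
deep-scrub <pgid>" and "rados list-inconsistent-obj <pgid> --format=json".
As a small illustration of how to tell the "empty listing" state from the
"errors reported" state, the sketch below summarizes that JSON output. The
sample data is made up for illustration (object name, OSD ids and error
strings are hypothetical) and only mirrors the general shape of the
report.

```python
import json


def summarize_inconsistents(report):
    """Return a list of (object name, object errors, {osd: shard errors})."""
    summary = []
    for item in report.get("inconsistents", []):
        shards = {s["osd"]: s.get("errors", []) for s in item.get("shards", [])}
        summary.append((item["object"]["name"], item.get("errors", []), shards))
    return summary


# Made-up sample in the general shape of
#   rados list-inconsistent-obj <pgid> --format=json
SAMPLE = json.loads("""
{
  "epoch": 100,
  "inconsistents": [
    {
      "object": {"name": "rbd_data.abc123.0000000000000001", "snap": "head"},
      "errors": [],
      "union_shard_errors": ["read_error"],
      "shards": [
        {"osd": 3, "errors": ["read_error"]},
        {"osd": 7, "errors": []}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    # After a deep-scrub: one object listed, with per-shard errors.
    print(summarize_inconsistents(SAMPLE))
    # After a repair in our case: an empty listing.
    print(summarize_inconsistents({"epoch": 101, "inconsistents": []}))
```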

I then used ceph-objectstore-tool again, this time with the "get-bytes"
operation, to dump the object's contents to a file, and wrote that file
back with "rados put".
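
A sketch of that last step, again only building and printing the command
lines. The pool, object name, OSD id, PG id and file path are hypothetical
placeholders. The get-bytes step needs the OSD stopped, while "rados put"
needs the cluster serving I/O for that PG again. For objects with
snapshots, the JSON object spec printed by "--op list" may be needed
instead of the bare name.

```python
import shlex

# Assumption: default OSD data directory layout; adjust for your deployment.
OSD_DATA_ROOT = "/var/lib/ceph/osd"


def get_bytes_cmd(osd_id, pgid, obj, out_file):
    """Build the argv that dumps one object's data from a (stopped) OSD."""
    return [
        "ceph-objectstore-tool",
        "--data-path", f"{OSD_DATA_ROOT}/ceph-{osd_id}",
        "--pgid", pgid,
        obj,
        "get-bytes", out_file,
    ]


def rados_put_cmd(pool, obj_name, in_file):
    """Build the argv that writes a local file back as a RADOS object."""
    return ["rados", "-p", pool, "put", obj_name, in_file]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    obj = "rbd_data.abc123.0000000000000001"
    print(shlex.join(get_bytes_cmd(7, "1.2f", obj, "/tmp/obj.bin")))
    print(shlex.join(rados_put_cmd("rbd", obj, "/tmp/obj.bin")))
```

A deep-scrub of the PG afterwards is what confirms whether the rewrite
actually cleared the inconsistency.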

It seemed to work, and the customer's VM hasn't noticed anything awry
yet... but then again it hadn't before this either. It seems the right
data is in place, and the PG is consistent after a deep-scrub.

Pretty standard stuff, but it might help as an alternative way of dumping
byte data in the future, as long as others don't see an issue with this. I
see at least one other person reporting the same I/O error on the bug.

--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
