Success! Hopefully my notes from the process will help others:

In the event of multiple disk failures the cluster can lose PGs. Should this occur, it is best to attempt to restart the OSD process and have the drive marked up+out. Marking the drive out causes its data to flow off to other OSDs in the cluster. If the ceph-osd process is unable to keep running, you can try using the ceph_objectstore_tool program to extract just the damaged PGs and import them onto working OSDs.
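For concreteness, the up+out dance on a Firefly/Upstart box looks roughly like the following sketch (osd.15 is just an example id here; substitute whichever OSD sits on the failing drive):

    start ceph-osd id=15      # try to get the flapping OSD running again
    ceph osd out 15           # mark it out so its data backfills elsewhere
    ceph osd tree             # check the OSD's up/down and in/out status
    ceph -w                   # watch recovery/backfill drain the drive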
Fixing Journals

In this particular scenario things were complicated by the fact that ceph_objectstore_tool came out in Giant while we were running Firefly. Not wanting to upgrade the cluster in a degraded state, we instead moved the OSD drives to a different physical machine for repair. This added a number of journal-related steps, but nothing difficult. That process looks like this.

On Storage1:

    stop ceph-osd id=15
    ceph-osd -i 15 --flush-journal
    ls -l /var/lib/ceph/osd/ceph-15/journal

Note the journal device UUID, then pull the disk and move it to Ithome:

    rm /var/lib/ceph/osd/ceph-15/journal
    ceph-osd -i 15 --mkjournal

That creates a colocated journal to use during the ceph_objectstore_tool commands. Once done:

    ceph-osd -i 15 --flush-journal
    rm /var/lib/ceph/osd/ceph-15/journal

Pull the disk and bring it back to Storage1. Then:

    ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f /var/lib/ceph/osd/ceph-15/journal
    ceph-osd -i 15 --mkjournal
    start ceph-osd id=15

None of this will be needed once the cluster is running Hammer, since ceph_objectstore_tool will then be available on the local machine and the journals can stay in place throughout the process.

Recovery Process

We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and OSD.15, the two disks that failed out of Storage1. The disk for OSD.0 appeared to be a total loss, while the disk for OSD.15 was somewhat more cooperative but not in any shape to be up and running in the cluster.

I took the dying OSD.15 drive and placed it into a new physical machine with a fresh install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to extract the PGs with a command like:

    for i in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-15 --journal /var/lib/ceph/osd/ceph-15/journal --op export --pgid $i --file ~/${i}.export ; done

Once both PGs were successfully exported I attempted to import them into a new temporary OSD, following instructions from here. For some reason that didn't work: the OSD was up+in but wasn't backfilling the PGs into the cluster. If you find yourself in this situation I would still try that first, in case it gives you a cleaner path.
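Either way, before touching any drives it is worth confirming exactly which OSDs a broken PG maps to and why it is stuck. A quick sketch using standard commands (3.c7 is one of our PG ids; nothing here is specific to this incident):

    ceph health detail | grep incomplete   # list the incomplete PGs
    ceph pg map 3.c7                       # show the up and acting OSD sets for a PG
    ceph pg 3.c7 query | less              # full peering state, including what it is blocked on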
Considering that the above didn't work, and that we were looking at the possibility of losing the RBD volume (or perhaps worse, the prospect of fruitlessly fscking 35TB), I took what I might describe as heroic measures.

Running:

    ceph pg dump | grep incomplete

showed:

    3.c7   0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.968841 0'0 15730:17 [15,0] 15 [15,0] 15 13985'54076 2015-03-31 19:14:22.721695 13985'54076 2015-03-31 19:14:22.721695
    3.102  0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.529594 0'0 15730:21 [0,15] 0  [0,15] 0  13985'53107 2015-03-29 21:17:15.568125 13985'49195 2015-03-24 18:38:08.244769

Then I stopped all OSDs, which blocked all I/O to the cluster, with:

    stop ceph-osd-all

Then I looked for all copies of the PGs on all OSDs with:

    for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "$i" ; done | sort -V

which found:

    /var/lib/ceph/osd/ceph-0/current/3.c7_head
    /var/lib/ceph/osd/ceph-0/current/3.102_head
    /var/lib/ceph/osd/ceph-3/current/3.c7_head
    /var/lib/ceph/osd/ceph-13/current/3.102_head
    /var/lib/ceph/osd/ceph-15/current/3.c7_head
    /var/lib/ceph/osd/ceph-15/current/3.102_head

Then I flushed the journals for all of those OSDs with:

    for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done

Then I pulled all of those drives and moved them (using the Fixing Journals steps above) to Ithome, where I used ceph_objectstore_tool to remove all traces of 3.102 and 3.c7:

    for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op remove --pgid $j ; done ; done

Then I imported the PGs onto OSD.0 and OSD.15 with:

    for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op import --file ~/${j}.export ; done ; done
    for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm /var/lib/ceph/osd/ceph-$i/journal ; done

Then I moved the disks back to Storage1 and started them all up again. I think that should have worked, but in this case OSD.0 didn't start up for some reason. I initially thought that wouldn't matter, since OSD.15 did start and should have given us everything, but a ceph pg query of the PGs showed something like:

    "blocked": "peering is blocked due to down osds",
    "down_osds_we_would_probe": [0],
    "peering_blocked_by": [{
        "osd": 0,
        "current_lost_at": 0,
        "comment": "starting or marking this osd lost may let us proceed"
    }]

So I then removed OSD.0 from the cluster and everything came back to life.

Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
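P.S. For anyone wondering what "removed OSD.0 from the cluster" expands to: a sketch of the standard commands (my reconstruction for completeness, not a transcript of exactly what I typed), with ceph osd lost being the alternative that the pg query comment hints at:

    # Option 1: mark the dead OSD lost so peering can proceed
    ceph osd lost 0 --yes-i-really-mean-it

    # Option 2: remove it from the cluster entirely
    ceph osd out 0
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0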