Success! Hopefully my notes from the process will help:

In the event of multiple disk failures the cluster could lose PGs. Should this 
occur it is best to attempt to restart the OSD process and have the drive 
marked as up+out. Marking the drive as out will cause data to flow off the 
drive to elsewhere in the cluster. In the event that the ceph-osd process is 
unable to keep running you could try using the ceph_objectstore_tool program to 
extract just the damaged PGs and import them into a working OSD.
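
In practice that's just a restart of the OSD and an out-marking, e.g. for 
osd.15 on an upstart system (shown for illustration, not taken from my notes):
start ceph-osd id=15    # bring the OSD process back up
ceph osd out 15         # mark it out so data migrates off the drive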

Fixing Journals
In this particular scenario things were complicated by the fact that 
ceph_objectstore_tool came out in Giant but we were running Firefly. Since we 
didn't want to upgrade the cluster in a degraded state, the OSD drives had to 
be moved to a different physical machine for repair. This added a number of 
steps related to the journals, but it wasn't a big deal. That process looks like:

On Storage1:
stop ceph-osd id=15
ceph-osd -i 15 --flush-journal
ls -l /var/lib/ceph/osd/ceph-15/journal
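
With a separate journal device that path is a symlink into 
/dev/disk/by-partuuid, so the listing looks something like this (reconstructed 
output, not the actual listing):
lrwxrwxrwx 1 root root 58 Apr  2 20:49 /var/lib/ceph/osd/ceph-15/journal -> /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f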

Note the journal device UUID, then pull the disk and move it to Ithome. On Ithome:
rm /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal

That creates a colocated journal to use during the ceph_objectstore_tool 
commands. Once done:
ceph-osd -i 15 --flush-journal
rm /var/lib/ceph/osd/ceph-15/journal

Pull the disk and bring it back to Storage1. Then:
ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal
start ceph-osd id=15
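
A quick way to confirm the OSD rejoined after that (not in my notes, just a 
sanity check):
ceph osd tree | grep -w osd.15    # osd.15 should show as up
ceph -s                           # watch recovery/backfill progress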

None of this will be needed once the cluster is running Hammer, because 
ceph_objectstore_tool will then be available on the local machine and the 
journals can stay in place throughout the process.


Recovery Process
We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and 
OSD.15, the two disks that failed out of Storage1. The disk for OSD.0 seemed 
to be a total loss, while the disk for OSD.15 was somewhat more cooperative 
but not in any shape to be up and running in the cluster. I took the dying 
OSD.15 drive and placed it into a new physical machine with a fresh install of 
Ceph Giant. Using Giant's ceph_objectstore_tool I was able to extract the PGs 
with a command like:
for i in 3.c7 3.102 ; do ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-15 --journal /var/lib/ceph/osd/ceph-15/journal --op export --pgid $i --file ~/${i}.export ; done
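
It's not in my notes, but it's worth sanity-checking that the export files 
actually contain data before touching anything else:
ls -lh ~/3.c7.export ~/3.102.export    # both should be non-trivial in size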

Once both PGs were successfully exported I attempted to import them into a new 
temporary OSD, following instructions from here. For some reason that didn't 
work: the OSD was up+in but wasn't backfilling the PGs into the cluster. If you 
find yourself in this situation I would still try that first, just in case it 
provides a cleaner path.
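
For reference, the temporary-OSD approach boils down to importing the exports 
into a freshly created, stopped OSD and letting it backfill. A rough sketch, 
where osd.20 is a made-up id for the temporary OSD and the tool invocation 
mirrors the import commands later in this post:
stop ceph-osd id=20
for j in 3.c7 3.102 ; do
    ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-20 \
        --journal /var/lib/ceph/osd/ceph-20/journal \
        --op import --file ~/${j}.export
done
start ceph-osd id=20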
Considering that the above didn't work, and that we were looking at the 
possibility of losing the RBD volume (or perhaps worse, fruitlessly fscking 
35TB), I took what I might describe as heroic measures:

Running:
ceph pg dump | grep incomplete

3.c7   0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.968841  0'0  
15730:17  [15,0]  15  [15,0]  15  13985'54076  2015-03-31  19:14:22.721695  
13985'54076  2015-03-31  19:14:22.721695
3.102  0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.529594  0'0  
15730:21  [0,15]  0   [0,15]  0   13985'53107  2015-03-29  21:17:15.568125  
13985'49195  2015-03-24  18:38:08.244769

Then I stopped all OSDs, which blocked all I/O to the cluster, with:
stop ceph-osd-all
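
One step that isn't in my notes, but is worth considering when deliberately 
taking OSDs down for this kind of surgery, is setting the noout flag so the 
cluster doesn't start remapping data while the disks are out of their hosts:
ceph osd set noout      # before stopping the OSDs
ceph osd unset noout    # once everything is back up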

Then I looked for all copies of the PGs on all OSDs with:
for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "${i}_head" ; done | sort -V

/var/lib/ceph/osd/ceph-0/current/3.c7_head
/var/lib/ceph/osd/ceph-0/current/3.102_head
/var/lib/ceph/osd/ceph-3/current/3.c7_head
/var/lib/ceph/osd/ceph-13/current/3.102_head
/var/lib/ceph/osd/ceph-15/current/3.c7_head
/var/lib/ceph/osd/ceph-15/current/3.102_head

Then I flushed the journals for all of those OSDs with:
for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done

Then I removed all of those drives and moved them (using the Fixing Journals 
process above) to Ithome, where I used ceph_objectstore_tool to remove all 
traces of 3.102 and 3.c7:
for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data 
/var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op 
remove --pgid $j ; done ; done

Then I imported the PGs onto OSD.0 and OSD.15 with:
for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data 
/var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op 
import --file ~/${j}.export ; done ; done
for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm /var/lib/ceph/osd/ceph-$i/journal ; done
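
A quick check (not something I had in my notes) that the imports landed where 
expected is just to look for the PG directories again:
ls -ld /var/lib/ceph/osd/ceph-{0,15}/current/3.{c7,102}_head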

Then I moved the disks back to Storage1 and started them all back up again. I 
think this should have worked, but what happened in this case was that OSD.0 
didn't start up for some reason. I initially thought that wouldn't matter, 
because OSD.15 did start and so we should have had everything, but a ceph pg 
query of the PGs showed something like:
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [0],
"peering_blocked_by": [{
     "osd": 0,
     "current_lost_at": 0,
     "comment": "starting or marking this osd lost may let us proceed"
}]

So I then removed OSD.0 from the cluster and everything came back to life. 
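
The exact removal commands didn't make it into my notes, but the standard 
sequence for removing a dead OSD at that point looks roughly like:
ceph osd lost 0 --yes-i-really-mean-it   # tell the cluster osd.0 is gone for good
ceph osd crush remove osd.0              # drop it from the CRUSH map
ceph auth del osd.0                      # remove its auth key
ceph osd rm 0                            # remove it from the OSD map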
Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!