Dear ceph experts,
I've built and am administrating a 12-OSD Ceph cluster (spanning 3
nodes) with a replication count of 2. The Ceph version is
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
The cluster hosts two pools (data and metadata) that are exported over
CephFS.
At some point the OSDs approached the 'full' state and one of them got
corrupted. The easiest solution was to remove the OSD, wipe it, and
re-add it. That went fine, and the cluster was recovering without
issues. At the point of only 39 degraded objects left, another OSD got
corrupted (the first one's peer, actually). I was not able to recover
it, and I made the hard decision to remove it, wipe it, and re-add it
to the cluster. Since no backups had been made, data corruption was
expected.
To my surprise, when all OSDs were back online and the cluster started
to recover, only one incomplete PG was reported. I worked around it by
SSHing to the node that holds its primary OSD, exporting the corrupted
PG with 'ceph-objectstore-tool --op export', and marking it 'complete'
afterwards. Once the cluster had recovered, I imported the PG's data
back into its primary OSD. The recovery then fully completed, and at
the moment 'ceph -s' gives me:
cluster 7972d1e9-2843-41a3-a4e7-9889d9c75850
health HEALTH_WARN
1 near full osd(s)
monmap e1: 1 mons at {000-s-ragnarok=xxx.xxx.xxx.xxx:6789/0}
election epoch 1, quorum 0 000-s-ragnarok
mdsmap e9393: 1/1/0 up {0=000-s-ragnarok=up:active}
osdmap e185363: 12 osds: 12 up, 12 in
pgmap v5599327: 1024 pgs, 2 pools, 7758 GB data, 22316 kobjects
15804 GB used, 6540 GB / 22345 GB avail
1020 active+clean
4 active+clean+scrubbing+deep
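As a quick sanity check on those numbers (my own arithmetic, not ceph
output), the cluster-wide average utilization is only about 70%, while
one OSD is already near-full, so the data distribution across the OSDs
looks quite uneven:

```shell
# Figures from the 'ceph -s' output above. This is the cluster-wide
# average; the near-full warning fires per OSD (the default
# mon_osd_nearfull_ratio is 0.85).
used_gb=15804
total_gb=22345
echo "$((used_gb * 100 / total_gb))% average utilization"
```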
However, when I brought the MDS back online, CephFS could not be
mounted anymore; the client fails with 'mount error 5 = Input/output
error'. Since the MDS was running just fine, without any suspicious
messages in its log, I decided that something had happened to its
journal and that CephFS disaster recovery was needed. I stopped the
MDS and tried to make a backup of the journal. Unfortunately, the tool
crashed with the following output:
cephfs-journal-tool journal export backup.bin
journal is 1841503004303~12076
*** buffer overflow detected ***: cephfs-journal-tool terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f175ef12a57]
/lib64/libc.so.6(+0x10bc10)[0x7f175ef10c10]
/lib64/libc.so.6(+0x10b119)[0x7f175ef10119]
/lib64/libc.so.6(_IO_vfprintf+0x2f00)[0x7f175ee4f430]
/lib64/libc.so.6(__vsprintf_chk+0x88)[0x7f175ef101a8]
/lib64/libc.so.6(__sprintf_chk+0x7d)[0x7f175ef100fd]
cephfs-journal-tool(_ZN6Dumper4dumpEPKc+0x630)[0x7f1763374720]
cephfs-journal-tool(_ZN11JournalTool14journal_exportERKSsb+0x294)[0x7f1763357874]
cephfs-journal-tool(_ZN11JournalTool12main_journalERSt6vectorIPKcSaIS2_EE+0x105)[0x7f17633580c5]
cephfs-journal-tool(_ZN11JournalTool4mainERSt6vectorIPKcSaIS2_EE+0x56e)[0x7f17633514de]
cephfs-journal-tool(main+0x1de)[0x7f1763350d4e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f175ee26af5]
cephfs-journal-tool(+0x1ccae9)[0x7f1763356ae9]
...
-3> 2015-11-17 10:43:00.874529 7f174db4b700 1 --
xxx.xxx.xxx.xxx:6802/3019233561 <== osd.9 xxx.xxx.xxx.xxx:6808/13662 1
==== osd_op_reply(4 200.0006b309 [stat] v0'0 uv0 ack = -2 ((2) No such
file or directory)) v6 ==== 179+0+0 (2303160312 0 0) 0x7f1767c719c0 con
0x7f1767d194a0
...
So I used the rados tool to export the cephfs_metadata pool, and then
proceeded with:
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph fs reset home --yes-i-really-mean-it
After this manipulation, 'cephfs-journal-tool journal export
backup.rec' worked, but wrote only 48 bytes at an offset of around 1.8 TB!
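My guess (unverified) is that the export file is written sparse, with
the journal data placed at its absolute journal offsets, which would
explain 48 bytes of real data sitting at a ~1.8 TB apparent offset. A
scaled-down illustration of such a sparse file (the filename is made
up, and the offset is reduced to 1.8 GB):

```shell
# Write 48 bytes at a 1.8 GB offset: the file's apparent size becomes
# 1800000048 bytes, but almost no blocks are actually allocated.
dd if=/dev/zero of=/tmp/sparse.bin bs=1 count=48 seek=1800000000 2>/dev/null
stat -c '%s' /tmp/sparse.bin   # apparent size
du -k /tmp/sparse.bin          # allocated blocks (tiny)
rm /tmp/sparse.bin
```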
Then I brought the MDS back online, but CephFS is still not mountable.
I tried to flush the journal with:
ceph daemon mds.000-s-ragnarok flush journal
No luck. Then I stopped the MDS and relaunched it with:
ceph-mds -i 000-s-ragnarok --journal_check 0 --debug_mds=10 --debug_ms=100
It has been persistently outputting this snippet for a couple of hours:
7faf0bd58700 7 mds.0.cache trim max=100000 cur=17
7faf0bd58700 10 mds.0.cache trim_client_leases
7faf0bd58700 2 mds.0.cache check_memory_usage total 256288, rss 19116,
heap 48056, malloc 1791 mmap 0, baseline 48056, buffers 0, 0 / 19
inodes have caps, 0 caps, 0 caps per inode
7faf0bd58700 10 mds.0.log trim 1 / 30 segments, 8 / -1 events, 0 (0)
expiring, 0 (0) expired
7faf0bd58700 10 mds.0.log _trim_expired_segments waiting for
1841488226436/1841503004303 to expire
7faf0bd58700 10 mds.0.server find_idle_sessions. laggy until 0.000000
7faf0bd58700 10 mds.0.locker scatter_tick
7faf0bd58700 10 mds.0.cache find_stale_fragment_freeze
7faf0bd58700 10 mds.0.snap check_osd_map - version unchanged
7faf0b557700 10 mds.beacon.000-s-ragnarok _send up:active seq 12
So it appears to me that, despite 'cephfs-journal-tool journal reset',
the journal was not wiped, and its corruption blocks CephFS from being
mounted.
The output of 'cephfs-journal-tool event get list' is
0x1acc221e68f SUBTREEMAP: ()
0x1acc221e9ab UPDATE: (scatter_writebehind)
stray7
0x1acc221f05e UPDATE: (scatter_writebehind)
stray8
0x1acc221f711 UPDATE: (scatter_writebehind)
stray7
0x1acc221fdc4 UPDATE: (scatter_writebehind)
stray8
0x1acc2220477 UPDATE: (scatter_writebehind)
stray9
0x1acc2220b2a UPDATE: (scatter_writebehind)
stray9
0x1acc22211dd UPDATE: (scatter_writebehind)
The output of 'cephfs-journal-tool header get' is
{
"magic": "ceph fs volume v011",
"write_pos": 1841503016379,
"expire_pos": 1841503004303,
"trimmed_pos": 1841488199680,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"cas_hash": 0,
"object_stripe_unit": 0,
"pg_pool": 2
}
}
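Doing the arithmetic on that header myself: write_pos - expire_pos
gives exactly the 12076-byte extent that cephfs-journal-tool reported
before crashing, and there is a ~14 MB stretch between trimmed_pos and
expire_pos:

```shell
# Positions taken from the 'cephfs-journal-tool header get' output above.
write_pos=1841503016379
expire_pos=1841503004303
trimmed_pos=1841488199680
echo "unexpired: $((write_pos - expire_pos)) bytes"
echo "expired but untrimmed: $((expire_pos - trimmed_pos)) bytes"
```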
The output of 'cephfs-journal-tool journal inspect' is
Overall journal integrity: OK
At the moment I am running 'cephfs-data-scan scan_extents cephfs_data'.
I guess it won't do much to bring CephFS back online, but it might fix
some corrupted metadata.
So my question is: how can I identify what really blocks CephFS from
being mounted? Is it possible to start with a fresh journal by doing
'fs remove; fs new', reusing the data pool and the rados backup of the
metadata pool?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com