Dear ceph experts,
I've built and am administrating a 12-OSD Ceph cluster (spanning 3
nodes) with a replication count of 2. The Ceph version is
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
The cluster hosts two pools (data and metadata) that are exported over
CephFS.
At some point the OSDs approached the 'full' state and one of them got
corrupted. The easiest solution was to remove the OSD, wipe it, and
re-add it. That went fine, and the cluster was recovering without
issues. At the point of only 39 degraded objects left, another OSD got
corrupted (the first one's peer, actually). I was not able to recover
it, and I made the hard decision to remove it, wipe it, and re-add it
to the cluster. Since no backups had been made, data corruption was
expected.
To my surprise, when all OSDs were back online and the cluster started
to recover, only one incomplete PG was reported. I worked around it by
SSHing to the node that holds its primary OSD, exporting the corrupted
PG with 'ceph-objectstore-tool --op export', and marking it 'complete'
afterwards. Once the cluster had recovered, I imported the PG's data
back into its primary OSD. The recovery then fully completed, and at
the moment 'ceph -s' gives me:
cluster 7972d1e9-2843-41a3-a4e7-9889d9c75850
health HEALTH_WARN
1 near full osd(s)
monmap e1: 1 mons at {000-s-ragnarok=xxx.xxx.xxx.xxx:6789/0}
election epoch 1, quorum 0 000-s-ragnarok
mdsmap e9393: 1/1/0 up {0=000-s-ragnarok=up:active}
osdmap e185363: 12 osds: 12 up, 12 in
pgmap v5599327: 1024 pgs, 2 pools, 7758 GB data, 22316 kobjects
15804 GB used, 6540 GB / 22345 GB avail
1020 active+clean
4 active+clean+scrubbing+deep
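As a quick sanity check on those numbers (my own arithmetic, not ceph
output), the cluster-wide average utilization is only about 70%, while
one OSD is already near-full, so the data distribution across the OSDs
looks quite uneven:

```shell
# Figures from the 'ceph -s' output above. This is the cluster-wide
# average; the near-full warning fires per OSD (the default
# mon_osd_nearfull_ratio is 0.85).
used_gb=15804
total_gb=22345
echo "$((used_gb * 100 / total_gb))% average utilization"
```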
However, when I brought the MDS back online, CephFS could not be
mounted anymore; the client fails with 'mount error 5 = Input/output
error'. Since the MDS was running just fine, without any suspicious
messages in its log, I decided that something had happened to its
journal and that CephFS disaster recovery was needed. I stopped the
MDS and tried to make a backup of the journal. Unfortunately, the tool
crashed with the following output:
cephfs-journal-tool journal export backup.bin
journal is 1841503004303~12076
*** buffer overflow detected ***: cephfs-journal-tool terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f175ef12a57]
/lib64/libc.so.6(+0x10bc10)[0x7f175ef10c10]
/lib64/libc.so.6(+0x10b119)[0x7f175ef10119]
/lib64/libc.so.6(_IO_vfprintf+0x2f00)[0x7f175ee4f430]
/lib64/libc.so.6(__vsprintf_chk+0x88)[0x7f175ef101a8]
/lib64/libc.so.6(__sprintf_chk+0x7d)[0x7f175ef100fd]
cephfs-journal-tool(_ZN6Dumper4dumpEPKc+0x630)[0x7f1763374720]
cephfs-journal-tool(_ZN11JournalTool14journal_exportERKSsb+0x294)[0x7f1763357874]
cephfs-journal-tool(_ZN11JournalTool12main_journalERSt6vectorIPKcSaIS2_EE+0x105)[0x7f17633580c5]
cephfs-journal-tool(_ZN11JournalTool4mainERSt6vectorIPKcSaIS2_EE+0x56e)[0x7f17633514de]
cephfs-journal-tool(main+0x1de)[0x7f1763350d4e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f175ee26af5]
cephfs-journal-tool(+0x1ccae9)[0x7f1763356ae9]
...
-3> 2015-11-17 10:43:00.874529 7f174db4b700 1 --
xxx.xxx.xxx.xxx:6802/3019233561 <== osd.9 xxx.xxx.xxx.xxx:6808/13662 1
==== osd_op_reply(4 200.0006b309 [stat] v0'0 uv0 ack = -2 ((2) No such
file or directory)) v6 ==== 179+0+0 (2303160312 0 0) 0x7f1767c719c0 con
0x7f1767d194a0
...
So I used the rados tool to export the cephfs_metadata pool, and then
proceeded with:
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph fs reset home --yes-i-really-mean-it
After this manipulation, 'cephfs-journal-tool journal export
backup.rec' worked, but wrote only 48 bytes at an offset of around 1.8 TB!
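My guess (unverified) is that the export file is written sparse, with
the journal data placed at its absolute journal offsets, which would
explain 48 bytes of real data sitting at a ~1.8 TB apparent offset. A
scaled-down illustration of such a sparse file (the filename is made
up, and the offset is reduced to 1.8 GB):

```shell
# Write 48 bytes at a 1.8 GB offset: the file's apparent size becomes
# 1800000048 bytes, but almost no blocks are actually allocated.
dd if=/dev/zero of=/tmp/sparse.bin bs=1 count=48 seek=1800000000 2>/dev/null
stat -c '%s' /tmp/sparse.bin   # apparent size
du -k /tmp/sparse.bin          # allocated blocks (tiny)
rm /tmp/sparse.bin
```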
Then I brought the MDS back online, but CephFS is still not mountable.
I tried to flush the journal with:
ceph daemon mds.000-s-ragnarok flush journal
No luck. Then I stopped the MDS and relaunched it with:
ceph-mds -i 000-s-ragnarok --journal_check 0 --debug_mds=10 --debug_ms=100
It has been persistently outputting this snippet for a couple of hours:
7faf0bd58700 7 mds.0.cache trim max=100000 cur=17
7faf0bd58700 10 mds.0.cache trim_client_leases
7faf0bd58700 2 mds.0.cache check_memory_usage total 256288, rss 19116,
heap 48056, malloc 1791 mmap 0, baseline 48056, buffers 0, 0 / 19
inodes have caps, 0 caps, 0 caps per inode
7faf0bd58700 10 mds.0.log trim 1 / 30 segments, 8 / -1 events, 0 (0)
expiring, 0 (0) expired
7faf0bd58700 10 mds.0.log _trim_expired_segments waiting for
1841488226436/1841503004303 to expire
7faf0bd58700 10 mds.0.server find_idle_sessions. laggy until 0.000000
7faf0bd58700 10 mds.0.locker scatter_tick
7faf0bd58700 10 mds.0.cache find_stale_fragment_freeze
7faf0bd58700 10 mds.0.snap check_osd_map - version unchanged
7faf0b557700 10 mds.beacon.000-s-ragnarok _send up:active seq 12
So it appears to me that, despite 'cephfs-journal-tool journal reset',
the journal was not wiped, and its corruption blocks CephFS from being
mounted.
The output of 'cephfs-journal-tool event get list' is
0x1acc221e68f SUBTREEMAP: ()
0x1acc221e9ab UPDATE: (scatter_writebehind)
stray7
0x1acc221f05e UPDATE: (scatter_writebehind)
stray8
0x1acc221f711 UPDATE: (scatter_writebehind)
stray7
0x1acc221fdc4 UPDATE: (scatter_writebehind)
stray8
0x1acc2220477 UPDATE: (scatter_writebehind)
stray9
0x1acc2220b2a UPDATE: (scatter_writebehind)
stray9
0x1acc22211dd UPDATE: (scatter_writebehind)
The output of 'cephfs-journal-tool header get' is
{
"magic": "ceph fs volume v011",
"write_pos": 1841503016379,
"expire_pos": 1841503004303,
"trimmed_pos": 1841488199680,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"cas_hash": 0,
"object_stripe_unit": 0,
"pg_pool": 2
}
}
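Doing the arithmetic on that header myself: write_pos - expire_pos
gives exactly the 12076-byte extent that cephfs-journal-tool reported
before crashing, and there is a ~14 MB stretch between trimmed_pos and
expire_pos:

```shell
# Positions taken from the 'cephfs-journal-tool header get' output above.
write_pos=1841503016379
expire_pos=1841503004303
trimmed_pos=1841488199680
echo "unexpired: $((write_pos - expire_pos)) bytes"
echo "expired but untrimmed: $((expire_pos - trimmed_pos)) bytes"
```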
The output of 'cephfs-journal-tool journal inspect' is
Overall journal integrity: OK
At the moment I am running 'cephfs-data-scan scan_extents cephfs_data'.
I guess it won't do much to bring CephFS back online, but it might fix
some corrupted metadata.
So my question is: how can I identify what really blocks CephFS from
being mounted? Is it possible to start with a fresh journal by doing
'fs remove; fs new', reusing the data pool and the rados backup of the
metadata pool?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com