Hi,

I was deleting a lot of hard-linked files when "something" happened.
Now my mds starts, runs for a few seconds, and writes a lot of these lines:

    -43> 2017-09-06 13:51:43.396588 7f9047b21700 10 log_client will send 2017-09-06 13:51:40.531563 mds.0 10.210.32.12:6802/2735447218 4963 : cluster [ERR] loaded dup inode 100007d6511 [2,head] v17234443 at ~mds0/stray8/100007d6511, but inode 100007d6511.head v17500983 already exists at ~mds0/stray7/100007d6511

And finally this:

     -3> 2017-09-06 13:51:43.396762 7f9047b21700 10 monclient: _send_mon_message to mon.2 at 10.210.34.11:6789/0
     -2> 2017-09-06 13:51:43.396770 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 --> 10.210.34.11:6789/0 -- log(1000 entries from seq 4003 at 2017-09-06 13:51:38.718139) v1 -- ?+0 0x7f905c5d5d40 con 0x7f905902c600
     -1> 2017-09-06 13:51:43.399561 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 <== mon.2 10.210.34.11:6789/0 26 ==== mdsbeacon(152160002/0 up:active seq 8 v47532) v7 ==== 126+0+0 (20071477 0 0) 0x7f90591b2080 con 0x7f905902c600
      0> 2017-09-06 13:51:43.401125 7f9043b19700 -1 *** Caught signal (Aborted) **
     in thread 7f9043b19700 thread_name:mds_rank_progr

     ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
     1: (()+0x5087b7) [0x7f904ed547b7]
     2: (()+0xf890) [0x7f904e156890]
     3: (gsignal()+0x37) [0x7f904c5e1067]
     4: (abort()+0x148) [0x7f904c5e2448]
     5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f904ee5e386]
     6: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x492) [0x7f904ebaad12]
     7: (StrayManager::__eval_stray(CDentry*, bool)+0x5f5) [0x7f904ebaefd5]
     8: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f904ebaf7ae]
     9: (MDCache::scan_stray_dir(dirfrag_t)+0x165) [0x7f904eb04145]
     10: (MDCache::populate_mydir()+0x7fc) [0x7f904eb73acc]
     11: (MDCache::open_root()+0xef) [0x7f904eb7447f]
     12: (MDSInternalContextBase::complete(int)+0x203) [0x7f904ecad5c3]
     13: (MDSRank::_advance_queues()+0x382) [0x7f904ea689e2]
     14: (MDSRank::ProgressThread::entry()+0x4a) [0x7f904ea68e6a]
     15: (()+0x8064) [0x7f904e14f064]
     16: (clone()+0x6d) [0x7f904c69462d]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    --- logging levels ---
       0/ 5 none
       0/ 1 lockdep
       0/ 1 context
       1/ 1 crush
       1/ 5 mds
       1/ 5 mds_balancer
       1/ 5 mds_locker
       1/ 5 mds_log
       1/ 5 mds_log_expire
       1/ 5 mds_migrator
       0/ 1 buffer
       0/ 1 timer
       0/ 1 filer
       0/ 1 striper
       0/ 1 objecter
       0/ 5 rados
       0/ 5 rbd
       0/ 5 rbd_mirror
       0/ 5 rbd_replay
       0/ 5 journaler
       0/ 5 objectcacher
       0/ 5 client
       0/ 5 osd
       0/ 5 optracker
       0/ 5 objclass
       1/ 3 filestore
       1/ 3 journal
       0/ 5 ms
       1/ 5 mon
       0/10 monc
       1/ 5 paxos
       0/ 5 tp
       1/ 5 auth
       1/ 5 crypto
       1/ 1 finisher
       1/ 5 heartbeatmap
       1/ 5 perfcounter
       1/ 5 rgw
       1/10 civetweb
       1/ 5 javaclient
       1/ 5 asok
       1/ 1 throttle
       0/ 0 refs
       1/ 5 xio
       1/ 5 compressor
       1/ 5 newstore
       1/ 5 bluestore
       1/ 5 bluefs
       1/ 3 bdev
       1/ 5 kstore
       4/ 5 rocksdb
       4/ 5 leveldb
       1/ 5 kinetic
       1/ 5 fuse
      99/99 (syslog threshold)
      -1/-1 (stderr threshold)
      max_recent     10000
      max_new         1000
      log_file /var/log/ceph/ceph-mds.0.log
    --- end dump of recent events ---

Looking at daemonperf, it seems the mds crashes when trying to write something:

    root@mds01:~ # /etc/init.d/ceph restart
    [ ok ] Restarting ceph (via systemctl): ceph.service.
    root@mds01:~ # ceph daemonperf mds.0
    ---objecter---
    writ read actv|
       0    0    0
       0    0    0
       0    0    0
       6   12    0
       0    0    0
       0    0    0
       0    0    0
       0    3    1
       0    1    1
       0    0    0
       0    1    0
       0    1    1
       0    1    1
       0    1    1
       0    1    1
       0    0    0
       0    1    0
       0    1    0
       0    1    1
       0    0    0
      64    0    0
    Traceback (most recent call last):
      File "/usr/bin/ceph", line 948, in <module>
        retval = main()
      File "/usr/bin/ceph", line 638, in main
        DaemonWatcher(sockpath).run(interval, count)
      File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 265, in run
        dump = json.loads(admin_socket(self.asok_path, ["perf", "dump"]))
      File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 60, in admin_socket
        raise RuntimeError('exception getting command descriptions: ' + str(e))
    RuntimeError: exception getting command descriptions: [Errno 111] Connection refused

And indeed, I am able to prevent the crash by running

    root@mds02:~ # ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly

during startup of the mds (see the sketch below my signature for how I time it).

Any advice on how to repair the filesystem? I already tried this without success:

http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

Ceph version used is Jewel 10.2.9.

Micha Krause
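P.S. In case anyone wants to reproduce the workaround: I issue force_readonly from a small loop that waits for the admin socket to reappear, started just before restarting the mds in another shell. A minimal sketch (the socket path is from my mds02 box; adjust the daemon id for yours):

    #!/bin/sh
    # run this first, then restart the mds in a second shell
    ASOK=/var/run/ceph/ceph-mds.1.asok   # daemon id from my setup

    # wait for the admin socket of the restarted mds to appear ...
    while [ ! -S "$ASOK" ]; do
        sleep 0.1
    done

    # ... then flip the mds to read-only before populate_mydir()
    # reaches the bad stray entry and trips the assert
    ceph --admin-daemon "$ASOK" force_readonly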
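P.P.S. To be concrete about what "tried this" means: what I ran was roughly the journal-truncation sequence from that page (command names as given in the Jewel docs; "backup.bin" is just the backup filename the docs use), with all mds daemons stopped:

    # keep a copy of the journal before touching anything
    cephfs-journal-tool journal export backup.bin

    # recover what dentries we can, then truncate the damaged journal
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset

    # wipe the session table so clients have to reconnect
    cephfs-table-tool all reset session

The mds still hits the same dup-inode assert in scan_stray_dir afterwards.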