Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
Hi,

> Did you edit the code before trying Luminous?

Yes, I'm still on Jewel.

> I also noticed from your original mail that it appears you're using
> multiple active metadata servers? If so, that's not stable in Jewel.
> You may have tripped on one of many bugs fixed in Luminous for that
> configuration.

No, I'm using an active/backup configuration.

Micha Krause
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
On Thu, Sep 28, 2017 at 5:16 AM, Micha Krause wrote:
> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested
> that I try to reproduce this bug in Luminous.

Did you edit the code before trying Luminous?

I also noticed from your original mail that it appears you're using multiple active metadata servers? If so, that's not stable in Jewel. You may have tripped on one of many bugs fixed in Luminous for that configuration.

--
Patrick Donnelly
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
On Thu, Sep 28, 2017 at 5:16 AM Micha Krause wrote:
> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested
> that I try to reproduce this bug in Luminous.
>
> To fix my immediate problem we discussed 2 ideas:
>
> 1. Manually edit the metadata; unfortunately I was not able to find
>    any information on how the metadata is structured :-(
>
> 2. Edit the code to set the link count to 0 if it is negative:
>
> diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
> index 9e53907..2ca1449 100644
> --- a/src/mds/StrayManager.cc
> +++ b/src/mds/StrayManager.cc
> @@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
>        logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
>      }
>
> +    if (in->inode.nlink < 0) {
> +      in->inode.nlink = 0;
> +    }
> +
>      // purge?
>      if (in->inode.nlink == 0) {
>        // past snaprealm parents imply snapped dentry remote links.
> diff --git a/src/xxHash b/src/xxHash
> --- a/src/xxHash
> +++ b/src/xxHash
> @@ -1 +1 @@
>
> I'm not sure if this works: the patched MDS no longer crashes, but I
> expected that this value:
>
> root@mds02:~ # ceph daemonperf mds.1
> -mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
> rlat inos caps|hsr hcs hcr |writ read actv|recd recy stry purg|segs evts subm|
>   0  100k   0 |  0   0   0 |   0    0    0 |   0    0 625k    0 |  30  25k    0
>
> should go down, but it stays at 625k; unfortunately I don't have
> another system to compare against.
>
> After I started the patched MDS once, I reverted back to an unpatched
> MDS, and it also stopped crashing, so I guess it did "fix" something.
>
> A question just out of curiosity: I tried to log these events with
> something like:
>
> dout(10) << "Fixed negative inode count";
>
> or
>
> derr << "Fixed negative inode count";
>
> But my compiler yelled at me for trying this.

dout and derr are big macros. You need to end the line with " << dendl;" to close it off.

> Micha Krause
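For reference, here is roughly how that hunk could look with the log statement closed off. This is only an untested sketch against the Jewel StrayManager code quoted above, and the log message text is made up:

  // Sketch only: clamp a corrupted negative link count and log it.
  // dout and derr are macros, so every statement must end with << dendl,
  // otherwise it will not compile.
  if (in->inode.nlink < 0) {
    dout(10) << "clamping negative nlink " << in->inode.nlink
             << " to 0 on " << *in << dendl;
    in->inode.nlink = 0;
  }

derr takes no level argument, but it needs the same trailing " << dendl;" to terminate the statement.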
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
Hi,

I had a chance to catch John Spray at the Ceph Day, and he suggested that I try to reproduce this bug in Luminous.

To fix my immediate problem we discussed 2 ideas:

1. Manually edit the metadata; unfortunately I was not able to find any information on how the metadata is structured :-(

2. Edit the code to set the link count to 0 if it is negative:

diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
index 9e53907..2ca1449 100644
--- a/src/mds/StrayManager.cc
+++ b/src/mds/StrayManager.cc
@@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
     logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
   }

+  if (in->inode.nlink < 0) {
+    in->inode.nlink = 0;
+  }
+
   // purge?
   if (in->inode.nlink == 0) {
     // past snaprealm parents imply snapped dentry remote links.
diff --git a/src/xxHash b/src/xxHash
--- a/src/xxHash
+++ b/src/xxHash
@@ -1 +1 @@

I'm not sure if this works: the patched MDS no longer crashes, but I expected that this value:

root@mds02:~ # ceph daemonperf mds.1
-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr hcs hcr |writ read actv|recd recy stry purg|segs evts subm|
  0  100k   0 |  0   0   0 |   0    0    0 |   0    0 625k    0 |  30  25k    0

should go down, but it stays at 625k; unfortunately I don't have another system to compare against.

After I started the patched MDS once, I reverted back to an unpatched MDS, and it also stopped crashing, so I guess it did "fix" something.

A question just out of curiosity: I tried to log these events with something like:

dout(10) << "Fixed negative inode count";

or

derr << "Fixed negative inode count";

But my compiler yelled at me for trying this.

Micha Krause
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
This looks like a serious MDS problem to me. Is anyone going to fix it?

Regards.

On Thu, Sep 14, 2017 at 19:55 Micha Krause wrote:
> Hi,
>
> looking at the code, and running with debug mds = 10, it looks like I have an inode with negative link count.
>
>     -2> 2017-09-14 13:28:39.249399 7f3919616700 10 mds.0.cache.strays eval_stray [dentry #100/stray7/17aa2f6 [2,head] auth (dversion lock) pv=0 v=23058565 inode=0x7f394b7e0730 0x7f3945a96270]
>     -1> 2017-09-14 13:28:39.249445 7f3919616700 10 mds.0.cache.strays  inode is [inode 17aa2f6 [2,head] ~mds0/stray7/17aa2f6 auth v23057120 s=4476488 nl=-1 n(v0 b4476488 1=1+0) (iversion lock) 0x7f394b7e
>
> I guess "nl" stands for number of links.
>
> The code in StrayManager.cc checks for:
>
>   if (in->inode.nlink == 0) {
>     ...
>   } else {
>     eval_remote_stray(dn, NULL);
>   }
>
>   void StrayManager::eval_remote_stray(CDentry *stray_dn, CDentry *remote_dn)
>   {
>     ...
>     assert(stray_in->inode.nlink >= 1);
>     ...
>   }
>
> So if my link count is indeed -1, ceph will die here.
>
> The question is: how can I get rid of this inode?
>
> Micha Krause
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
Hi,

looking at the code, and running with debug mds = 10, it looks like I have an inode with negative link count.

    -2> 2017-09-14 13:28:39.249399 7f3919616700 10 mds.0.cache.strays eval_stray [dentry #100/stray7/17aa2f6 [2,head] auth (dversion lock) pv=0 v=23058565 inode=0x7f394b7e0730 0x7f3945a96270]
    -1> 2017-09-14 13:28:39.249445 7f3919616700 10 mds.0.cache.strays  inode is [inode 17aa2f6 [2,head] ~mds0/stray7/17aa2f6 auth v23057120 s=4476488 nl=-1 n(v0 b4476488 1=1+0) (iversion lock) 0x7f394b7e

I guess "nl" stands for number of links.

The code in StrayManager.cc checks for:

  if (in->inode.nlink == 0) {
    ...
  } else {
    eval_remote_stray(dn, NULL);
  }

  void StrayManager::eval_remote_stray(CDentry *stray_dn, CDentry *remote_dn)
  {
    ...
    assert(stray_in->inode.nlink >= 1);
    ...
  }

So if my link count is indeed -1, ceph will die here.

The question is: how can I get rid of this inode?

Micha Krause
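To make the failure mode concrete, here is a tiny stand-alone program (not Ceph code; the names are made up) that mimics the control flow above: a link count of -1 skips the nlink == 0 purge branch and falls into the remote-stray path, whose assert then aborts the process, which matches the crash I'm seeing.

  #include <cassert>
  #include <cstdint>
  #include <iostream>

  // Toy stand-in for the inode's signed link count ("nl" in the log above).
  struct FakeInode {
    int32_t nlink;
  };

  // Mimics StrayManager::eval_remote_stray(): a remote stray is expected to
  // still have at least one link, so a negative count trips the assert.
  void eval_remote_stray(const FakeInode &in) {
    assert(in.nlink >= 1);
    std::cout << "treated as remote stray, nlink=" << in.nlink << "\n";
  }

  // Mimics the branch in StrayManager::__eval_stray().
  void eval_stray(const FakeInode &in) {
    if (in.nlink == 0) {
      std::cout << "nlink == 0: inode would be purged\n";
    } else {
      // nlink == -1 is "not zero", so it falls through to the remote path.
      eval_remote_stray(in);
    }
  }

  int main() {
    eval_stray(FakeInode{0});   // purged normally
    eval_stray(FakeInode{-1});  // aborts, just like the MDS does
    return 0;
  }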
[ceph-users] MDS crashes shortly after startup while trying to purge stray files.
Hi,

I was deleting a lot of hard-linked files when "something" happened. Now my MDS starts for a few seconds, writes a lot of these lines:

   -43> 2017-09-06 13:51:43.396588 7f9047b21700 10 log_client will send 2017-09-06 13:51:40.531563 mds.0 10.210.32.12:6802/2735447218 4963 : cluster [ERR] loaded dup inode 17d6511 [2,head] v17234443 at ~mds0/stray8/17d6511, but inode 17d6511.head v17500983 already exists at ~mds0/stray7/17d6511

And finally this:

    -3> 2017-09-06 13:51:43.396762 7f9047b21700 10 monclient: _send_mon_message to mon.2 at 10.210.34.11:6789/0
    -2> 2017-09-06 13:51:43.396770 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 --> 10.210.34.11:6789/0 -- log(1000 entries from seq 4003 at 2017-09-06 13:51:38.718139) v1 -- ?+0 0x7f905c5d5d40 con 0x7f905902c600
    -1> 2017-09-06 13:51:43.399561 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 <== mon.2 10.210.34.11:6789/0 26  mdsbeacon(152160002/0 up:active seq 8 v47532) v7  126+0+0 (20071477 0 0) 0x7f90591b2080 con 0x7f905902c600
     0> 2017-09-06 13:51:43.401125 7f9043b19700 -1 *** Caught signal (Aborted) **
 in thread 7f9043b19700 thread_name:mds_rank_progr

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x5087b7) [0x7f904ed547b7]
 2: (()+0xf890) [0x7f904e156890]
 3: (gsignal()+0x37) [0x7f904c5e1067]
 4: (abort()+0x148) [0x7f904c5e2448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f904ee5e386]
 6: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x492) [0x7f904ebaad12]
 7: (StrayManager::__eval_stray(CDentry*, bool)+0x5f5) [0x7f904ebaefd5]
 8: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f904ebaf7ae]
 9: (MDCache::scan_stray_dir(dirfrag_t)+0x165) [0x7f904eb04145]
 10: (MDCache::populate_mydir()+0x7fc) [0x7f904eb73acc]
 11: (MDCache::open_root()+0xef) [0x7f904eb7447f]
 12: (MDSInternalContextBase::complete(int)+0x203) [0x7f904ecad5c3]
 13: (MDSRank::_advance_queues()+0x382) [0x7f904ea689e2]
 14: (MDSRank::ProgressThread::entry()+0x4a) [0x7f904ea68e6a]
 15: (()+0x8064) [0x7f904e14f064]
 16: (clone()+0x6d) [0x7f904c69462d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.0.log
--- end dump of recent events ---

Looking at daemonperf, it seems the MDS crashes when trying to write something:

root@mds01:~ # /etc/init.d/ceph restart
[ ok ] Restarting ceph (via systemctl): ceph.service.
root@mds01:~ # ceph daemonperf mds.0
---objecter---
writ read actv|
000 000 000 6 120 000 000 000 031 011 000 010 011 011 011 011 000 010 010 011 000 6400
Traceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 638, in main
    DaemonWatcher(sockpath).run(interval, count)
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 265, in run
    dump = json.loads(admin_socket(self.asok_path, ["perf", "dump"]))
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 60, in admin_socket
    raise RuntimeError('exception getting command descriptions: ' + str(e))
RuntimeError: exception getting command descriptions: [Errno 111] Connection refused

And indeed, I am able to prevent the crash by running

root@mds02:~ # ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly

during startup of the MDS.

Any advice on how to repair the filesystem? I already tried this without success:
http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

The Ceph version used is Jewel 10.2.9.

Micha Krause