Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-10-04 Thread Micha Krause

Hi,


> Did you edit the code before trying Luminous?


Yes, I'm still on jewel.



> I also noticed from your original mail that it appears you're using
> multiple active metadata servers? If so, that's not stable in Jewel. You
> may have tripped on one of many bugs fixed in Luminous for that
> configuration.

No, I'm using an active/backup configuration.


Micha Krause


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-10-02 Thread Patrick Donnelly
On Thu, Sep 28, 2017 at 5:16 AM, Micha Krause  wrote:
> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested that I
> try to reproduce this bug in Luminous.

Did you edit the code before trying Luminous? I also noticed from your
original mail that it appears you're using multiple active metadata
servers? If so, that's not stable in Jewel. You may have tripped on
one of many bugs fixed in Luminous for that configuration.

-- 
Patrick Donnelly


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-28 Thread Gregory Farnum
On Thu, Sep 28, 2017 at 5:16 AM Micha Krause  wrote:

> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested that
> I try to reproduce this bug in Luminous.
>
> To fix my immediate problem we discussed 2 ideas:
>
> 1. Manually edit the metadata; unfortunately, I was not able to find any
> information on how the metadata is structured :-(
>
> 2. Edit the code to set the link count to 0 if it is negative:
>
>
> diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
> index 9e53907..2ca1449 100644
> --- a/src/mds/StrayManager.cc
> +++ b/src/mds/StrayManager.cc
> @@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
>   logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
> }
>
> +  if (in->inode.nlink < 0) {
> +    in->inode.nlink = 0;
> +  }
> +
> // purge?
> if (in->inode.nlink == 0) {
>   // past snaprealm parents imply snapped dentry remote links.
> diff --git a/src/xxHash b/src/xxHash
> --- a/src/xxHash
> +++ b/src/xxHash
> @@ -1 +1 @@
>
>
> I'm not sure if this works; the patched mds no longer crashes. However, I
> expected that this value:
>
> root@mds02:~ # ceph daemonperf mds.1
> -mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
> rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
>    0  100k    0 |   0    0    0 |   0    0    0 |   0    0  625k    0 |  30   25k    0
> 
>
> Should go down, but it stays at 625k. Unfortunately, I don't have another
> system to compare with.
>
> After I started the patched mds once, I reverted back to an unpatched mds,
> and it also stopped crashing, so I guess it did "fix" something.
>
>
> A question, just out of curiosity: I tried to log these events with
> something like:
>
>   dout(10) << "Fixed negative inode count";
>
> or
>
>   derr << "Fixed negative inode count";
>
> But my compiler yelled at me for trying this.
>

dout and derr are big macros. You need to end the line with " << dendl;" to
close it off.
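
For example (just a sketch, reusing the clamp hunk from the patch quoted above; the dout()/derr macros and dendl are already in scope inside StrayManager.cc), the logging line would look like:

  if (in->inode.nlink < 0) {
    dout(10) << "Fixed negative inode count" << dendl;
    in->inode.nlink = 0;
  }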


>
> Micha Krause


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-28 Thread Micha Krause

Hi,

I had a chance to catch John Spray at the Ceph Day, and he suggested that I try 
to reproduce this bug in Luminous.

To fix my immediate problem we discussed 2 ideas:

1. Manually edit the metadata; unfortunately, I was not able to find any
information on how the metadata is structured :-(

2. Edit the code to set the link count to 0 if it is negative:


diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
index 9e53907..2ca1449 100644
--- a/src/mds/StrayManager.cc
+++ b/src/mds/StrayManager.cc
@@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
 logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
   }

+  if (in->inode.nlink < 0) {
+    in->inode.nlink = 0;
+  }
+
   // purge?
   if (in->inode.nlink == 0) {
 // past snaprealm parents imply snapped dentry remote links.
diff --git a/src/xxHash b/src/xxHash
--- a/src/xxHash
+++ b/src/xxHash
@@ -1 +1 @@


I'm not sure if this works; the patched mds no longer crashes. However, I
expected that this value:

root@mds02:~ # ceph daemonperf mds.1
-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
   0  100k    0 |   0    0    0 |   0    0    0 |   0    0  625k    0 |  30   25k    0

Should go down, but it stays at 625k. Unfortunately, I don't have another system
to compare with.

After I started the patched mds once, I reverted back to an unpatched mds, and it also 
stopped crashing, so I guess it did "fix" something.


A question, just out of curiosity: I tried to log these events with something
like:

 dout(10) << "Fixed negative inode count";

or

 derr << "Fixed negative inode count";

But my compiler yelled at me for trying this.


Micha Krause


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-14 Thread Meyers Mark
A serious problem in the mds, I think.
Is anyone working on a fix?

Regards.

On Thu, Sep 14, 2017 at 19:55 Micha Krause  wrote:

> Hi,
>
> Looking at the code and running with debug mds = 10, it looks like I have
> an inode with a negative link count.
>
>  -2> 2017-09-14 13:28:39.249399 7f3919616700 10 mds.0.cache.strays
> eval_stray [dentry #100/stray7/17aa2f6 [2,head] auth (dversion lock)
> pv=0 v=23058565 inode=0x7f394b7e0730 0x7f3945a96270]
>  -1> 2017-09-14 13:28:39.249445 7f3919616700 10 mds.0.cache.strays
> inode is [inode 17aa2f6 [2,head] ~mds0/stray7/17aa2f6 auth
> v23057120 s=4476488 nl=-1 n(v0 b4476488 1=1+0) (iversion lock) 0x7f394b7e
>
> I guess "nl" stands for number of links.
>
> The code in StrayManager.cc checks for:
>
> if (in->inode.nlink == 0) { ... }
> else {
>   eval_remote_stray(dn, NULL);
> }
>
> void StrayManager::eval_remote_stray(CDentry *stray_dn, CDentry *remote_dn)
> {
>   ...
>   assert(stray_in->inode.nlink >= 1);
>   ...
> }
>
> So if my link count is indeed -1, Ceph will die here.
>
>
> The question is: how can I get rid of this inode?
>
>
> Micha Krause


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-14 Thread Micha Krause

Hi,

Looking at the code and running with debug mds = 10, it looks like I have an
inode with a negative link count.

-2> 2017-09-14 13:28:39.249399 7f3919616700 10 mds.0.cache.strays 
eval_stray [dentry #100/stray7/17aa2f6 [2,head] auth (dversion lock) pv=0 
v=23058565 inode=0x7f394b7e0730 0x7f3945a96270]
-1> 2017-09-14 13:28:39.249445 7f3919616700 10 mds.0.cache.strays  inode is 
[inode 17aa2f6 [2,head] ~mds0/stray7/17aa2f6 auth v23057120 s=4476488 
nl=-1 n(v0 b4476488 1=1+0) (iversion lock) 0x7f394b7e

I guess "nl" stands for number of links.

The code in StrayManager.cc checks for:

if (in->inode.nlink == 0) { ... }
else {
  eval_remote_stray(dn, NULL);
}

void StrayManager::eval_remote_stray(CDentry *stray_dn, CDentry *remote_dn)
{
  ...
  assert(stray_in->inode.nlink >= 1);
  ...
}

So if my link count is indeed -1, Ceph will die here.
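
To make the failure mode concrete, here is a minimal standalone sketch (not actual Ceph code; FakeInode and the hard-coded -1 are stand-ins mirroring the nl=-1 inode from the log above) of how a negative link count skips the purge branch and then trips the assert:

#include <cassert>

// Stand-in for the inode's link count field (nl=-1 in the log above).
struct FakeInode { int nlink; };

// Mirrors the precondition asserted in StrayManager::eval_remote_stray().
void eval_remote_stray(const FakeInode &in) {
  assert(in.nlink >= 1);
}

int main() {
  FakeInode in{-1};          // the corrupt stray inode
  if (in.nlink == 0) {
    // nlink == 0: the inode would be purged
  } else {
    eval_remote_stray(in);   // nlink == -1 falls through here and aborts
  }
}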


The question is: how can I get rid of this inode?


Micha Krause


[ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-06 Thread Micha Krause

Hi,

I was deleting a lot of hard-linked files when "something" happened.

Now my mds starts for a few seconds and writes a lot of these lines:

   -43> 2017-09-06 13:51:43.396588 7f9047b21700 10 log_client  will send 2017-09-06 13:51:40.531563 mds.0 10.210.32.12:6802/2735447218 4963 : cluster [ERR] loaded dup inode 17d6511 [2,head] v17234443 at ~mds0/stray8/17d6511, but inode 17d6511.head v17500983 already exists at ~mds0/stray7/17d6511


And finally this:


    -3> 2017-09-06 13:51:43.396762 7f9047b21700 10 monclient: _send_mon_message to mon.2 at 10.210.34.11:6789/0
    -2> 2017-09-06 13:51:43.396770 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 --> 10.210.34.11:6789/0 -- log(1000 entries from seq 4003 at 2017-09-06 13:51:38.718139) v1 -- ?+0 0x7f905c5d5d40 con 0x7f905902c600
    -1> 2017-09-06 13:51:43.399561 7f9047b21700  1 -- 10.210.32.12:6802/2735447218 <== mon.2 10.210.34.11:6789/0 26  mdsbeacon(152160002/0 up:active seq 8 v47532) v7  126+0+0 (20071477 0 0) 0x7f90591b2080 con 0x7f905902c600
     0> 2017-09-06 13:51:43.401125 7f9043b19700 -1 *** Caught signal (Aborted) **
 in thread 7f9043b19700 thread_name:mds_rank_progr

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x5087b7) [0x7f904ed547b7]
 2: (()+0xf890) [0x7f904e156890]
 3: (gsignal()+0x37) [0x7f904c5e1067]
 4: (abort()+0x148) [0x7f904c5e2448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f904ee5e386]
 6: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x492) [0x7f904ebaad12]
 7: (StrayManager::__eval_stray(CDentry*, bool)+0x5f5) [0x7f904ebaefd5]
 8: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f904ebaf7ae]
 9: (MDCache::scan_stray_dir(dirfrag_t)+0x165) [0x7f904eb04145]
 10: (MDCache::populate_mydir()+0x7fc) [0x7f904eb73acc]
 11: (MDCache::open_root()+0xef) [0x7f904eb7447f]
 12: (MDSInternalContextBase::complete(int)+0x203) [0x7f904ecad5c3]
 13: (MDSRank::_advance_queues()+0x382) [0x7f904ea689e2]
 14: (MDSRank::ProgressThread::entry()+0x4a) [0x7f904ea68e6a]
 15: (()+0x8064) [0x7f904e14f064]
 16: (clone()+0x6d) [0x7f904c69462d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.0.log
--- end dump of recent events ---

Looking at daemonperf, it seems the mds crashes when trying to write something:

root@mds01:~ # /etc/init.d/ceph restart
[ ok ] Restarting ceph (via systemctl): ceph.service.

root@mds01:~ # ceph daemonperf mds.0
---objecter---
writ read actv|
   0    0    0
   0    0    0
   0    0    0
   6   12    0
   0    0    0
   0    0    0
   0    0    0
   0    3    1
   0    1    1
   0    0    0
   0    1    0
   0    1    1
   0    1    1
   0    1    1
   0    1    1
   0    0    0
   0    1    0
   0    1    0
   0    1    1
   0    0    0
  64    0    0
Traceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 638, in main
    DaemonWatcher(sockpath).run(interval, count)
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 265, in run
    dump = json.loads(admin_socket(self.asok_path, ["perf", "dump"]))
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 60, in admin_socket
    raise RuntimeError('exception getting command descriptions: ' + str(e))
RuntimeError: exception getting command descriptions: [Errno 111] Connection refused


And indeed, I am able to prevent the crash by running:

root@mds02:~ # ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly

during startup of the mds.

Any advice on how to repair the filesystem?

I already tried this without success:

http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

The Ceph version used is Jewel 10.2.9.


Micha Krause