Re: [ceph-users] Ceph MDS laggy

2019-03-30 Thread Mark Schouten
On Mon, Mar 25, 2019 at 07:13:20PM +0800, Yan, Zheng wrote: > Yes. the fix is in 12.2.11 Great, thanks. -- Mark Schouten | Tuxis Internet Engineering KvK: 61527076 | http://www.tuxi

Re: [ceph-users] Ceph MDS laggy

2019-03-25 Thread Mark Schouten
On Mon, Mar 25, 2019 at 07:13:20PM +0800, Yan, Zheng wrote: > Yes. the fix is in 12.2.11 Great, thanks. -- Mark Schouten | Tuxis Internet Engineering KvK: 61527076 | http://www.tuxis.nl/ T: 0318 200208 | i...@tuxis.nl ___ ceph-users mailing list ceph
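
Whether a given daemon already carries the fix comes down to comparing its version with the release named in the reply above. The following is a minimal sketch of that comparison, not an official Ceph tool: the function names `parse_version` and `has_mds_fix` are hypothetical, and it assumes plain dotted version strings within the 12.2.x series, which is the only series this thread discusses.

```python
# Release that the reply above names as containing the fix.
FIX_VERSION = (12, 2, 11)


def parse_version(text):
    """Turn a dotted version such as '12.2.10' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_mds_fix(version):
    """Return True if a 12.2.x version is at or above 12.2.11.

    Assumption: only the 12.2.x series is handled, because the thread
    says nothing about other series.
    """
    parsed = parse_version(version)
    if parsed[:2] != (12, 2):
        raise ValueError("only the 12.2.x series is covered by this sketch")
    return parsed >= FIX_VERSION


if __name__ == "__main__":
    # Versions mentioned elsewhere in this thread.
    for candidate in ("12.2.7", "12.2.8", "12.2.10", "12.2.11"):
        print(candidate, has_mds_fix(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, "12.2.10" would sort before "12.2.7".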

Re: [ceph-users] Ceph MDS laggy

2019-03-25 Thread Yan, Zheng
On Mon, Mar 25, 2019 at 6:36 PM Mark Schouten wrote: > > On Mon, Jan 21, 2019 at 10:17:31AM +0800, Yan, Zheng wrote: > > It's http://tracker.ceph.com/issues/37977. Thanks for your help. > > > > I think I've hit this bug. Ceph MDS using 100% CPU and reporting as > laggy and being kicked out. I'm n

Re: [ceph-users] Ceph MDS laggy

2019-03-25 Thread Mark Schouten
On Mon, Jan 21, 2019 at 10:17:31AM +0800, Yan, Zheng wrote: > It's http://tracker.ceph.com/issues/37977. Thanks for your help. > I think I've hit this bug. The Ceph MDS is using 100% CPU, reporting as laggy, and being kicked out. I'm not sure, though, whether this fix is currently in a released version of L

Re: [ceph-users] Ceph MDS laggy

2019-01-20 Thread Yan, Zheng
It's http://tracker.ceph.com/issues/37977. Thanks for your help. Regards Yan, Zheng On Sun, Jan 20, 2019 at 12:40 AM Adam Tygart wrote: > > It worked for about a week, and then seems to have locked up again. > > Here is the back trace from the threads on the mds: > http://people.cs.ksu.edu/~moze

Re: [ceph-users] Ceph MDS laggy

2019-01-20 Thread Paul Emmerich
I've heard of the same(?) problem on another cluster; they upgraded from 12.2.7 to 12.2.10 and suddenly got problems with their CephFS (and only with the CephFS). However, they downgraded the MDS to 12.2.8 before I could take a look at it, so not sure what caused the issue. 12.2.8 works fine with t

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
The same user's jobs seem to be the instigator of this issue again. I've looked through their code and see nothing too onerous. This time it was 2400+ cores/jobs on 186 nodes all working in the same directory. Each job reads in a different 110KB file, crunches numbers for a while (1+ hours) and then

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the 27th of December. -- Adam On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart wrote: > > Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't > happen before then. > > -- > Adam > > On Sat, Jan 19, 2019, 20:17 Pa

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't happen before then. -- Adam On Sat, Jan 19, 2019, 20:17 Paul Emmerich <paul.emmer...@croit.io> wrote: Did this only start to happen after upgrading to 12.2.10? Paul -- Paul Emmerich Looking for help with your

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Paul Emmerich
Did this only start to happen after upgrading to 12.2.10? Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart wrote: > > It wor

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
It worked for about a week, and then seems to have locked up again. Here is the back trace from the threads on the mds: http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt -- Adam On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng wrote: > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart wrote

Re: [ceph-users] Ceph MDS laggy

2019-01-13 Thread Yan, Zheng
On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart wrote: > > Restarting the nodes causes the hanging again. This means that this is > workload dependent and not a transient state. > > I believe I've tracked down what is happening. One user was running > 1500-2000 jobs in a single directory with 92000+ f

Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
Restarting the nodes causes the hanging again. This means that this is workload dependent and not a transient state. I believe I've tracked down what is happening. One user was running 1500-2000 jobs in a single directory with 92000+ files in it. I am wondering if the cluster was getting ready to

Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
On a hunch, I shut down the compute nodes for our HPC cluster, and 10 minutes after that restarted the mds daemon. It replayed the journal, evicted the dead compute nodes and is working again. This leads me to believe there was a broken transaction of some kind coming from the compute nodes (also a

[ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
Hello all, I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6. We're using cephfs and rbd. Last night, one of our two active/active mds servers went laggy and upon restart once it goes active it immediately goes laggy again. I've got a log available here (debug_mds 20, debug

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-30 Thread Yan, Zheng
-- > From: Yan, Zheng [mailto:uker...@gmail.com] > Sent: Tuesday, April 29, 2014 10:13 PM > To: Mohd Bazli Ab Karim > Cc: Luke Jing Yuan; Wong Ming Tat > Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay > mds/journal.cc > > On Tue, Apr 29, 2014 at

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-30 Thread Mohd Bazli Ab Karim
that the mds has passed the beacon to mon or not? Thank you so much Zheng! Bazli -Original Message- From: Yan, Zheng [mailto:uker...@gmail.com] Sent: Tuesday, April 29, 2014 10:13 PM To: Mohd Bazli Ab Karim Cc: Luke Jing Yuan; Wong Ming Tat Subject: Re: [ceph-users] Ceph mds laggy and

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Luke Jing Yuan
: Tuesday, 29 April, 2014 3:36 PM To: Jingyuan Luke Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke wrote: > Hi, > > Assuming we got MDS wor

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke wrote: > Hi, > > Assuming we got MDS working back on track, should we still leave the > mds_wipe_sessions in the ceph.conf or remove it and restart MDS. > Thanks. No. It has been several hours. Has the MDS still not finished replaying the journal? R

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Jingyuan Luke
Hi, Assuming we get the MDS back on track, should we leave mds_wipe_sessions in ceph.conf, or remove it and restart the MDS? Thanks. Regards, Luke On Tue, Apr 29, 2014 at 2:12 PM, Yan, Zheng wrote: > On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke wrote: >> Hi, >> >> We had appli

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-28 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke wrote: > Hi, > > We had applied the patch and recompile ceph as well as updated the > ceph.conf as per suggested, when we re-run ceph-mds we noticed the > following: > > > 2014-04-29 10:45:22.260798 7f90b971d700 0 log [WRN] : replayed op > client.3

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-28 Thread Jingyuan Luke
Hi, We applied the patch, recompiled Ceph, and updated ceph.conf as suggested. When we re-ran ceph-mds, we noticed the following: 2014-04-29 10:45:22.260798 7f90b971d700 0 log [WRN] : replayed op client.324186:51366457,12681393 no session for client.324186 2014-04-29 10:45:2

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-26 Thread Yan, Zheng
On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke wrote: > Hi Greg, > > Actually our cluster is pretty empty, but we suspect we had a temporary > network disconnection to one of our OSD, not sure if this caused the > problem. > > Anyway we don't mind try the method you mentioned, how can we do that?

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-26 Thread Jingyuan Luke
Hi Greg, Actually, our cluster is pretty empty, but we suspect we had a temporary network disconnection to one of our OSDs; we are not sure whether this caused the problem. Anyway, we don't mind trying the method you mentioned. How can we do that? Regards, Luke On Saturday, April 26, 2014, Gregory Farnum wrote:

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-25 Thread Luke Jing Yuan
...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did your cluster get full at some point? In any case, we're working on tools to repair this no

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-25 Thread Gregory Farnum
Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did your cluster get full at some point? In any case, we're working on tools to repair this now but they aren't ready for use yet. Probably the only thing you could do is create an empty sessionmap with a higher version than t

[ceph-users] Ceph mds laggy and failed to assert session in function mds/journal.cc line 1303

2014-04-25 Thread Bazli Karim
Dear Ceph-devel, ceph-users, I am currently facing an issue with my Ceph MDS server. The ceph-mds daemon does not come back up. I tried running it manually with ceph-mds -i mon01 -d, but it aborted, and the log shows that it gets stuck at failed assert(session), line 1303 in mds/journal.cc.

[ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-24 Thread Mohd Bazli Ab Karim
Dear Ceph-devel, ceph-users, I am currently facing an issue with my Ceph MDS server. The ceph-mds daemon does not come back up. I tried running it manually with ceph-mds -i mon01 -d, but it gets stuck at failed assert(session), line 1303 in mds/journal.cc, and aborts. Can someone shed