Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-30 Thread Mohd Bazli Ab Karim
Hi Zheng,

Sorry for the late reply. For sure, I will try this again after we have 
completely verified all content in the file system. Hopefully all will be good.
And please confirm: I will set debug_mds=10 for the ceph-mds; do you want me to 
send the ceph-mon log too?

BTW, how can I confirm whether the mds has sent its beacon to the mon or not?

Thank you so much Zheng!

Bazli

-----Original Message-----
From: Yan, Zheng [mailto:uker...@gmail.com]
Sent: Tuesday, April 29, 2014 10:13 PM
To: Mohd Bazli Ab Karim
Cc: Luke Jing Yuan; Wong Ming Tat
Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay 
mds/journal.cc

On Tue, Apr 29, 2014 at 5:30 PM, Mohd Bazli Ab Karim bazli.abka...@mimos.my 
wrote:
 Hi Zheng,

 Another issue that Luke mentioned just now was like this.
 At first, we ran one mds (mon01) with the newly compiled ceph-mds. It worked 
 fine with only one MDS running at that time. However, when we ran two more 
 MDSes, mon02 and mon03, with the newly compiled ceph-mds, it started acting weird.
 mon01, which became active at first, would hit the error and start 
 respawning. Once respawning happened, mon03 would take over from mon01 as the 
 master mds, and replay happened again.
 Again, when mon03 became active, it would hit the same error as below and 
 respawn again. So it seems to me that replay will continue to happen from 
 one mds to another as they get respawned.

 2014-04-29 15:36:24.917798 7f5c36476700  1 mds.0.server
 reconnect_clients -- 1 sessions
 2014-04-29 15:36:24.919620 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401
 >> 10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=0 pgs=0
 cs=0 l=0 c=0x37056e0).accept peer addr is really
 10.1.64.181:0/1558263174 (socket is 10.1.64.181:57649/0)
 2014-04-29 15:36:24.921661 7f5c36476700  0 log [DBG] : reconnect by
 client.884169 10.1.64.181:0/1558263174 after 0.003774
 2014-04-29 15:36:24.921786 7f5c36476700  1 mds.0.12858 reconnect_done
 2014-04-29 15:36:25.109391 7f5c36476700  1 mds.0.12858 handle_mds_map
 i am now mds.0.12858
 2014-04-29 15:36:25.109413 7f5c36476700  1 mds.0.12858 handle_mds_map
 state change up:reconnect --> up:rejoin
 2014-04-29 15:36:25.109417 7f5c36476700  1 mds.0.12858 rejoin_start
 2014-04-29 15:36:26.918067 7f5c36476700  1 mds.0.12858
 rejoin_joint_start
 2014-04-29 15:36:33.520985 7f5c36476700  1 mds.0.12858 rejoin_done
 2014-04-29 15:36:36.252925 7f5c36476700  1 mds.0.12858 handle_mds_map
 i am now mds.0.12858
 2014-04-29 15:36:36.252927 7f5c36476700  1 mds.0.12858 handle_mds_map
 state change up:rejoin --> up:active
 2014-04-29 15:36:36.252932 7f5c36476700  1 mds.0.12858 recovery_done -- 
 successful recovery!
 2014-04-29 15:36:36.745833 7f5c36476700  1 mds.0.12858 active_start
 2014-04-29 15:36:36.987854 7f5c36476700  1 mds.0.12858 cluster recovered.
 2014-04-29 15:36:40.182604 7f5c36476700  0 mds.0.12858
 handle_mds_beacon no longer laggy
 2014-04-29 15:36:57.947441 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401
 >> 10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=2 pgs=156
 cs=1 l=0 c=0x37056e0).fault with nothing to send, going to standby
 2014-04-29 15:37:10.534593 7f5c36476700  1 mds.-1.-1 handle_mds_map i
 (10.4.118.23:6800/26401) dne in the mdsmap, respawning myself
 2014-04-29 15:37:10.534604 7f5c36476700  1 mds.-1.-1 respawn
 2014-04-29 15:37:10.534609 7f5c36476700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
 2014-04-29 15:37:10.534612 7f5c36476700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
 2014-04-29 15:37:10.534616 7f5c36476700  1 mds.-1.-1  1: '--cluster=ceph'
 2014-04-29 15:37:10.534619 7f5c36476700  1 mds.-1.-1  2: '-i'
 2014-04-29 15:37:10.534621 7f5c36476700  1 mds.-1.-1  3: 'mon03'
 2014-04-29 15:37:10.534623 7f5c36476700  1 mds.-1.-1  4: '-f'
 2014-04-29 15:37:10.534641 7f5c36476700  1 mds.-1.-1  cwd /
 2014-04-29 15:37:12.155458 7f8907c8b780  0 ceph version  (), process
 ceph-mds, pid 26401
 2014-04-29 15:37:12.249780 7f8902d10700  1 mds.-1.0 handle_mds_map
 standby

 P/S: we ran ceph-mon and ceph-mds on the same servers (mon01, mon02, mon03).

 I sent you two log files, mon01 and mon03, where mon03 went through the states 
 standby -> replay -> active -> respawned. And also mon01, which is now 
 running as the single active MDS at this moment.


After the MDS became active, it did not send beacons to the monitor. It seems 
like the MDS was busy doing something else. If this issue still happens, set 
debug_mds=10 and send the log to me.
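
A minimal sketch of doing that (assuming the stock /etc/ceph/ceph.conf layout
and an mds id of mon01; adjust names and paths to your deployment):

  # persistent: add under the [mds] section of ceph.conf, then restart ceph-mds
  #     debug mds = 10
  # or inject at runtime without restarting the daemon:
  ceph mds tell mon01 injectargs '--debug_mds 10'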

Regards
Yan, Zheng

 Regards,
 Bazli
-----Original Message-----
 From: Luke Jing Yuan
 Sent: Tuesday, April 29, 2014 4:46 PM
 To: Yan, Zheng
 Cc: Mohd Bazli Ab Karim; Wong Ming Tat
 Subject: RE: [ceph-users] Ceph mds laggy and failed assert in function
 replay mds/journal.cc

 Hi Zheng,

Thanks for the information. Actually we encountered another issue. In our 
original setup, we have 3 MDSes running (say mon01, mon02 and mon03); when we 
did the replay/recovery, we did it on mon01. After it completed, we restarted 
the mds again on mon02 and mon03 (without

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-30 Thread Yan, Zheng
On Wed, Apr 30, 2014 at 3:07 PM, Mohd Bazli Ab Karim
bazli.abka...@mimos.my wrote:
 Hi Zheng,

 Sorry for the late reply. For sure, I will try this again after we have 
 completely verified all content in the file system. Hopefully all will be good.
 And please confirm: I will set debug_mds=10 for the ceph-mds; do you want me 
 to send the ceph-mon log too?

yes please.


 BTW, how can I confirm whether the mds has sent its beacon to the mon or not?

Read the monitor's log.
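
For example (a sketch, assuming default log paths and that the monitor logs
message traffic, i.e. debug ms = 1; mdsbeacon entries from a healthy MDS
should appear periodically):

  grep mdsbeacon /var/log/ceph/ceph-mon.mon01.log | tail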

Regards
Yan, Zheng



Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

 We applied the patch and recompiled ceph, and updated the
 ceph.conf as suggested. When we re-ran ceph-mds we noticed the
 following:


 2014-04-29 10:45:22.260798 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366457,12681393 no session for client.324186
 2014-04-29 10:45:22.262419 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366475,12681393 no session for client.324186
 2014-04-29 10:45:22.267699 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:5135,12681393 no session for client.324186
 2014-04-29 10:45:22.271664 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366724,12681393 no session for client.324186
 2014-04-29 10:45:22.281050 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366945,12681393 no session for client.324186
 2014-04-29 10:45:22.283196 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366996,12681393 no session for client.324186
 2014-04-29 10:45:22.287801 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367043,12681393 no session for client.324186
 2014-04-29 10:45:22.289967 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367082,12681393 no session for client.324186
 2014-04-29 10:45:22.291026 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367110,12681393 no session for client.324186
 2014-04-29 10:45:22.294459 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367192,12681393 no session for client.324186
 2014-04-29 10:45:22.297228 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367257,12681393 no session for client.324186
 2014-04-29 10:45:22.297477 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367264,12681393 no session for client.324186

 tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @  0x7f90c2564da7
 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a
 0x7f90c0cca3fd
 tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @
 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed
 0x7f90c231de9a 0x7f90c0cca3fd

 ceph -s shows that the MDS is up:replay.

 Also the messages above seemed to be repeating again after a while but
 with a different session number. Is there a way for us to determine
 that we are on the right track? Thanks.


It's on the right track as long as the MDS doesn't crash.
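
One way to check for forward progress during replay (a sketch, assuming the
default admin socket path and an mds id of mon01; the journal positions under
mds_log should keep advancing between samples):

  ceph --admin-daemon /var/run/ceph/ceph-mds.mon01.asok perf dump \
      | python -m json.tool | grep -E 'rdpos|expos|wrpos'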


Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Jingyuan Luke
Hi,

Assuming we get the MDS working back on track, should we still leave
mds_wipe_sessions in the ceph.conf, or remove it and restart the MDS?
Thanks.

Regards,
Luke



Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

 Assuming we get the MDS working back on track, should we still leave
 mds_wipe_sessions in the ceph.conf, or remove it and restart the MDS?
 Thanks.

No.

It has been several hours. Has the MDS still not finished replaying the journal?

Regards
Yan, Zheng



Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Luke Jing Yuan
Hi,

The MDS did finish the replay and has been working since, but we are wondering 
whether we should leave mds_wipe_sessions in ceph.conf or remove it.

Regards,
Luke

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Yan, Zheng
Sent: Tuesday, 29 April, 2014 3:36 PM
To: Jingyuan Luke
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay 
mds/journal.cc

On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

 Assuming we get the MDS working back on track, should we still leave
 mds_wipe_sessions in the ceph.conf, or remove it and restart the MDS?
 Thanks.

No.

It has been several hours. Has the MDS still not finished replaying the journal?

Regards
Yan, Zheng






Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-28 Thread Jingyuan Luke
Hi,

We applied the patch and recompiled ceph, and updated the
ceph.conf as suggested. When we re-ran ceph-mds we noticed the
following:


2014-04-29 10:45:22.260798 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51366457,12681393 no session for client.324186
2014-04-29 10:45:22.262419 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51366475,12681393 no session for client.324186
2014-04-29 10:45:22.267699 7f90b971d700  0 log [WRN] :  replayed op
client.324186:5135,12681393 no session for client.324186
2014-04-29 10:45:22.271664 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51366724,12681393 no session for client.324186
2014-04-29 10:45:22.281050 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51366945,12681393 no session for client.324186
2014-04-29 10:45:22.283196 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51366996,12681393 no session for client.324186
2014-04-29 10:45:22.287801 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367043,12681393 no session for client.324186
2014-04-29 10:45:22.289967 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367082,12681393 no session for client.324186
2014-04-29 10:45:22.291026 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367110,12681393 no session for client.324186
2014-04-29 10:45:22.294459 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367192,12681393 no session for client.324186
2014-04-29 10:45:22.297228 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367257,12681393 no session for client.324186
2014-04-29 10:45:22.297477 7f90b971d700  0 log [WRN] :  replayed op
client.324186:51367264,12681393 no session for client.324186

tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @  0x7f90c2564da7
0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a
0x7f90c0cca3fd
tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @
0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed
0x7f90c231de9a 0x7f90c0cca3fd

ceph -s shows that the MDS is up:replay.

Also the messages above seemed to be repeating again after a while but
with a different session number. Is there a way for us to determine
that we are on the right track? Thanks.

Regards,
Luke



Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-26 Thread Jingyuan Luke
Hi Greg,

Actually our cluster is pretty empty, but we suspect we had a temporary
network disconnection to one of our OSDs; we are not sure if this caused the
problem.

Anyway, we don't mind trying the method you mentioned; how can we do that?

Regards,
Luke

On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote:

 Hmm, it looks like your on-disk SessionMap is horrendously out of
 date. Did your cluster get full at some point?

 In any case, we're working on tools to repair this now but they aren't
 ready for use yet. Probably the only thing you could do is create an
 empty sessionmap with a higher version than the ones the journal
 refers to, but that might have other fallout effects...
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com




Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-26 Thread Yan, Zheng
On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi Greg,

 Actually our cluster is pretty empty, but we suspect we had a temporary
 network disconnection to one of our OSDs; we are not sure if this caused the
 problem.

 Anyway, we don't mind trying the method you mentioned; how can we do that?


Compile ceph-mds with the attached patch, and add a line
"mds wipe_sessions = 1" to the ceph.conf.
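
For reference, the resulting ceph.conf fragment might look like this (a sketch;
the [mds] section placement is an assumption, and per the later replies in this
thread the line should be removed again once recovery is done):

  [mds]
      mds wipe_sessions = 1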

Yan, Zheng


Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-25 Thread Gregory Farnum
Hmm, it looks like your on-disk SessionMap is horrendously out of
date. Did your cluster get full at some point?

In any case, we're working on tools to repair this now but they aren't
ready for use yet. Probably the only thing you could do is create an
empty sessionmap with a higher version than the ones the journal
refers to, but that might have other fallout effects...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
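
As a starting point for inspecting what is actually on disk (a sketch, assuming
the default metadata pool name and the single-rank object naming, where rank
0's session table lives in the object mds0_sessionmap):

  # show the object's size and mtime in the metadata pool
  rados -p metadata stat mds0_sessionmap
  # fetch a copy for offline inspection before attempting any modification
  rados -p metadata get mds0_sessionmap /tmp/mds0_sessionmap.bin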


On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim
bazli.abka...@mimos.my wrote:
 More logs. I ran ceph-mds with debug-mds=20.

 -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay 
 inotable tablev 4316124 <= table 4317932
 -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay 
 sessionmap v8632368 -(1|2) == table 7239603 prealloc [141df86~1] used 
 141db9e
   0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In function 
 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 
 7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED 
 assert(session)

 Please look at the attachment for more details.

 Regards,
 Bazli



Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-25 Thread Luke Jing Yuan
Hi Greg,

Actually, the cluster that my colleague and I are working on is rather new and 
still has plenty of space left (less than 7% used). What we noticed just before 
the MDS gave us this problem was a temporary network issue in the data center, 
so we are not sure whether that could have been the root cause.

Anyway, how may we create an empty sessionmap?

Regards,
Luke

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: Saturday, 26 April, 2014 2:02 AM
To: Mohd Bazli Ab Karim
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay 
mds/journal.cc

Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did 
your cluster get full at some point?

In any case, we're working on tools to repair this now but they aren't ready 
for use yet. Probably the only thing you could do is create an empty sessionmap 
with a higher version than the ones the journal refers to, but that might have 
other fallout effects...
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my 
wrote:
 More logs. I ran ceph-mds with debug-mds=20.

  -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay inotable tablev 4316124 <= table 4317932
  -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603
  prealloc [141df86~1] used 141db9e
    0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In
  function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)'
  thread 7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc:
  1303: FAILED assert(session)

 Please look at the attachment for more details.

 Regards,
 Bazli



[ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-24 Thread Mohd Bazli Ab Karim
Dear Ceph-devel, ceph-users,

I am currently facing an issue with my ceph mds server. The ceph-mds daemon 
does not want to come back up.
I tried running it manually with ceph-mds -i mon01 -d, but it shows that it 
gets stuck at the failed assert(session) at line 1303 in mds/journal.cc and aborts.

Can someone shed some light on this issue?
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Let me know if I need to send log with debug enabled.

Regards,
Bazli
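
For reference, a sketch of how such a debug log could be captured (mds id mon01
as above; -d keeps the daemon in the foreground and sends log output to stderr):

  ceph-mds -i mon01 -d --debug_mds=20 2>&1 | tee /tmp/ceph-mds.mon01.log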



2014-04-25 12:17:27.210367 7f3c30250780  0 ceph version 0.72.2 
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 5492
starting mds.mon01 at :/0
2014-04-25 12:17:27.441530 7f3c2b2d6700  1 mds.-1.0 handle_mds_map standby
2014-04-25 12:17:27.624820 7f3c2b2d6700  1 mds.0.12834 handle_mds_map i am now 
mds.0.12834
2014-04-25 12:17:27.624825 7f3c2b2d6700  1 mds.0.12834 handle_mds_map state 
change up:standby --> up:replay
2014-04-25 12:17:27.624830 7f3c2b2d6700  1 mds.0.12834 replay_start
2014-04-25 12:17:27.624836 7f3c2b2d6700  1 mds.0.12834  recovery set is 
2014-04-25 12:17:27.624837 7f3c2b2d6700  1 mds.0.12834  need osdmap epoch 
29082, have 29081
2014-04-25 12:17:27.624839 7f3c2b2d6700  1 mds.0.12834  waiting for osdmap 
29082 (which blacklists prior instance)
2014-04-25 12:17:30.138623 7f3c2b2d6700  0 mds.0.cache creating system inode 
with ino:100
2014-04-25 12:17:30.138890 7f3c2b2d6700  0 mds.0.cache creating system inode 
with ino:1
mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, 
MDSlaveUpdate*)' thread 7f3c26fbd700 time 2014-04-25 12:17:30.441635
mds/journal.cc: 1303: FAILED assert(session)
 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x7830) [0x5af890]
 2: (EUpdate::replay(MDS*)+0x3a) [0x5b67ea]
 3: (MDLog::_replay_thread()+0x678) [0x79dbb8]
 4: (MDLog::ReplayThread::entry()+0xd) [0x58bded]
 5: (()+0x7e9a) [0x7f3c2f675e9a]
 6: (clone()+0x6d) [0x7f3c2e56a3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
2014-04-25 12:17:30.442489 7f3c26fbd700 -1 mds/journal.cc: In function 'void 
EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f3c26fbd700 time 
2014-04-25 12:17:30.441635
mds/journal.cc: 1303: FAILED assert(session)

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x7830) [0x5af890]
 2: (EUpdate::replay(MDS*)+0x3a) [0x5b67ea]
 3: (MDLog::_replay_thread()+0x678) [0x79dbb8]
 4: (MDLog::ReplayThread::entry()+0xd) [0x58bded]
 5: (()+0x7e9a) [0x7f3c2f675e9a]
 6: (clone()+0x6d) [0x7f3c2e56a3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

--- begin dump of recent events ---
  -172 2014-04-25 12:17:27.208884 7f3c30250780  5 asok(0x1b99000) 
register_command