Whatever happened - It fixed itself!?

When restarting, I got ~165k log messages like:
2014-04-17 07:30:14.856421 7fc50b991700  0 log [WRN] :  ino 10000f24fe0
2014-04-17 07:30:14.856422 7fc50b991700  0 log [WRN] :  ino 10000f24fe1
2014-04-17 07:30:14.856423 7fc50b991700  0 log [WRN] :  ino 10000f24fe2
2014-04-17 07:30:14.856424 7fc50b991700  0 log [WRN] :  ino 10000f24fe3
2014-04-17 07:30:14.856427 7fc50b991700  0 log [WRN] :  ino 10000f24fe4
2014-04-17 07:30:14.856428 7fc50b991700  0 log [WRN] :  ino 10000f24fe5
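
(For reference, that ~165k figure can be reproduced by counting the
warnings per log file - log path assumed from the default layout:

#zgrep -c 'log \[WRN\] :  ino' /var/log/ceph/ceph-mds.ceph-m-02.log*

zgrep handles the rotated .gz files as well as the current log.)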

And the clients recovered!?

I would be really interested to know what happened!

Georg

On 17.04.2014 09:45, Georg Höllrigl wrote:
Hello Greg,

I've searched, but I don't see any backtraces. I've tried to get some
more info out of the logs - I really hope there is something
interesting in it:

It all started two days ago with an authentication error:

2014-04-14 21:08:55.929396 7fd93d53f700  1 mds.0.0
standby_replay_restart (as standby)
2014-04-14 21:09:07.989547 7fd93b62e700  1 mds.0.0 replay_done (as standby)
2014-04-14 21:09:08.989647 7fd93d53f700  1 mds.0.0
standby_replay_restart (as standby)
2014-04-14 21:09:10.633786 7fd93b62e700  1 mds.0.0 replay_done (as standby)
2014-04-14 21:09:11.633886 7fd93d53f700  1 mds.0.0
standby_replay_restart (as standby)
2014-04-14 21:09:17.995105 7fd93f644700  0 mds.0.0 handle_mds_beacon no
longer laggy
2014-04-14 21:09:39.798798 7fd93f644700  0 monclient: hunting for new mon
2014-04-14 21:09:39.955078 7fd93f644700  1 mds.-1.-1 handle_mds_map i
(10.0.1.107:6800/16503) dne in the mdsmap, respawning myself
2014-04-14 21:09:39.955094 7fd93f644700  1 mds.-1.-1 respawn
2014-04-14 21:09:39.955106 7fd93f644700  1 mds.-1.-1  e:
'/usr/bin/ceph-mds'
2014-04-14 21:09:39.955109 7fd93f644700  1 mds.-1.-1  0:
'/usr/bin/ceph-mds'
2014-04-14 21:09:39.955110 7fd93f644700  1 mds.-1.-1  1: '-i'
2014-04-14 21:09:39.955112 7fd93f644700  1 mds.-1.-1  2: 'ceph-m-02'
2014-04-14 21:09:39.955113 7fd93f644700  1 mds.-1.-1  3: '--pid-file'
2014-04-14 21:09:39.955114 7fd93f644700  1 mds.-1.-1  4:
'/var/run/ceph/mds.ceph-m-02.pid'
2014-04-14 21:09:39.955116 7fd93f644700  1 mds.-1.-1  5: '-c'
2014-04-14 21:09:39.955117 7fd93f644700  1 mds.-1.-1  6:
'/etc/ceph/ceph.conf'
2014-04-14 21:09:39.979138 7fd93f644700  1 mds.-1.-1  cwd /
2014-04-14 19:09:40.922683 7f8ba9973780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 16505
2014-04-14 19:09:40.975024 7f8ba9973780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-14 19:09:40.975070 7f8ba9973780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot

That was fixed by restarting the mds (and the whole server).
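
In case anyone else hits the same "Operation not permitted" (as the
log below shows, the restarts kept failing the same way until ~07:35):
a quick sanity check for a cephx key mismatch is to compare what the
monitors have on file with the keyring the daemon reads at startup -
keyring path assumed from the default layout:

#ceph auth get mds.ceph-m-02
#cat /var/lib/ceph/mds/ceph-ceph-m-02/keyring

If the two keys differ, re-importing the local keyring on the monitors
(e.g. ceph auth import -i /var/lib/ceph/mds/ceph-ceph-m-02/keyring)
should clear the auth error without a full server restart.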

2014-04-15 07:07:15.948650 7f9fdec0d780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 506
2014-04-15 07:07:15.954386 7f9fdec0d780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:07:15.954422 7f9fdec0d780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:15:49.177861 7fe8a1d60780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 26401
2014-04-15 07:15:49.184027 7fe8a1d60780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:15:49.184046 7fe8a1d60780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:17:32.598031 7fab123e6780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 30531
2014-04-15 07:17:32.604560 7fab123e6780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:17:32.604592 7fab123e6780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:21:56.099203 7fd37b951780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11335
2014-04-15 07:21:56.105229 7fd37b951780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:21:56.105254 7fd37b951780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:22:09.345800 7f23392ef780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11461
2014-04-15 07:22:09.390001 7f23392ef780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:22:09.391087 7f23392ef780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:28:01.762191 7fab6d14b780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 28263
2014-04-15 07:28:01.779485 7fab6d14b780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-15 07:28:01.779507 7fab6d14b780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot
2014-04-15 07:35:49.065110 7fe4f6b0d780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 1233
2014-04-15 07:35:52.191856 7fe4f6b05700  0 -- 10.0.1.107:6800/1233 >>
10.0.1.108:6789/0 pipe(0x2f9f500 sd=8 :0 s=1 pgs=0 cs=0 l=1
c=0x2f81580).fault
2014-04-15 07:35:56.352553 7fe4f1b96700  1 mds.-1.0 handle_mds_map standby
2014-04-15 07:35:56.419499 7fe4f1b96700  1 mds.0.0 handle_mds_map i am
now mds.7905854.0replaying mds.0.0
2014-04-15 07:35:56.419507 7fe4f1b96700  1 mds.0.0 handle_mds_map state
change up:standby --> up:standby-replay
2014-04-15 07:35:56.419512 7fe4f1b96700  1 mds.0.0 replay_start
2014-04-15 07:35:56.425229 7fe4f1b96700  1 mds.0.0  recovery set is
2014-04-15 07:35:56.425241 7fe4f1b96700  1 mds.0.0  need osdmap epoch
1391, have 3696
2014-04-15 07:35:56.497573 7fe4f1b96700  0 mds.0.cache creating system
inode with ino:100
2014-04-15 07:35:56.501393 7fe4f1b96700  0 mds.0.cache creating system
inode with ino:1
2014-04-15 07:35:58.294776 7fe4ed67b700  1 mds.0.0 replay_done (as standby)
2014-04-15 07:35:59.294874 7fe4ef990700  1 mds.0.0
standby_replay_restart (as standby)

Here everything runs normally...


2014-04-15 20:26:19.724564 7fbcca70d700  0 mds.0.3 handle_mds_beacon no
longer laggy
2014-04-16 04:31:13.485103 7fbcc55ed700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.81:0/61463 pipe(0xf930780 sd=55 :6800 s=2 pgs=2 cs=1 l=0
c=0x71db6e0).fault with nothing to send, going to standby
2014-04-16 04:31:13.485222 7fbcc46de700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.68:0/4128280551 pipe(0xf6ac780 sd=28 :6800 s=2 pgs=4 cs=1 l=0
c=0xe31d6e0).fault with nothing to send, going to standby
2014-04-16 04:31:14.506269 7fbcc7505700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.42:0/2817517925 pipe(0x1a7c9780 sd=36 :6800 s=2 pgs=2 cs=1 l=0
c=0xf6749a0).fault with nothing to send, going to standby
2014-04-16 04:31:19.852504 7fbcc12aa700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.156:0/4413 pipe(0x128165c80 sd=66 :6800 s=2 pgs=2 cs=1 l=0
c=0x12b5b9dc0).fault with nothing to send, going to standby
2014-04-16 04:31:38.971665 7fbcc30c8700  0 -- 10.0.1.108:6800/22033 >>
10.0.0.25:0/31945 pipe(0xf930500 sd=56 :6800 s=2 pgs=2 cs=1 l=0
c=0x71db420).fault with nothing to send, going to standby
2014-04-16 04:31:38.971680 7fbcc1bb3700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.39:0/13485 pipe(0xf6aca00 sd=29 :6800 s=2 pgs=2 cs=1 l=0
c=0xe31d2c0).fault with nothing to send, going to standby
2014-04-16 04:31:38.971714 7fbcc63f4700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.40:0/12395 pipe(0x31fe000 sd=46 :6800 s=2 pgs=2 cs=1 l=0
c=0x726b160).fault with nothing to send, going to standby
2014-04-16 04:31:38.979886 7fbcc2cc4700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.152:0/14639 pipe(0x1a7c9000 sd=64 :6800 s=2 pgs=2 cs=1 l=0
c=0x71db580).fault with nothing to send, going to standby
2014-04-16 04:31:38.984613 7fbcc37cf700  0 -- 10.0.1.108:6800/22033 >>
10.0.1.66:0/114165773 pipe(0x16624000 sd=48 :6800 s=2 pgs=4 cs=1 l=0
c=0xf674c60).fault with nothing to send, going to standby
2014-04-16 04:32:41.689287 7fbcca70d700  1 mds.-1.-1 handle_mds_map i
(10.0.1.108:6800/22033) dne in the mdsmap, respawning myself
2014-04-16 04:32:41.689294 7fbcca70d700  1 mds.-1.-1 respawn
2014-04-16 04:32:41.689301 7fbcca70d700  1 mds.-1.-1  e:
'/usr/bin/ceph-mds'
2014-04-16 04:32:41.689304 7fbcca70d700  1 mds.-1.-1  0:
'/usr/bin/ceph-mds'
2014-04-16 04:32:41.689306 7fbcca70d700  1 mds.-1.-1  1: '-i'
2014-04-16 04:32:41.689308 7fbcca70d700  1 mds.-1.-1  2: 'ceph-m-03'
2014-04-16 04:32:41.689309 7fbcca70d700  1 mds.-1.-1  3: '--pid-file'
2014-04-16 04:32:41.689310 7fbcca70d700  1 mds.-1.-1  4:
'/var/run/ceph/mds.ceph-m-03.pid'
2014-04-16 04:32:41.689312 7fbcca70d700  1 mds.-1.-1  5: '-c'
2014-04-16 04:32:41.689313 7fbcca70d700  1 mds.-1.-1  6:
'/etc/ceph/ceph.conf'
2014-04-16 04:32:41.742899 7fbcca70d700  1 mds.-1.-1  cwd /
2014-04-16 02:32:43.116287 7ff7375ad780  0 ceph version 0.72.2
(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 22035
2014-04-16 02:32:43.226015 7ff7375ad780 -1 mds.-1.0 ERROR: failed to
authenticate: (1) Operation not permitted
2014-04-16 02:32:43.226040 7ff7375ad780  1 mds.-1.0 suicide.  wanted
down:dne, now up:boot

Nothing worked - mounting was not possible, and restarting the process
didn't help either. The logs filled with:

2014-04-16 12:29:46.417441 7f1de27f5700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.108:6789/0 pipe(0xabc80000 sd=250 :44413 s=4 pgs=0 cs=0 l=1
c=0x178870580).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:47.065881 7f1de22f0700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x25b25280 sd=277 :39688 s=4 pgs=0 cs=0 l=1
c=0x159b4c840).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:47.231550 7f1de05d3700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x1049a5280 sd=252 :39919 s=4 pgs=0 cs=0 l=1
c=0x4e53d840).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:47.359261 7f1de1ae8700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x103d3b500 sd=244 :40070 s=4 pgs=0 cs=0 l=1
c=0x1130f92c0).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:47.486590 7f1de30fe700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x47889780 sd=245 :40175 s=4 pgs=0 cs=0 l=1
c=0xc6c09420).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:47.904427 7f1de3603700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x88069000 sd=262 :40554 s=4 pgs=0 cs=0 l=1
c=0x116652580).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:48.272279 7f1de12e0700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0xed8f9c80 sd=257 :40929 s=4 pgs=0 cs=0 l=1
c=0xf2c2f160).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:49.822148 7f1de18e6700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x110926780 sd=264 :42222 s=4 pgs=0 cs=0 l=1
c=0x10508e9a0).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:50.769693 7f1ddf5c3700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x6dbc6280 sd=243 :43169 s=4 pgs=0 cs=0 l=1
c=0x1358ed580).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:51.621605 7f1de05d3700  0 -- 10.0.1.107:6803/21953 >>
10.0.1.107:6789/0 pipe(0x21154780 sd=243 :43929 s=4 pgs=0 cs=0 l=1
c=0xecb1a580).connect got RESETSESSION but no longer connecting
2014-04-16 12:29:51.886405 7f1dea867700 -1 mds.0.13 *** got signal
Terminated ***
2014-04-16 12:29:51.886894 7f1dea867700  1 mds.0.13 suicide.  wanted
down:dne, now up:rejoin
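
(Since all of those RESETSESSION messages involve the monitors on
10.0.1.107 and 10.0.1.108, the monitor side can be checked
independently, e.g. with:

#ceph quorum_status
#ceph health detail

just to rule out a quorum or clock-skew problem before blaming the
mds.)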


I also see lots of these:
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.900657 7f937a5e1700  0
log [ERR] : dir 10000c4639d.10000c4639d object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.913245 7f937a5e1700  0
log [ERR] : dir 10000c4617e.10000c4617e object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.925811 7f937a5e1700  0
log [ERR] : dir 10000c45d08.10000c45d08 object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.941476 7f937a5e1700  0
log [ERR] : dir 10000c45d9e.10000c45d9e object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.956158 7f937a5e1700  0
log [ERR] : dir 10000c461e5.10000c461e5 object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.968524 7f937a5e1700  0
log [ERR] : dir 10000c46608.10000c46608 object missing on disk; some
files may be lost
ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.979229 7f937a5e1700  0
log [ERR] : dir 10000c468b6.10000c468b6 object missing on disk; some
files may be lost
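
Each of those "missing on disk" dir objects can be checked directly in
RADOS - pool name assumed to be the default "metadata" pool here:

#rados -p metadata stat 10000c4639d.10000c4639d

If rados also reports "No such file or directory", the object really
is gone from the pool and not just from the MDS's point of view.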

At the moment I have only one mds running - but the clients (mainly
using fuse) can't connect.
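
A verbose mount attempt usually shows where the fuse clients get stuck
(the details end up in the client log); the mountpoint is just an
example and the debug options are generic client settings:

#ceph-fuse -m 10.0.1.107:6789 /mnt/cephfs --debug-client 20 --debug-ms 1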



Regards,
Georg



On 16.04.2014 16:27, Gregory Farnum wrote:
What's the backtrace from the MDS crash?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 16, 2014 at 7:11 AM, Georg Höllrigl
<georg.hoellr...@xidras.com> wrote:
Hello,

We're using Ceph MDS with one active and one standby server. A day ago
one of the mds daemons crashed and I restarted it.
Tonight it crashed again, and a few hours later the second mds crashed
as well.

#ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

At the moment cephfs is dead - with following health status:

#ceph -s
     cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
      health HEALTH_WARN mds cluster is degraded; mds c is laggy
      monmap e3: 3 mons at
{ceph-m-01=10.0.0.176:6789/0,ceph-m-02=10.0.1.107:6789/0,ceph-m-03=10.0.1.108:6789/0},

election epoch 6274, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
      mdsmap e2055: 1/1/1 up {0=ceph-m-03=up:rejoin(laggy or crashed)}
      osdmap e3752: 39 osds: 39 up, 39 in
       pgmap v3277576: 8328 pgs, 17 pools, 6461 GB data, 17066 kobjects
             13066 GB used, 78176 GB / 91243 GB avail
                 8328 active+clean
   client io 1193 B/s rd, 0 op/s

I couldn't find any useful information in the logfiles or in the
documentation. Any ideas how to get cephfs up and running?
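
If more detail would help, I can turn up the mds logging - either in
ceph.conf:

[mds]
    debug mds = 20
    debug ms = 1

or at runtime (syntax from memory):

#ceph mds tell 0 injectargs '--debug_mds 20 --debug_ms 1'

and post the result.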

Here is part of the mds log:
2014-04-16 14:07:05.603501 7ff184c64700  1 mds.0.server reconnect
gave up on
client.7846580 10.0.1.152:0/14639
2014-04-16 14:07:05.603525 7ff184c64700  1 mds.0.46 reconnect_done
2014-04-16 14:07:05.674990 7ff186d69700  1 mds.0.46 handle_mds_map i
am now
mds.0.46
2014-04-16 14:07:05.674996 7ff186d69700  1 mds.0.46 handle_mds_map state
change up:reconnect --> up:rejoin
2014-04-16 14:07:05.674998 7ff186d69700  1 mds.0.46 rejoin_start
2014-04-16 14:07:22.347521 7ff17f825700  0 -- 10.0.1.107:6815/17325 >>
10.0.1.68:0/4128280551 pipe(0x5e2ac80 sd=930 :6815 s=2 pgs=153 cs=1 l=0
c=0x5e2e160).fault with nothing to send, going to standby

Any ideas how to solve "laggy or crashed"?
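
(The full mdsmap - ranks, states, and how long the active rank has
been marked laggy - can be dumped with:

#ceph mds dump

which at least shows how the monitors currently see the mds cluster.)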


Georg
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
