On Sun, Apr 10, 2016 at 4:12 AM, Don Waterloo <don.water...@gmail.com> wrote:
> I have a 6 OSD system (with 3 mon and 3 mds).
> It is running cephfs as part of its task.
>
> I have upgraded the 3 mon nodes to Ubuntu 16.04 and the bundled ceph
> 10.1.0-0ubuntu1.
>
> (upgraded from Ubuntu 15.10 with ceph 0.94.6-0ubuntu0.15.10.1).
>
> 2 of the mon nodes are happy and up, but the 3rd is giving an assert failure
> on start.
> Specifically, the assert is:
> mds/FSMap.cc: 555: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
>
> 'ceph status' is showing 3 mds (1 up active, 2 up standby):
>
> # ceph status
> 2016-04-10 03:08:24.522804 7f2be870c700 0 -- :/1760247070 >>
> 10.100.10.62:6789/0 pipe(0x7f2be405a2f0 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f2be405bf90).fault
>     cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
>      health HEALTH_WARN
>             crush map has legacy tunables (require bobtail, min is firefly)
>             1 mons down, quorum 0,1 nubo-1,nubo-2
>      monmap e1: 3 mons at
> {nubo-1=10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
>             election epoch 2778, quorum 0,1 nubo-1,nubo-2
>      mdsmap e1279: 1/1/1 up {0:0=nubo-2=up:active}, 2 up:standby
>      osdmap e5666: 6 osds: 6 up, 6 in
>       pgmap v1476810: 712 pgs, 5 pools, 41976 MB data, 109 kobjects
>             86310 MB used, 5538 GB / 5622 GB avail
>                  712 active+clean
>
> I'm not sure what to do at this stage. I've rebooted all of them, and I've tried
> taking the 2 standby MDS down. I don't see why this mon fails when the
> others succeed.
>
> Does anyone have any suggestions?
>
> The stack trace from the assert gives:
>  1: (()+0x51fb9d) [0x5572d9e42b9d]
>  2: (()+0x113e0) [0x7fa285f8b3e0]
>  3: (gsignal()+0x38) [0x7fa28416b518]
>  4: (abort()+0x16a) [0x7fa28416d0ea]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x26b) [0x5572d9f7082b]
>  6: (FSMap::sanity() const+0x9ae) [0x5572d9e84f4e]
>  7: (MDSMonitor::update_from_paxos(bool*)+0x313) [0x5572d9c7e8f3]
>  8: (PaxosService::refresh(bool*)+0x3dd) [0x5572d9c012dd]
>  9: (Monitor::refresh_from_paxos(bool*)+0x193) [0x5572d9b99693]
> 10: (Monitor::init_paxos()+0x115) [0x5572d9b99ad5]
> 11: (Monitor::preinit()+0x902) [0x5572d9bca252]
> 12: (main()+0x255b) [0x5572d9b3ec9b]
> 13: (__libc_start_main()+0xf1) [0x7fa284156841]
> 14: (_start()+0x29) [0x5572d9b8b869]
Please provide the full log from the mon starting up to it crashing, with "debug mon = 10" set.

If the mons really are all running the same code but only one is failing, then presumably that one has somehow ended up storing something invalid in its local store during the upgrade, while the others have already moved past that version. v10.1.1 (i.e. Jewel, once it is released) has a configuration option (mon_mds_skip_sanity) that may let you get past this, assuming what's in the leader's store is indeed valid; I'm guessing it is, since your other two mons are apparently happy.

I don't know exactly how the Ubuntu release process works, but you should be aware that the Ceph version you're running is pre-release code from the jewel branch. If your CephFS data pool happens to have ID 0, you will also hit a severe bug in that code, and you should stop using it now (see the note here: http://blog.gmane.org/gmane.comp.file-systems.ceph.announce).

John
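In case it helps, here is a minimal ceph.conf sketch of the two settings mentioned above, on the node with the failing mon. It assumes the failing mon's id is nubo-3 (the mon shown out of quorum in the status output), and remember that mon_mds_skip_sanity only exists from v10.1.1 onward:

    [mon.nubo-3]
        # capture a verbose startup log from the crashing mon
        debug mon = 10

        # only with v10.1.1 or later, and only if the leader's store is
        # known to be valid; uncomment to skip the FSMap sanity check
        # mon mds skip sanity = true

With that in place, restarting just that mon (for example in the foreground with 'ceph-mon -i nubo-3 -d') should produce a much more detailed log of the refresh path that hits the assert.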