Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
The "out" OSD was "out" before the crash and doesn't hold any data as it was weighted out prior. Restarting OSDs named as repeat offenders as listed by 'ceph health detail' has cleared problems. Thanks to all for the guidance and suffering my panic, -- Eric On 4/12/16 12:38 PM, Eric Hall wr

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread LOPEZ Jean-Charles
Hi, looks like one of your OSDs has been marked as out. Just make sure it's in, so that the 'ceph -s' output reads '67 osds: 67 up, 67 in' rather than '67 osds: 67 up, 66 in'. You can quickly check which one is not in with the 'ceph osd tree' command. JC > On Apr 12, 2016, at 11:21, Joao Ed
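
The check described in this reply (comparing the total OSD count against the "in" count in the 'ceph -s' summary) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the regular expression assumes the summary fragment looks exactly like the one quoted above ('67 osds: 67 up, 66 in'); other Ceph releases may format it differently.

```python
import re

# Matches the OSD summary fragment quoted in the thread,
# e.g. "67 osds: 67 up, 66 in". The exact format is an assumption
# based on that quote only.
OSD_SUMMARY = re.compile(r"(\d+)\s+osds:\s+(\d+)\s+up,\s+(\d+)\s+in")


def osd_counts(status_text):
    """Return (total, up, in) parsed from status text, or None if not found."""
    match = OSD_SUMMARY.search(status_text)
    if match is None:
        return None
    total, up, in_count = (int(group) for group in match.groups())
    return total, up, in_count


def osds_not_in(status_text):
    """Return how many OSDs are not marked 'in' according to the summary."""
    counts = osd_counts(status_text)
    if counts is None:
        raise ValueError("no OSD summary found in status text")
    total, _up, in_count = counts
    return total - in_count
```

Fed the two strings from the message, osds_not_in() returns 1 for '67 osds: 67 up, 66 in' and 0 for '67 osds: 67 up, 67 in'. It only reports how many OSDs are out, not which ones.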

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis
On 04/12/2016 07:16 PM, Eric Hall wrote: Removed mon on mon1, added mon on mon1 via ceph-deploy. mons now have quorum. I am left with: cluster 5ee52b50-838e-44c4-be3c-fc596dc46f4e health HEALTH_WARN 1086 pgs peering; 1086 pgs stuck inactive; 1086 pgs stuck unclean; pool vms has too few

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
Removed mon on mon1, added mon on mon1 via ceph-deploy. mons now have quorum. I am left with: cluster 5ee52b50-838e-44c4-be3c-fc596dc46f4e health HEALTH_WARN 1086 pgs peering; 1086 pgs stuck inactive; 1086 pgs stuck unclean; pool vms has too few pgs monmap e5: 3 mons at {cephsecur

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis
On 04/12/2016 06:38 PM, Eric Hall wrote: Ok, mon2 and mon3 are happy together, but mon1 dies with mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db") I take this to mean mon1:store.db is corrupt as I see no permission issues. So... remove mon1 and add a mon? Nothing special

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
Ok, mon2 and mon3 are happy together, but mon1 dies with mon/MonitorDBStore.h: 287: FAILED assert(0 == "failed to write to db") I take this to mean mon1:store.db is corrupt as I see no permission issues. So... remove mon1 and add a mon? Nothing special to worry about re-adding a mon on mon1, o

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis
On 04/12/2016 05:06 PM, Joao Eduardo Luis wrote: On 04/12/2016 04:27 PM, Eric Hall wrote: On 4/12/16 9:53 AM, Joao Eduardo Luis wrote: So this looks like the monitors didn't remove version 1, but this may just be a red herring. What matters, really, is the values in 'first_committed' and 'las

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis
On 04/12/2016 04:27 PM, Eric Hall wrote: On 4/12/16 9:53 AM, Joao Eduardo Luis wrote: So this looks like the monitors didn't remove version 1, but this may just be a red herring. What matters, really, is the values in 'first_committed' and 'last_committed'. If either first or last_committed ha

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
On 4/12/16 9:53 AM, Joao Eduardo Luis wrote: So this looks like the monitors didn't remove version 1, but this may just be a red herring. What matters, really, is the values in 'first_committed' and 'last_committed'. If either first or last_committed happens to be '1', then there may be a bug s
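
The condition Joao describes (either 'first_committed' or 'last_committed' being 1) can be written down as a small Python predicate. This is a sketch only: the function name is hypothetical, the two values are assumed to have already been read from the monitor store by hand, and the quoted sentence is truncated in this archive, so the code encodes nothing beyond the part that is visible.

```python
def committed_value_is_one(first_committed, last_committed):
    """Return True if either committed version equals 1.

    Per the quoted message, this is the case in which "there may be
    a bug"; what follows in the original mail is cut off in this
    archive, so no further interpretation is attempted here.
    """
    return first_committed == 1 or last_committed == 1
```

For example, committed_value_is_one(1, 38630) is True, while committed_value_is_one(38000, 38630) is False. The numbers in this example are illustrative and are not taken from Eric's cluster.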

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Joao Eduardo Luis
On 04/12/2016 03:33 PM, Eric Hall wrote: On 4/12/16 9:02 AM, Gregory Farnum wrote: On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall wrote: On 4/12/16 12:01 AM, Gregory Farnum wrote: Exactly what values are you reading that's giving you those values? The "real" OSDMap epoch is going to be at least 3

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
On 4/12/16 9:02 AM, Gregory Farnum wrote: On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall wrote: On 4/12/16 12:01 AM, Gregory Farnum wrote: Exactly what values are you reading that's giving you those values? The "real" OSDMap epoch is going to be at least 38630...if you're very lucky it will be exa

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Gregory Farnum
On Tue, Apr 12, 2016 at 4:41 AM, Eric Hall wrote: > On 4/12/16 12:01 AM, Gregory Farnum wrote: >> >> On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall >> wrote: >>> >>> Power failure in data center has left 3 mons unable to start with >>> mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch) >>

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-12 Thread Eric Hall
On 4/12/16 12:01 AM, Gregory Farnum wrote: On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall wrote: Power failure in data center has left 3 mons unable to start with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch) Have found a similar problem discussed at http://irclogs.ceph.widodh.nl/in

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-11 Thread Gregory Farnum
On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall wrote: > Power failure in data center has left 3 mons unable to start with > mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch) > > Have found a similar problem discussed at > http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsur

[ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-11 Thread Eric Hall
Power failure in data center has left 3 mons unable to start with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch) Have found a similar problem discussed at http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsure how to proceed. If I read ceph-kvstore-tool /var/lib