Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-10 Thread Philippe D'Anjou
 After trying to disable the paxos service trim temporarily (since that seemed 
to trigger it initially), we now see this:
    "assert_condition": "from != to",
    "assert_func": "void PaxosService::trim(MonitorDBStore::TransactionRef, 
version_t, version_t)",
    "assert_file": "/build/ceph-14.2.4/src/mon/PaxosService.cc",
    "assert_line": 412,
    "assert_thread_name": "safe_timer",
    "assert_msg": "/build/ceph-14.2.4/src/mon/PaxosService.cc: In function 
'void PaxosService::trim(MonitorDBStore::TransactionRef, version_t, version_t)' 
thread 7fd31cb9a700 time 2019-10-10 
13:13:59.394987\n/build/ceph-14.2.4/src/mon/PaxosService.cc: 412: FAILED 
ceph_assert(from != to)\n",

We need some crutch... all I need is a running mon to mount CephFS; the data
is still fine.
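
For reference, a rough sketch of the knobs involved in holding off trimming on
a single mon long enough to copy data off. The option names exist in the
Nautilus mon options, but whether they actually avoid this particular assert
is untested here, so treat it as an assumption, not a recipe:

    # Sketch only, untested against this assert. Raising paxos_service_trim_min
    # makes PaxosService::maybe_trim() bail out early; mon_debug_block_osdmap_trim
    # is a debug option intended to stop the OSDMonitor from trimming maps.
    ceph config set mon mon_debug_block_osdmap_trim true
    ceph config set mon paxos_service_trim_min 999999

    # If the mon will not stay up long enough for "ceph config set", the same
    # options can go into ceph.conf on the mon host before starting it:
    # [mon]
    #     mon_debug_block_osdmap_trim = true
    #     paxos_service_trim_min = 999999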

On Wednesday, October 9, 2019, 20:19:42 OESZ, Gregory Farnum wrote:
 
 On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this
> cluster and it's just being used to copy off data now. We weren't able to
> add more mons because once a second mon was added it crashed the first one
> (there's a bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but
> they had the rocksdb corruption issue (not sure why that happened; it had
> run fine for two months).
>
> Any options? I mean everything still works: data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up, it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg


> On Monday, October 7, 2019, 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool before because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant by "turn it off and rebuild from the remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > >4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > >lease_expire=0.00 has v0 lc 4549352
> > >    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > >closest pinned map ver 252615 not available! error: (2) No such file or 
> > >directory
> > >    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > >OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > >7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > >(stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > >const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > >const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: 

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-10 Thread Philippe D'Anjou
How do I import an osdmap in Nautilus? I saw documentation for older
versions, but it seems one can now only export and no longer import?
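
For anyone searching later: a rough sketch of the export side, plus a heavily
hedged guess at the import side. Epoch 252615 is taken from the crash log
above; "ceph-NAME" and the OSD path are placeholders, and the assumed store
key layout ("osdmap" prefix, "full_<epoch>" keys) was not confirmed in the
thread. Work only on a copy of the mon store.

    # Export a full osdmap epoch via the cluster (only works while a mon
    # answers and that epoch has not been trimmed):
    ceph osd getmap 252615 -o osdmap.252615
    osdmaptool --print osdmap.252615

    # Or pull the same epoch straight from a stopped OSD that still holds it:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --op get-osdmap --epoch 252615 --file osdmap.252615

    # The import side is the undocumented part. Assuming full maps live in the
    # mon store as "full_<epoch>" keys under the "osdmap" prefix (verify with
    # "list osdmap" first!), something like this might work -- mon stopped,
    # and only against a copy of the store:
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-NAME/store.db list osdmap | grep full_
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-NAME/store.db \
        set osdmap full_252615 in osdmap.252615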

On Thursday, October 10, 2019, 08:52:03 OESZ, Philippe D'Anjou wrote:
 
I don't think this has anything to do with CephFS; the mon crashes for the
same reason even without the MDS running. I still have the old rocksdb files,
but they had a corruption issue, so I'm not sure whether that is easier to
fix. There haven't been any changes on the cluster in between.
This is a disaster rebuild: we managed to get all CephFS data back online,
apart from some metadata, and we have been copying it off for the last few
weeks. But suddenly the mon died, first from the rocksdb corruption and now,
after the repair, because of that osdmap issue.

On Wednesday, October 9, 2019, 20:19:42 OESZ, Gregory Farnum wrote:
 
 On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this
> cluster and it's just being used to copy off data now. We weren't able to
> add more mons because once a second mon was added it crashed the first one
> (there's a bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but
> they had the rocksdb corruption issue (not sure why that happened; it had
> run fine for two months).
>
> Any options? I mean everything still works: data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up, it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg


> On Monday, October 7, 2019, 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool before because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant by "turn it off and rebuild from the remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > >4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > >lease_expire=0.00 has v0 lc 4549352
> > >    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > >closest pinned map ver 252615 not available! error: (2) No such file or 
> > >directory
> > >    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > >OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > >7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > >(stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > >const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > >const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: 

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Philippe D'Anjou
I don't think this has anything to do with CephFS; the mon crashes for the
same reason even without the MDS running. I still have the old rocksdb files,
but they had a corruption issue, so I'm not sure whether that is easier to
fix. There haven't been any changes on the cluster in between.
This is a disaster rebuild: we managed to get all CephFS data back online,
apart from some metadata, and we have been copying it off for the last few
weeks. But suddenly the mon died, first from the rocksdb corruption and now,
after the repair, because of that osdmap issue.

On Wednesday, October 9, 2019, 20:19:42 OESZ, Gregory Farnum wrote:
 
 On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this
> cluster and it's just being used to copy off data now. We weren't able to
> add more mons because once a second mon was added it crashed the first one
> (there's a bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but
> they had the rocksdb corruption issue (not sure why that happened; it had
> run fine for two months).
>
> Any options? I mean everything still works: data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up, it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg


> On Monday, October 7, 2019, 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool before because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant by "turn it off and rebuild from the remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > >4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > >lease_expire=0.00 has v0 lc 4549352
> > >    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > >closest pinned map ver 252615 not available! error: (2) No such file or 
> > >directory
> > >    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > >OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > >7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > >(stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > >const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > >const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> > >  5: 
> > >(OSDMonitor::encode_trim_extra(std::shared_ptr,
> > > unsigned long)+0x8c) [0x717c3c]
> > >  6: (PaxosService::maybe_trim()+0x473) 

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this
> cluster and it's just being used to copy off data now. We weren't able to
> add more mons because once a second mon was added it crashed the first one
> (there's a bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but
> they had the rocksdb corruption issue (not sure why that happened; it had
> run fine for two months).
>
> Any options? I mean everything still works: data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up, it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg
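
For reference, the disaster recovery tools Greg mentions are the
cephfs-journal-tool / cephfs-table-tool / cephfs-data-scan set. A compressed
sketch of the documented sequence follows, assuming a filesystem named
"cephfs" and a data pool named "cephfs_data" (both placeholders). Every step
is destructive to MDS metadata, so it is a last resort, not a recommendation:

    # Back up the MDS journal, recover what it holds, then reset it:
    cephfs-journal-tool --rank=cephfs:0 journal export /root/mds-journal.bin
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset
    cephfs-table-tool all reset session

    # Rebuild metadata from the data pool objects (slow on large pools):
    cephfs-data-scan init
    cephfs-data-scan scan_extents cephfs_data
    cephfs-data-scan scan_inodes cephfs_data
    cephfs-data-scan scan_links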


> On Monday, October 7, 2019, 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool before because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant by "turn it off and rebuild from the remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >-3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > > 4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > > lease_expire=0.00 has v0 lc 4549352
> > >-2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > > closest pinned map ver 252615 not available! error: (2) No such file or 
> > > directory
> > >-1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > > OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' 
> > > thread 7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > > (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > > const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> > >  5: 
> > > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> > >  unsigned long)+0x8c) [0x717c3c]
> > >  6: (PaxosService::maybe_trim()+0x473) [0x707443]
> > >  7: (Monitor::tick()+0xa9) [0x5ecf39]
> > >  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> > >  9: (Context::complete(int)+0x9) [0x6070d9]
> > >  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> > >  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> > >  12: (()+0x76ba) [0x7f6e67cab6ba]
> > >  13: (clone()+0x6d) [0x7f6e674d441d]
> > >
> > >  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal 
> > > (Aborted) **
> > >  in thread 7f6e5d461700 thread_name:safe_timer
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > > 

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-08 Thread Philippe D'Anjou
Hi,
unfortunately it's a single mon, because we had a major outage on this cluster
and it's just being used to copy off data now. We weren't able to add more
mons because once a second mon was added it crashed the first one (there's a
bug tracker ticket).
I still have the old rocksdb files from before I ran a repair on it, but they
had the rocksdb corruption issue (not sure why that happened; it had run fine
for two months).
Any options? I mean everything still works: data is accessible, RBDs run, only
the CephFS mount is obviously not working. For the short time the mon stays
up, it reports no issues and all commands run fine.

On Monday, October 7, 2019, 21:59:20 OESZ, Gregory Farnum wrote:
 
 On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
 wrote:
>
> I had to use the rocksdb repair tool before because the rocksdb files got
> corrupted, for another reason (possibly another bug). Maybe that is why it
> crash loops now, although it ran fine for a day.

Yeah looks like it lost a bit of data. :/

> What is meant by "turn it off and rebuild from the remainder"?

If only one monitor is crashing, you can remove it from the quorum,
zap all the disks, and add it back so that it recovers from its
healthy peers.
-Greg

>
> On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
>
>
> Hmm, that assert means the monitor tried to grab an OSDMap it had on
> disk but it didn't work. (In particular, a "pinned" full map which we
> kept around after trimming the others to save on disk space.)
>
> That *could* be a bug where we didn't have the pinned map and should
> have (or incorrectly thought we should have), but this code was in
> Mimic as well as Nautilus and I haven't seen similar reports. So it
> could also mean that something bad happened to the monitor's disk or
> Rocksdb store. Can you turn it off and rebuild from the remainder, or
> do they all exhibit this bug?
>
>
> On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
>  wrote:
> >
> > Hi,
> > our mon is acting up all of a sudden and dying in crash loop with the 
> > following:
> >
> >
> > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> >    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> >mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> >is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has 
> >v0 lc 4549352
> >    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> >mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> >closest pinned map ver 252615 not available! error: (2) No such file or 
> >directory
> >    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> >/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> >OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> >7f6e5d461700 time 2019-10-04 14:00:24.347580
> > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> >(stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> >const*)+0x152) [0x7f6e68eb064e]
> >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> >char const*, ...)+0) [0x7f6e68eb0829]
> >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> >ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  5: 
> >(OSDMonitor::encode_trim_extra(std::shared_ptr, 
> >unsigned long)+0x8c) [0x717c3c]
> >  6: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  7: (Monitor::tick()+0xa9) [0x5ecf39]
> >  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> >  9: (Context::complete(int)+0x9) [0x6070d9]
> >  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> >  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> >  12: (()+0x76ba) [0x7f6e67cab6ba]
> >  13: (clone()+0x6d) [0x7f6e674d441d]
> >
> >      0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) 
> >**
> >  in thread 7f6e5d461700 thread_name:safe_timer
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> >(stable)
> >  1: (()+0x11390) [0x7f6e67cb5390]
> >  2: (gsignal()+0x38) [0x7f6e67402428]
> >  3: (abort()+0x16a) [0x7f6e6740402a]
> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> >const*)+0x1a3) [0x7f6e68eb069f]
> >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> >char const*, ...)+0) [0x7f6e68eb0829]
> >  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> >ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  8: 
> >(OSDMonitor::encode_trim_extra(std::shared_ptr, 
> >unsigned long)+0x8c) [0x717c3c]
> >  9: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  10: (Monitor::tick()+0xa9) 

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-07 Thread Gregory Farnum
On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
 wrote:
>
> I had to use the rocksdb repair tool before because the rocksdb files got
> corrupted, for another reason (possibly another bug). Maybe that is why it
> crash loops now, although it ran fine for a day.

Yeah looks like it lost a bit of data. :/

> What is meant by "turn it off and rebuild from the remainder"?

If only one monitor is crashing, you can remove it from the quorum,
zap all the disks, and add it back so that it recovers from its
healthy peers.
-Greg
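
Spelled out, that procedure looks roughly like the sketch below. MON_ID and
the paths are placeholders, and it only applies when a healthy quorum remains
(which, as later replies reveal, is not the case here since the cluster runs a
single mon):

    # On a healthy mon: drop the broken one from the monmap.
    ceph mon remove MON_ID

    # On the broken mon's host: stop it and wipe its store.
    systemctl stop ceph-mon@MON_ID
    rm -rf /var/lib/ceph/mon/ceph-MON_ID

    # Re-create it from the current monmap and mon keyring, then start it;
    # it will resynchronize from its healthy peers.
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    ceph-mon -i MON_ID --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-MON_ID
    systemctl start ceph-mon@MON_ID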

>
> On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
>
>
> Hmm, that assert means the monitor tried to grab an OSDMap it had on
> disk but it didn't work. (In particular, a "pinned" full map which we
> kept around after trimming the others to save on disk space.)
>
> That *could* be a bug where we didn't have the pinned map and should
> have (or incorrectly thought we should have), but this code was in
> Mimic as well as Nautilus and I haven't seen similar reports. So it
> could also mean that something bad happened to the monitor's disk or
> Rocksdb store. Can you turn it off and rebuild from the remainder, or
> do they all exhibit this bug?
>
>
> On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
>  wrote:
> >
> > Hi,
> > our mon is acting up all of a sudden and dying in crash loop with the 
> > following:
> >
> >
> > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> >-3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> > is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has 
> > v0 lc 4549352
> >-2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > closest pinned map ver 252615 not available! error: (2) No such file or 
> > directory
> >-1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > 7f6e5d461700 time 2019-10-04 14:00:24.347580
> > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x152) [0x7f6e68eb064e]
> >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> > char const*, ...)+0) [0x7f6e68eb0829]
> >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  5: 
> > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> >  unsigned long)+0x8c) [0x717c3c]
> >  6: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  7: (Monitor::tick()+0xa9) [0x5ecf39]
> >  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> >  9: (Context::complete(int)+0x9) [0x6070d9]
> >  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> >  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> >  12: (()+0x76ba) [0x7f6e67cab6ba]
> >  13: (clone()+0x6d) [0x7f6e674d441d]
> >
> >  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) 
> > **
> >  in thread 7f6e5d461700 thread_name:safe_timer
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > (stable)
> >  1: (()+0x11390) [0x7f6e67cb5390]
> >  2: (gsignal()+0x38) [0x7f6e67402428]
> >  3: (abort()+0x16a) [0x7f6e6740402a]
> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x1a3) [0x7f6e68eb069f]
> >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> > char const*, ...)+0) [0x7f6e68eb0829]
> >  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  8: 
> > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> >  unsigned long)+0x8c) [0x717c3c]
> >  9: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  10: (Monitor::tick()+0xa9) [0x5ecf39]
> >  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> >  12: (Context::complete(int)+0x9) [0x6070d9]
> >  13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> >  14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> >  15: (()+0x76ba) [0x7f6e67cab6ba]
> >  16: (clone()+0x6d) [0x7f6e674d441d]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
> > to interpret this.
> >
> >
> > This was running fine for 2months now, it's a crashed cluster that is in 
> > recovery.
> >
> > Any suggestions?
>


Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-06 Thread Philippe D'Anjou
I had to use the rocksdb repair tool before because the rocksdb files got
corrupted, for another reason (possibly another bug). Maybe that is why it
crash loops now, although it ran fine for a day.
What is meant by "turn it off and rebuild from the remainder"?

On Saturday, October 5, 2019, 02:03:44 OESZ, Gregory Farnum wrote:
 
 Hmm, that assert means the monitor tried to grab an OSDMap it had on
disk but it didn't work. (In particular, a "pinned" full map which we
kept around after trimming the others to save on disk space.)

That *could* be a bug where we didn't have the pinned map and should
have (or incorrectly thought we should have), but this code was in
Mimic as well as Nautilus and I haven't seen similar reports. So it
could also mean that something bad happened to the monitor's disk or
Rocksdb store. Can you turn it off and rebuild from the remainder, or
do they all exhibit this bug?


On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
 wrote:
>
> Hi,
> our mon is acting up all of a sudden and dying in crash loop with the 
> following:
>
>
> 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
>    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
>mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
>is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has v0 
>lc 4549352
>    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
>mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
>closest pinned map ver 252615 not available! error: (2) No such file or 
>directory
>    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
>/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
>OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
>7f6e5d461700 time 2019-10-04 14:00:24.347580
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
>(stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>const*)+0x152) [0x7f6e68eb064e]
>  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
>char const*, ...)+0) [0x7f6e68eb0829]
>  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
>ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
>ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  5: 
>(OSDMonitor::encode_trim_extra(std::shared_ptr, 
>unsigned long)+0x8c) [0x717c3c]
>  6: (PaxosService::maybe_trim()+0x473) [0x707443]
>  7: (Monitor::tick()+0xa9) [0x5ecf39]
>  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  9: (Context::complete(int)+0x9) [0x6070d9]
>  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  12: (()+0x76ba) [0x7f6e67cab6ba]
>  13: (clone()+0x6d) [0x7f6e674d441d]
>
>      0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
>  in thread 7f6e5d461700 thread_name:safe_timer
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
>(stable)
>  1: (()+0x11390) [0x7f6e67cb5390]
>  2: (gsignal()+0x38) [0x7f6e67402428]
>  3: (abort()+0x16a) [0x7f6e6740402a]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>const*)+0x1a3) [0x7f6e68eb069f]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
>char const*, ...)+0) [0x7f6e68eb0829]
>  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
>ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
>ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  8: 
>(OSDMonitor::encode_trim_extra(std::shared_ptr, 
>unsigned long)+0x8c) [0x717c3c]
>  9: (PaxosService::maybe_trim()+0x473) [0x707443]
>  10: (Monitor::tick()+0xa9) [0x5ecf39]
>  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  12: (Context::complete(int)+0x9) [0x6070d9]
>  13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  15: (()+0x76ba) [0x7f6e67cab6ba]
>  16: (clone()+0x6d) [0x7f6e674d441d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>interpret this.
>
>
> This was running fine for 2months now, it's a crashed cluster that is in 
> recovery.
>
> Any suggestions?


Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-04 Thread Gregory Farnum
Hmm, that assert means the monitor tried to grab an OSDMap it had on
disk but it didn't work. (In particular, a "pinned" full map which we
kept around after trimming the others to save on disk space.)

That *could* be a bug where we didn't have the pinned map and should
have (or incorrectly thought we should have), but this code was in
Mimic as well as Nautilus and I haven't seen similar reports. So it
could also mean that something bad happened to the monitor's disk or
Rocksdb store. Can you turn it off and rebuild from the remainder, or
do they all exhibit this bug?
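
One way to check whether that pinned full map is really gone, before deciding
between surgery and a rebuild, is to inspect the store directly with the mon
stopped. MON_ID and the paths are placeholders, and the assumed key layout
("osdmap" prefix, "full_<epoch>" keys) should be verified with "list" first;
safest against a copy of the mon data directory:

    # Work on a copy of the mon data directory.
    cp -a /var/lib/ceph/mon/ceph-MON_ID /root/mon-copy

    # List which full osdmaps the store still has around the failing epoch:
    ceph-kvstore-tool rocksdb /root/mon-copy/store.db list osdmap | grep full_2526

    # Try to read the exact epoch the assert complains about:
    ceph-monstore-tool /root/mon-copy get osdmap -- --version 252615 \
        --out /tmp/osdmap.252615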


On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
 wrote:
>
> Hi,
> our mon is acting up all of a sudden and dying in crash loop with the 
> following:
>
>
> 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has v0 
> lc 4549352
> -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> closest pinned map ver 252615 not available! error: (2) No such file or 
> directory
> -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> 7f6e5d461700 time 2019-10-04 14:00:24.347580
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x152) [0x7f6e68eb064e]
>  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  5: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  6: (PaxosService::maybe_trim()+0x473) [0x707443]
>  7: (Monitor::tick()+0xa9) [0x5ecf39]
>  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  9: (Context::complete(int)+0x9) [0x6070d9]
>  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  12: (()+0x76ba) [0x7f6e67cab6ba]
>  13: (clone()+0x6d) [0x7f6e674d441d]
>
>  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
>  in thread 7f6e5d461700 thread_name:safe_timer
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (()+0x11390) [0x7f6e67cb5390]
>  2: (gsignal()+0x38) [0x7f6e67402428]
>  3: (abort()+0x16a) [0x7f6e6740402a]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x1a3) [0x7f6e68eb069f]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  8: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  9: (PaxosService::maybe_trim()+0x473) [0x707443]
>  10: (Monitor::tick()+0xa9) [0x5ecf39]
>  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  12: (Context::complete(int)+0x9) [0x6070d9]
>  13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  15: (()+0x76ba) [0x7f6e67cab6ba]
>  16: (clone()+0x6d) [0x7f6e674d441d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
>
> This was running fine for 2months now, it's a crashed cluster that is in 
> recovery.
>
> Any suggestions?