Re: [ceph-users] another assertion failure in monitor

2014-03-14 Thread Pawel Veselov
 This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
 started seeing failed assertions of (version == pg_map.version) in
 PGMonitor.cc:273, but only on one monitor (d). I attempted to resync the
 failing monitor (d) with --force-sync from (c). (d) started to work, but
 (c) started to fail with the (version == pg_map.version) assertion. So, I
 tried re-syncing (c) from (d) with --force-resync. That's when (c)
 started to fail with this particular (ret == 0) assertion. I don't really
 think the resync actually worked at all at that point.

 Based on this, my guess is that you managed to bork the mon stores of both
 'c' and 'd'.  See, when you force a sync you're basically telling the
 monitor to delete its store's contents and sync from somebody else.  If 'c'
 had a broken store after the conversion, that would have been propagated to
 'd'.  Once you forced the sync of 'c', then the problem would have been
 propagated from 'd' to 'c'.


Well, nothing suggested that (c) was having any problems, besides being
lonely. That's why I asked (d) to re-sync from it (expecting exactly that
it would rebuild the monitor store on (d), which was the one failing).
Apparently, (c) wasn't in good shape either, but that wasn't obvious.





 I didn't find a way to fix this quickly enough, so I restored the mon
 directories from backup, and started again. The (version ==
 pg_map.version) assertion came back, since my backup had been taken
 before I tried the force-resync, but after the migration had already
 started (it was stupid of me not to have backed up before the
 migration). (That's the point when I tried all kinds of crazy stuff for
 a while.)

 After some poking around, what I ended up doing was simply removing the
 'store.db' directory from the monitor fs and starting the monitors.
 That just re-initiated the migration, and this time it was done in the
 absence of client requests, one monitor at a time.


 And in a case like this, I would think this was a smart choice, allowing
 the monitors to reconvert the store from the old plain, file-based format
 to the new store.db format.  Given it worked, my guess is that the source
 of all your issues was an improperly converted monitor store -- but, once
 again, without the logs we can't ever be sure. :(


Well, at this point I'm simply glad it worked. The situation was an
"OMG, the deployment is upside down, things get lost easily" kind of
moment :)


Re: [ceph-users] another assertion failure in monitor

2014-03-13 Thread Joao Eduardo Luis

On 03/11/2014 05:59 PM, Pawel Veselov wrote:

On Tue, Mar 11, 2014 at 9:15 AM, Joao Eduardo Luis
joao.l...@inktank.com wrote:

On 03/10/2014 10:30 PM, Pawel Veselov wrote:


Now I'm getting this. Maybe you have an idea of what can be done to
straighten this up?


This is weird.  Can you please share the steps taken until this was
triggered, as well as the rest of the log?


At this point, no, sorry.

This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
started seeing failed assertions of (version == pg_map.version) in
PGMonitor.cc:273, but only on one monitor (d). I attempted to resync the
failing monitor (d) with --force-sync from (c). (d) started to work, but
(c) started to fail with the (version == pg_map.version) assertion. So, I
tried re-syncing (c) from (d) with --force-resync. That's when (c)
started to fail with this particular (ret == 0) assertion. I don't really
think the resync actually worked at all at that point.


Considering you were upgrading from bobtail, any issues you found after
the upgrade may have had something to do with an improper store
conversion -- usually caused by somehow (explicitly or inadvertently)
killing the monitor during the conversion.  Or they may not have, but we
will never know without logs from back then.


Based on this, my guess is that you managed to bork the mon stores of 
both 'c' and 'd'.  See, when you force a sync you're basically telling 
the monitor to delete its store's contents and sync from somebody else. 
 If 'c' had a broken store after the conversion, that would have been 
propagated to 'd'.  Once you forced the sync of 'c', then the problem 
would have been propagated from 'd' to 'c'.
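
Purely as an illustration (this is not Ceph code, and the store names and
contents below are made up), a toy Python sketch of why a forced sync can
only replicate whatever the sync source holds:

# Toy model of a forced monitor sync: the target throws away its own
# store and copies whatever the source has, breakage included.
def force_sync(target, source):
    target.clear()         # the target's store contents are discarded
    target.update(source)  # ...and replaced wholesale by the source's

store_c = {"pgmap": "silently broken after conversion"}  # hypothetical
store_d = {"pgmap": "ok"}

force_sync(store_d, source=store_c)  # re-sync 'd' from 'c'
assert store_d == store_c            # 'd' now carries the same breakage
force_sync(store_c, source=store_d)  # syncing 'c' back from 'd' can't help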




I didn't find a way to fix this quickly enough, so I restored the mon
directories from backup, and started again. The (version ==
pg_map.version) assertion came back, since my backup had been taken
before I tried the force-resync, but after the migration had already
started (it was stupid of me not to have backed up before the
migration). (That's the point when I tried all kinds of crazy stuff for
a while.)

After some poking around, what I ended up doing was simply removing the
'store.db' directory from the monitor fs and starting the monitors.
That just re-initiated the migration, and this time it was done in the
absence of client requests, one monitor at a time.


And in a case like this, I would think this was a smart choice, allowing 
the monitors to reconvert the store from the old plain, file-based 
format to the new store.db format.  Given it worked, my guess is that 
the source of all your issues was an improperly converted monitor store 
-- but, once again, without the logs we can't ever be sure. :(
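
In case somebody else ends up in the same spot, here is a rough sketch
(plain Python, not an official tool) of that recovery. The data directory
layout and the 'service ceph' commands are assumptions for a typical
sysvinit deployment, so adjust them to your setup, and keep a backup of
every mon directory before touching anything:

#!/usr/bin/env python
# Rough, unofficial sketch of the recovery described above: stop one
# monitor at a time, keep a backup, drop only the converted store.db so
# the monitor redoes the store conversion when it starts up again.
import os
import shutil
import subprocess

MON_IDS = ["c", "d"]                    # hypothetical monitor ids
DATA_DIR = "/var/lib/ceph/mon/ceph-%s"  # assumed default mon data layout

for mon_id in MON_IDS:
    data_dir = DATA_DIR % mon_id
    store_db = os.path.join(data_dir, "store.db")

    # Stop the monitor (the command depends on your distro/init system).
    subprocess.check_call(["service", "ceph", "stop", "mon.%s" % mon_id])

    # Keep a copy of the whole mon directory before touching anything.
    shutil.copytree(data_dir, data_dir + ".bak")

    # Remove only the converted store; the old file-based data stays in
    # place, so the monitor re-runs the conversion on startup.
    if os.path.isdir(store_db):
        shutil.rmtree(store_db)

    # Start the monitor and let it finish converting before the next one.
    subprocess.check_call(["service", "ceph", "start", "mon.%s" % mon_id])

The important parts are the ones already called out in the thread: do it
one monitor at a time, and do it while no client requests are hitting the
cluster.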


  -Joao






   0 2014-03-10 22:26:23.757166 7fc0397e5700 -1 mon/AuthMonitor.cc:
In function 'virtual void AuthMonitor::create_initial()' thread
7fc0397e5700 time 2014-03-10 22:26:23.755442
mon/AuthMonitor.cc: 101: FAILED assert(ret == 0)

   ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
   1: (AuthMonitor::create_initial()+0x4d8) [0x637bb8]
   2: (PaxosService::_active()+0x51b) [0x594fcb]
   3: (Context::complete(int)+0x9) [0x565499]
   4: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x95) [0x5698b5]
   5: (Paxos::handle_accept(MMonPaxos*)+0x885) [0x589595]
   6: (Paxos::dispatch(PaxosServiceMessage*)+0x28b) [0x58d66b]
   7: (Monitor::dispatch(MonSession*, Message*, bool)+0x4f0) [0x563620]
   8: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
   9: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


Re: [ceph-users] another assertion failure in monitor

2014-03-11 Thread Joao Eduardo Luis

On 03/10/2014 10:30 PM, Pawel Veselov wrote:


Now I'm getting this. Maybe you have an idea of what can be done to
straighten this up?


This is weird.  Can you please share the steps taken until this was 
triggered, as well as the rest of the log?


  -Joao



-12 2014-03-10 22:26:23.748783 7fc0397e5700  0 log [INF] : mdsmap
e1: 0/0/1 up
-11 2014-03-10 22:26:23.748793 7fc0397e5700 10 send_log to self
-10 2014-03-10 22:26:23.748795 7fc0397e5700 10  log_queue is 4
last_log 4 sent 2 num 4 unsent 2 sending 2
 -9 2014-03-10 22:26:23.748800 7fc0397e5700 10  will send
2014-03-10 22:26:23.715607 mon.0 10.16.20.11:6789/0 2 : [INF] mon.c@0
won leader election with quorum 0,1
 -8 2014-03-10 22:26:23.748809 7fc0397e5700 10  will send
2014-03-10 22:26:23.748677 mon.0 10.16.20.11:6789/0 3 : [INF] pgmap v1:
0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
 -7 2014-03-10 22:26:23.748817 7fc0397e5700  1 --
10.16.20.11:6789/0 --> mon.0 10.16.20.11:6789/0 -- log(2 entries) v1 --
?+0 0x25f21c0
 -6 2014-03-10 22:26:23.749128 7fc0397e5700  5
mon.c@0(leader).paxos(paxos active c 1..2) queue_proposal bl 1200 bytes;
ctx = 0x2568240
 -5 2014-03-10 22:26:23.754248 7fc0397e5700  1 --
10.16.20.11:6789/0 --> mon.1 10.16.43.12:6789/0 -- paxos(begin lc 2 fc 0
pn 100 opn 0) v3 -- ?+0 0x26b2300
 -4 2014-03-10 22:26:23.754743 7fc0397e5700  5
mon.c@0(leader).paxos(paxos updating c 1..2) queue_proposal bl 449
bytes; ctx = 0x2568230
 -3 2014-03-10 22:26:23.754761 7fc0397e5700  5
mon.c@0(leader).paxos(paxos updating c 1..2) propose_new_value not
active; proposal queued
 -2 2014-03-10 22:26:23.754838 7fc0397e5700  5
mon.c@0(leader).paxos(paxos updating c 1..2) queue_proposal bl 471
bytes; ctx = 0x2568290
 -1 2014-03-10 22:26:23.754853 7fc0397e5700  5
mon.c@0(leader).paxos(paxos updating c 1..2) propose_new_value not
active; proposal queued
  0 2014-03-10 22:26:23.757166 7fc0397e5700 -1 mon/AuthMonitor.cc:
In function 'virtual void AuthMonitor::create_initial()' thread
7fc0397e5700 time 2014-03-10 22:26:23.755442
mon/AuthMonitor.cc: 101: FAILED assert(ret == 0)

  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
  1: (AuthMonitor::create_initial()+0x4d8) [0x637bb8]
  2: (PaxosService::_active()+0x51b) [0x594fcb]
  3: (Context::complete(int)+0x9) [0x565499]
  4: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x95) [0x5698b5]
  5: (Paxos::handle_accept(MMonPaxos*)+0x885) [0x589595]
  6: (Paxos::dispatch(PaxosServiceMessage*)+0x28b) [0x58d66b]
  7: (Monitor::dispatch(MonSession*, Message*, bool)+0x4f0) [0x563620]
  8: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
  9: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]
  10: (DispatchQueue::entry()+0x582) [0x7de6c2]
  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d994d]
  12: (()+0x7ddb) [0x7fc03e736ddb]
  13: (clone()+0x6d) [0x7fc03d46ca1d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.







--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


Re: [ceph-users] another assertion failure in monitor

2014-03-11 Thread Pawel Veselov
On Tue, Mar 11, 2014 at 9:15 AM, Joao Eduardo Luis joao.l...@inktank.com wrote:

 On 03/10/2014 10:30 PM, Pawel Veselov wrote:


 Now I'm getting this. Maybe you have an idea of what can be done to
 straighten this up?


 This is weird.  Can you please share the steps taken until this was
 triggered, as well as the rest of the log?


At this point, no, sorry.

This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
started seeing failed assertions of (version == pg_map.version) in
PGMonitor.cc:273, but only on one monitor (d). I attempted to resync the
failing monitor (d) with --force-sync from (c). (d) started to work, but
(c) started to fail with the (version == pg_map.version) assertion. So, I
tried re-syncing (c) from (d) with --force-resync. That's when (c)
started to fail with this particular (ret == 0) assertion. I don't really
think the resync actually worked at all at that point.

I didn't find a way to fix this quickly enough, so I restored the mon
directories from backup, and started again. The (version ==
pg_map.version) assertion came back, since my backup had been taken
before I tried the force-resync, but after the migration had already
started (it was stupid of me not to have backed up before the
migration). (That's the point when I tried all kinds of crazy stuff for
a while.)

After some poking around, what I ended up doing was simply removing the
'store.db' directory from the monitor fs and starting the monitors.
That just re-initiated the migration, and this time it was done in the
absence of client requests, one monitor at a time.



   0 2014-03-10 22:26:23.757166 7fc0397e5700 -1 mon/AuthMonitor.cc:
 In function 'virtual void AuthMonitor::create_initial()' thread
 7fc0397e5700 time 2014-03-10 22:26:23.755442
 mon/AuthMonitor.cc: 101: FAILED assert(ret == 0)

   ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
   1: (AuthMonitor::create_initial()+0x4d8) [0x637bb8]
   2: (PaxosService::_active()+0x51b) [0x594fcb]
   3: (Context::complete(int)+0x9) [0x565499]
   4: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x95) [0x5698b5]
   5: (Paxos::handle_accept(MMonPaxos*)+0x885) [0x589595]
   6: (Paxos::dispatch(PaxosServiceMessage*)+0x28b) [0x58d66b]
   7: (Monitor::dispatch(MonSession*, Message*, bool)+0x4f0) [0x563620]
   8: (Monitor::_ms_dispatch(Message*)+0x1fb) [0x5639fb]
   9: (Monitor::ms_dispatch(Message*)+0x32) [0x57f212]


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com