>> This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
>> started seeing failed assertions of (version == pg_map.version) in
>> PGMonitor.cc:273, but on one monitor (d) only. I attempted to resync the
>> failing monitor (d) with --force-sync from (c). (d) started to work, but
>> (c) started to fail with (version==pg_map.version) assertion. So, I
>> tried re-syncing (c) from (d) with --force-sync. That's when (c)
>> started to fail with this particular (ret==0) assertion. I don't really
>> think the resync actually worked at all at that point.
>>
> Based on this, my guess is that you managed to bork the mon stores of both
> 'c' and 'd'.  See, when you force a sync you're basically telling the
> monitor to delete its store's contents and sync from somebody else.  If 'c'
> had a broken store after the conversion, that would have been propagated to
> 'd'.  Once you forced the sync of 'c', then the problem would have been
> propagated from 'd' to 'c'.


Well, nothing suggested that (c) was having any problems, besides being
lonely. That's why I asked (d) to re-sync from it, expecting exactly that
it would rebuild the monitor store on (d), which was failing. Apparently,
(c) wasn't any good either, but it wasn't obvious.


>
>> I didn't find a way to fix this quickly enough, so I restored the mon
>> directories from backup and started again. The (version ==
>> pg_map.version) assertion came back, because my backup was taken before
>> I tried the forced resync, but not before the migration started (it was
>> stupid of me not to back up before the migration). That's the point
>> when I tried all kindsa crazy stuff for a while.
>>
>> After some poking around, what I ended up doing was simply removing the
>> 'store.db' directory from the monitor fs and starting the monitors.
>> That just re-initiated the migration, and this time it was done in the
>> absence of client requests, and one monitor at a time.
>>
>
> And in a case like this, I would think this was a smart choice, allowing
> the monitors to reconvert the store from the old plain, file-based format
> to the new store.db format.  Given it worked, my guess is that the source
> of all your issues was an improperly converted monitor store -- but, once
> again, without the logs we can't ever be sure. :(
>

Well, at this point I'm simply glad it worked. The situation was "OMG, the
deployment is upside down", and things get lost easily :)
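
For anyone who lands on this thread later: the step that worked here
(getting 'store.db' out of the way so the monitor reconverts from the old
file-based store) can be sketched roughly as below. This is only a sketch
under assumptions, not a tested procedure. The function name, the paths in
the usage note, and the choice to move 'store.db' aside rather than delete
it are all illustrative and not from this thread. It also assumes the
monitor daemon has already been stopped, and that monitors are handled one
at a time as described above.

```python
import shutil
import time
from pathlib import Path


def set_aside_store_db(mon_data_dir, backup_root):
    """Back up a stopped monitor's data dir, then move store.db out of it.

    Hypothetical helper. backup_root must live outside mon_data_dir,
    otherwise the copy would try to recurse into itself.
    """
    mon_data = Path(mon_data_dir)
    store_db = mon_data / "store.db"
    if not store_db.is_dir():
        raise FileNotFoundError(f"no store.db directory under {mon_data}")

    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / f"{mon_data.name}-{stamp}"

    # Take a full copy of the monitor directory first, so the step can be
    # undone if the reconversion does not go as hoped.
    shutil.copytree(mon_data, backup_dir, symlinks=True)

    # Move store.db aside instead of deleting it outright.
    aside = Path(backup_root) / f"{mon_data.name}-store.db-{stamp}"
    shutil.move(str(store_db), str(aside))
    return backup_dir, aside
```

Usage would be something like set_aside_store_db("/path/to/mon/data",
"/path/to/backups") for one stopped monitor, then starting that monitor and
letting it finish converting before touching the next one. Both paths are
placeholders.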
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
