>> This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
>> started seeing failed assertions of (version == pg_map.version) in
>> PGMonitor.cc:273, but on one monitor (d) only. I attempted to resync the
>> failing monitor (d) with --force-sync from (c). (d) started to work, but
>> (c) started to fail with (version==pg_map.version) assertion. So, I
>> tried re-syncing (c) from (d) with --force-resync. That's when (c)
>> started to fail with this particular (ret==0) assertion. I don't really
>> think that resyncing actually worked any at that point.
>>
>
> Based on this, my guess is that you managed to bork the mon stores of both
> 'c' and 'd'. See, when you force a sync you're basically telling the
> monitor to delete its store's contents and sync from somebody else. If 'c'
> had a broken store after the conversion, that would have been propagated to
> 'd'. Once you forced the sync of 'c', then the problem would have been
> propagated from 'd' to 'c'.

Well, nothing suggested that (c) was having any problems, besides being
lonely. That's why I asked (d) to re-sync from it, expecting exactly that it
would rebuild the monitor store on (d), which was failing. Apparently, (c)
wasn't any good either, but it wasn't obvious.

>> I didn't find a way to fix this quickly enough, so I restored the mon
>> directories from back up, and started again. The (version ==
>> pg_map.version) came back, but my back-up was taken before I was trying
>> to do force-resync, but not before the migration started (that was
>> stupid of me to not have backed up before migration). (That's the point
>> when I tried all kindsa crazy stuff for a while).
>>
>> After some poking around, what I ended up doing is plain removing
>> 'store.db' directory from the monitor fs, and starting the monitors.
>> That just re-initiated the migration, and this time it was done in the
>> absence of client requests, and one monitor at a time.
>>
>
> And in a case like this, I would think this was a smart choice, allowing
> the monitors to reconvert the store from the old plain, file-based format
> to the new store.db format. Given it worked, my guess is that the source
> of all your issues was an improperly converted monitor store -- but, once
> again, without the logs we can't ever be sure. :(
>

Well, at this point I'm simply glad it worked. The situation was "OMG, the
deployment is upside down", and things get lost easily :)

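For reference, the "back up first, then remove store.db" step described above
could be sketched roughly as follows. This is only an illustration, not a Ceph
tool: the function name, the backup location and any paths passed in are
assumptions, and the monitor must be stopped before its data directory is
touched.

```python
import shutil
import time
from pathlib import Path


def backup_and_remove_store_db(mon_data: Path, backup_root: Path) -> Path:
    """Copy the whole monitor data dir to a backup, then remove store.db.

    Hypothetical helper (not part of Ceph). Run it only while the monitor
    daemon for this data directory is stopped. Returns the backup path.
    """
    store = mon_data / "store.db"
    if not store.is_dir():
        raise FileNotFoundError(f"no store.db under {mon_data}")

    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = backup_root / f"{mon_data.name}-{stamp}"

    # Full copy first, so both the old file-based store and the current
    # store.db can be restored if the re-conversion goes wrong.
    shutil.copytree(mon_data, backup)

    # Refuse to delete anything unless the backup really holds store.db.
    if not (backup / "store.db").is_dir():
        raise RuntimeError(f"backup at {backup} is incomplete")

    # Remove only store.db; the rest of the data dir is left in place.
    shutil.rmtree(store)
    return backup
```

It would be called once per monitor, one monitor at a time, with that
monitor's data directory and a backup directory outside it, before starting
the monitor again.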
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com