At UMich we ran into the same issue a while ago, and I wound up backporting the patch to 1.4. Since then, we haven't had the negative transaction ID issue crop up again.
--
Mike Garrison

On Thu, Feb 13, 2014 at 10:29 AM, Arne Wiebalck <arne.wieba...@cern.ch> wrote:
> Just to confirm: I restarted all VL servers in our cell, the transaction
> IDs were reset, and so far I haven't been able to reproduce the problem.
> So it seems it was indeed the negative transaction ID problem in 1.4 VL
> servers mentioned earlier in this thread.
>
> Cheers,
>  Arne
>
>
> On Feb 11, 2014, at 6:25 PM, Arne Wiebalck <arne.wieba...@cern.ch> wrote:
>
> Thanks Andrew and Derrick!
>
> We've seen the "major synchronisation error" as well when trying to
> provoke the problem. This, and the fact that we had the very same quorum
> issue about one year ago, when restarting the VLDB servers made the
> problem go away for some time, seems to indicate it is indeed the issue
> you mention. That was when we first added 1.6 servers to our cell, btw.
> Apparently, we're pretty lucky ;)
>
> I'll restart our VLDB servers ...
>
> Thanks!
>  Arne
>
>
> On Feb 11, 2014, at 5:48 PM, D Brashear <sha...@gmail.com> wrote:
>
> The 1.4/1.6 issue is surely a red herring. You hit the nail on the head
> when you mentioned negative transaction IDs. There was a bugfix early in
> the 1.6 series that handles this; you probably want to restart all your
> dbservers, so you can start counting up to rollover again, until you get
> to the point of updating them.
>
>
> On Tue, Feb 11, 2014 at 4:50 AM, Arne Wiebalck <arne.wieba...@cern.ch> wrote:
>>
>> Hi,
>>
>> We've recently added some 1.6.6 servers to our cell, which is mainly on
>> 1.4.15 (i.e. most of the file servers and the DB servers). We now
>> encounter quorum problems with our VLDB servers.
>>
>> The primary symptom is that releases fail with "u: no quorum elected".
>>
>> VLLog on the sync site shows at that moment:
>> -->
>> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.51 is back up: will be
>> contacted through 137.138.246.51
>> Tue Feb 11 09:51:47 2014 ubik:server 137.138.246.50 is back up: will be
>> contacted through 137.138.246.50
>> <--
>> where 137.138.246.50 and 137.138.246.51 are the non-sync sites.
>>
>> We can relatively easily trigger this problem by moving volumes between
>> 1.4- and 1.6-based servers (1.4/1.4 and 1.6/1.6 transfers seem OK): the
>> move gets stuck and udebug shows
>>
>> -->
>> Host's addresses are: 137.138.128.148
>> Host's 137.138.128.148 time is Tue Feb 11 09:38:02 2014
>> Local time is Tue Feb 11 09:38:02 2014 (time differential 0 secs)
>> Last yes vote for 137.138.128.148 was 5 secs ago (sync site);
>> Last vote started 5 secs ago (at Tue Feb 11 09:37:57 2014)
>> Local db version is 1392106724.154
>> I am sync site until 54 secs from now (at Tue Feb 11 09:38:56 2014) (3 servers)
>> Recovery state f
>> I am currently managing write trans 1392106724.-1892301558
>> Sync site's db version is 1392106724.154
>> 0 locked pages, 0 of them for write
>> There are write locks held
>> Last time a new db version was labelled was:
>>     1158 secs ago (at Tue Feb 11 09:18:44 2014)
>>
>> Server (137.138.246.51): (db 1392106724.153)
>>     last vote rcvd 20 secs ago (at Tue Feb 11 09:37:42 2014),
>>     last beacon sent 20 secs ago (at Tue Feb 11 09:37:42 2014), last vote was yes
>>     dbcurrent=0, up=0 beaconSince=0
>>
>> Server (137.138.246.50): (db 1392106724.154)
>>     last vote rcvd 6 secs ago (at Tue Feb 11 09:37:56 2014),
>>     last beacon sent 5 secs ago (at Tue Feb 11 09:37:57 2014), last vote was yes
>>     dbcurrent=1, up=1 beaconSince=1
>> <--
>>
>> Note that the sync site has gone to Recovery state f, and that the time
>> at which the last vote was received on the other two servers lags by
>> quite a gap, which grows over time. Is the negative trans ID ok?
>>
>> At some point the sync site loses its sync site state:
>>
>> -->
>> Host's addresses are: 137.138.128.148
>> Host's 137.138.128.148 time is Tue Feb 11 09:41:01 2014
>> Local time is Tue Feb 11 09:41:02 2014 (time differential 1 secs)
>> Last yes vote for 137.138.128.148 was 4 secs ago (sync site);
>> Last vote started 4 secs ago (at Tue Feb 11 09:40:58 2014)
>> Local db version is 1392106724.206
>> I am not sync site
>> Lowest host 137.138.128.148 was set 4 secs ago
>> Sync host 137.138.128.148 was set 4 secs ago
>> I am currently managing write trans 1392106724.-1892283756
>> Sync site's db version is 1392106724.206
>> 0 locked pages, 0 of them for write
>> There are write locks held
>> Last time a new db version was labelled was:
>>     1337 secs ago (at Tue Feb 11 09:18:45 2014)
>> <--
>>
>> so there is no sync site any longer, and the vos command gets a
>> "no quorum" error.
>>
>> As this also happens when we do not move volumes around (at 3am, say),
>> but other operations such as the backup touch the volumes, I would
>> suspect that VLDB operations in general can trigger this.
>>
>> Is this a known issue?
>>
>> I had understood that it should be OK to run 1.4 and 1.6 file servers
>> in parallel, and that the DB servers could be updated after the file
>> servers, but maybe that is not correct?
>>
>> Thanks!
>>  Arne
>>
>>
>> --
>> Arne Wiebalck
>> CERN IT
>>
>
>
>
> --
> D
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info