It might actually be worth valgrinding.

On Fri, Apr 16, 2010 at 12:30 PM, Derrick Brashear <sha...@gmail.com> wrote:
> On Fri, Apr 16, 2010 at 12:19 PM, Marcus Watts <m...@umich.edu> wrote:
>> Derrick Brashear <sha...@gmail.com> sent:
>>
>>> Date: Thu, 15 Apr 2010 23:02:33 EDT
>>> To: Russ Allbery <r...@stanford.edu>
>>> cc: openafs-i...@openafs.org
>>> From: Derrick Brashear <sha...@gmail.com>
>>> Subject: Re: [OpenAFS] Re: Ubik problem
>>>
>>> On Thu, Apr 15, 2010 at 9:13 PM, Russ Allbery <r...@stanford.edu> wrote:
>>> > Andrew Deason <adea...@sinenomine.net> writes:
>>> >> Atro Tossavainen <atro.tossavainen+open...@helsinki.fi> wrote:
>>> >
>>> >>> Derrick,
>>> >>>
>>> >>> > I'd suggest just using the IBM binary for the kaserver (and only the
>>> >>> > kaserver) in your OpenAFS installation
>>> >>>
>>> >>> That's an interesting thought, but unfortunately it's nowhere near
>>> >>> an option. sunx86_ is quite simply not a supported platform for
>>> >>> IBM AFS at all, even at 3.6 Patch 19 (August 2009).
>>> >
>>> >> Older OpenAFS releases could be another option, but I don't know how
>>> >> useful of an answer that is. I'm not sure what could have caused that,
>>> >> so I don't have a particular range in mind; maybe just earlier 1.4...
>>> >> 1.4.9? 1.4.2?
>>> >
>>> > We were successfully running a 1.2.x version of kaserver on SPARC Solaris,
>>> > and upgrading to 1.4.2 on Linux failed (albeit with different symptoms; it
>>> > would just stop successfully giving out tickets for a while and then come
>>> > back, regularly), so we stuck with 1.2.x on SPARC until we turned it off
>>> > entirely.
>>>
>>> I'm pretty sure it "broke" between 1.2.11 and 1.4.1.
>>>
>>> -- 
>>> Derrick
>>
>> Gah. You made me drag out my kaserver notes! Worse! You made me
>> *run* the thing! Bad! Bad!
>>
>> "broke" is a pretty vague description, so...
>>
>> From the previous descriptions, it sounds like there might be ubik sync
>> issues.
>
> That's not what I was referring to. I think it's between ubik database
> reads and the clients.
>
>> That could be caused either by problems in ubik, or unrelated problems
>> that cause server crashes. The reports do not include notes on any resulting
>> core dumps, and the ubik problem reports clearly indicate another serious
>> problem with server address determination.
>>
>> I experimented with building a version of 1.2.11, running it and using some
>> of the diagnostic tools, followed by trying to run the resulting database
>> with 1.4.12. I certainly didn't thoroughly explore things. I now have an
>> interesting list of "problems".
>>
>> /1/ ubik_hdr.size got changed to be a short, not a long. ntohl is wrong.
>>     This is in ubik proper as well as kaserver diagnostics. Fortunately,
>>     this doesn't seem to break too much.
>> /2/ udebug address output byte swap issues. Previously mentioned as fixed.
>> /3/ kadb_check complains about a lot of stuff, and the output does not
>>     make much sense. A lot of this looks like endian issues, but
>>     also I think this tool probably started as a temporary hack and
>>     never well cleaned up. The output was probably never really
>>     "clean" in the first place.
>> /4/ I never got kaserver to core dump (granted, I'm not pushing it real
>>     hard.)
>>
>> I think at least in some basic way, the kaserver in 1.4.12 still "works".
>> So I am still curious as to what Derrick meant by "broke".
>>
>> Possible generic action items:
>> /1/ fix uhdr.size usage issues (ntohs/htons, not ntohl/htonl).
>> /2/ fix kadb_check to produce correct output. Should match on little
>>     and big-endian machines.
>> /3/ fix kadb_check to produce "better" output?
>>
>> For Atro Tossavainen, I think my recommendations are:
>> /1/ can he only run one source version of kaserver on all db hosts (not a
>>     mixed ibm/openafs env),
>> /2/ can he resolve the server setup such that when udebug is
>>     run, it only reports "correct" IP addresses?
>>     (Ideally only the primary, but the other interfaces should be ok so
>>     long as packets sent through them get to the same place.)
>> /3/ can he resolve time so that he never sees "last beacon sent -3 secs
>>     ago"? ubik does care, even more than kerberos, about time.
>> /4/ can he resolve his keyfile reference such that he never gets
>>     "unknown key version number"?
>>     (My suspicion, he's got path issues between differently built
>>     binaries.)
>
> no, because i suspect 4 is the "real issue"
>
>
> -- 
> Derrick
>
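Marcus's first action item (use ntohs/htons rather than ntohl/htonl on a 16-bit size field) is easiest to see in a small example. The sketch below is a hypothetical illustration, not OpenAFS code: struct sketch_hdr and the sketch_* functions are invented names, and the layout only loosely follows the thread's description of ubik_hdr.size as a short. It shows why the mistake can go unnoticed on a big-endian host such as SPARC but yields a wrong value on little-endian x86.

```c
#include <arpa/inet.h>  /* htons, ntohs, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk header, loosely modeled on the description in the
 * thread: a 32-bit magic number followed by two 16-bit fields. Names and
 * layout are assumptions for illustration, not the real ubik_hdr. */
struct sketch_hdr {
    uint32_t magic;
    uint16_t pad1;
    uint16_t size;   /* stored in network (big-endian) byte order */
};

/* Store a size in network byte order, as a writer would. */
static void sketch_set_size(struct sketch_hdr *h, uint16_t host_size)
{
    h->size = htons(host_size);
}

/* Correct: a 16-bit network-order field is converted with ntohs. */
static uint16_t sketch_get_size_ok(const struct sketch_hdr *h)
{
    return ntohs(h->size);
}

/* Buggy: ntohl widens the 16-bit value to 32 bits and then swaps all four
 * bytes. On a big-endian host ntohl is a no-op, so this happens to return
 * the stored value; on a little-endian host the two significant bytes end
 * up in the upper half of the 32-bit result. */
static uint32_t sketch_get_size_buggy(const struct sketch_hdr *h)
{
    return ntohl(h->size);
}

/* Round-trip check: the correct reader always recovers the stored value. */
static void sketch_check_roundtrip(uint16_t value)
{
    struct sketch_hdr h = {0};
    sketch_set_size(&h, value);
    assert(sketch_get_size_ok(&h) == value);
}
```

Storing 64 and reading it back with sketch_get_size_ok() gives 64 on any host; sketch_get_size_buggy() gives 64 on a big-endian host but 4194304 (0x00400000) on a little-endian one.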
-- 
Derrick
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info