Tom: The 1991 Zayas specifications are lacking in many regards. For starters, the Vxxx error codes are defined only for the Vol/VL RPCs and not for the FS/CM RPCs. Their use in the FS/CM RPCs is left undefined, and yet those errors are reported to cache managers by file servers.
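
To make "left undefined" concrete: whatever code the volume package hands back while the file server looks up a volume for an FS RPC simply becomes the return value of that RPC, so cache managers end up seeing Vol-package codes on the FS interface anyway. The sketch below only illustrates that pattern; the types and function names are stand-ins, not the actual fileserver code.

/* Illustration only: how a Vol-package error escapes through an FS RPC.
 * The types and helpers are stand-ins, not the real fileserver code. */

struct fs_call;                 /* stand-in for the incoming FS RPC context */
struct volume;                  /* stand-in for an attached volume */

/* On failure the volume package reports VNOVOL, VOFFLINE, VSALVAGE,
 * VNOSERVICE, VBUSY, ... through *code. */
struct volume *get_volume_for_call(struct fs_call *call, int volume_id, int *code);
void put_volume(struct volume *vp);

int
fetch_status_handler(struct fs_call *call, int volume_id)
{
    int code = 0;
    struct volume *vp = get_volume_for_call(call, volume_id, &code);

    if (!vp)
        return code;    /* the Vol-package code becomes the FS RPC result,
                         * even though the FS/CM spec never defines it there */

    /* ... normal status processing would happen here ... */
    put_volume(vp);
    return 0;
}
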
I think it was 2004 or perhaps early 2005 when a large user was concerned about VLDB scalability due to the introduction of tens of thousands of Windows clients into the environment. Each time a VNOVOL, VMOVED, VOFFLINE, VSALVAGE or VNOSERVICE error was received, the Windows client would query the VLDB and retry the request after 2 seconds. If a volume couldn't be served from a file server, this process would be repeated. The problem was exacerbated by the behavior of the Explorer Shell, which reads the contents of the directories it displays in search of various metadata. As a result the VLDB servers were struggling under the load.

It wasn't going to be possible to make the VLDB servers process more requests, so it was important to reduce the number of requests that were sent. The discussions that took place came to the conclusion that the description of VNOVOL was ambiguous and that, based upon actual usage, its meaning should be that the volume is not present. With that interpretation a client could restrict the number of VLDB lookups it performed for a volume.

I do not remember if these discussions took place at a hackathon, a workshop, or on Zephyr. Such use of the error codes didn't make a difference to deployed clients, since they acted on all of the error codes in an identical fashion, nor did it result in a protocol change, given the existing use in the file server. Perhaps others can find a reference in the Zephyr logs; I no longer have access to them.

Jeffrey Altman

On 5/4/2012 5:40 PM, Tom Keiser wrote:
> Hi,
>
> As some of you already know, sites have recently run into troubles caused by interpretation of various volume package special error codes. After looking at the Ed Zayas spec, and how the unix and Windows clients interpret the various codes in master and OpenAFS 1.0, I wanted to start a discussion about the slight redefinition of protocol error handling semantics over the past decade. According to the Zayas VVL spec, the relevant error codes have the following meanings:
>
> - VSALVAGE: volume needs to be salvaged
>
> - VNOVOL: the given volume is either not attached, doesn't exist, or is not online
>
> - VNOSERVICE: the volume is currently not in service
>
> - VOFFLINE: the specified volume is offline, for the reason given in the offline message field (a subfield within the volume field in struct volser_trans)
>
> - VBUSY: the named volume is temporarily unavailable, and the client is encouraged to retry the operation shortly
>
> By my reading of the above specification, VOFFLINE is strictly for use when offlineMessage is set in the VolumeDiskData file, whereas VNOVOL was intended to be the catch-all "it's not online" error code. Indeed, OpenAFS 1.0 volume.c more-or-less follows the above rubric. When working on DAFS many years ago, I tried to follow these definitions (although, admittedly, I got it wrong in a number of cases).
>
> Now, I must concede that the definitions in the Zayas spec are not terribly useful: they do not differentiate between "I don't have it" and "I won't give it to you", which is typically the fundamental question the cm is trying to answer. In this strict sense, I much prefer the way recent versions of the Windows CM utilize VNOVOL/VOFFLINE as a means of satisfying the existence question. However, as much as I like the cleanliness this approach provides, I am concerned about the seeming divergence between our implementations and our specification...
>
> It's certainly possible that I'm not privy to protocol discussions where it was decided that redefining VNOVOL, VNOSERVICE[*], and VOFFLINE was ok (given that legacy CMs seem to make little distinction between VOFFLINE, VNOVOL, VSALVAGE, VNOSERVICE, etc.). If that is the case, could someone provide more information from these discussions?
>
> Obviously, the current mismatch in behavior between DAFS and the Windows CM needs to be resolved posthaste. That we already have a wide deployment base of nodes in disagreement about the denotation of certain critical error codes is troubling--to the point that pragmatism may preclude us from strict adherence to the extant AFS-3 specification.
>
> This leaves me with two questions:
>
> 1) is there something that OpenAFS can do to resolve this issue without requiring any standards involvement?
>
> 2) if not, what is our stop-gap until we can fix this at the afs3-stds level?
>
> With regard to (1), I have some patches that modify DAFS to behave more like the Windows CM expects. However, before I consider pushing these patches to gerrit, I want to solicit opinions regarding these underlying questions...
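
For concreteness, the interpretation that came out of those discussions amounts to something like the following shape for a cache manager's handling of these codes. This is only a sketch, assuming the Vol-package error constants from the AFS headers are in scope; volume_state and the helper functions are made up for illustration and are not the actual Windows or unix CM code.

/* Sketch of the "VNOVOL means the volume is not present" reading.  The
 * V* names are the Vol-package special error codes from the AFS headers;
 * volume_state and the helpers below are invented for illustration. */

struct volume_state;                                  /* hypothetical per-volume CM state */

void mark_volume_absent(struct volume_state *v);      /* remember "does not exist", stop VLDB polling */
void schedule_vldb_lookup(struct volume_state *v);    /* refresh location information */
void defer_retry(struct volume_state *v, int secs);   /* try the request again later */

void
handle_volume_error(struct volume_state *v, int code)
{
    switch (code) {
    case VNOVOL:
        /* Per the interpretation above: the volume is not present, so
         * remember that instead of re-querying the VLDB every 2 seconds. */
        mark_volume_absent(v);
        break;

    case VMOVED:
        /* Location data really is stale; this is the case that genuinely
         * needs a VLDB lookup before the retry. */
        schedule_vldb_lookup(v);
        defer_retry(v, 2);
        break;

    case VOFFLINE:
    case VSALVAGE:
    case VNOSERVICE:
        /* The volume exists but isn't usable right now; retry against the
         * same location later without generating VLDB traffic. */
        defer_retry(v, 2);
        break;

    case VBUSY:
        /* Temporarily unavailable; retry shortly, as the spec already says. */
        defer_retry(v, 2);
        break;

    default:
        /* Not a Vol-package special error; handled elsewhere. */
        break;
    }
}

The pre-2005 Windows client behavior I described above effectively pushed all five of those codes through the VMOVED arm, which is exactly where the VLDB load came from.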
