On Mon, 17 Feb 2014 14:13:00 -0600 Andrew Deason <adea...@sinenomine.net> wrote:
> On Thu, 13 Feb 2014 00:13:41 -0600 > Tracy Di Marco White <genda...@gmail.com> wrote: > > > VolserLog (yesteday) > > Wed Feb 12 01:04:48 2014 SYNC_ask: length field in response inconsistent > > on circuit 'FSSYNC' command 65543, 200 != 292 > > Thanks for that. Command 65543 is FSYNC_VOL_QUERY, which we don't use > in the volserver, so this is a little confusing. That would be confusing, except that's wrong. We do send that command for 'vos listvol' processing and similar requests; that's just my mistake. Anyway, to clarify a few points after some off-list communication: These messages: Sat Feb 8 00:02:42 2014 _VLockFd: conflicting lock held on fd 222, offset 537011871 by pid 21378 (locktype=1) Sat Feb 8 00:02:42 2014 VAttachVolume: another program has vol 537011871 locked [...] Sat Feb 8 00:02:42 2014 VPreattachVolumeByVp_r: volume 537011871 not in quiescent state (state 2 flags 0x18) Indicate a situation where a volume gets "stuck" in an inaccessible state. This was a bug fixed in 1.6.6, commit 8ff592b56d2ed74a01da8090637e06cab2cc1c19. That doesn't prevent the messages from appearing and isn't the root cause of what's going on (as far as I can tell), but it would make it so the volume recovers and would be accessible afterwards. As for what is triggering this behavior, as far as I can tell it's just a normal short read from the fssync socket. That is, one side sends 292 bytes, and the other receives 200 bytes (and then would receive another 92 bytes if it continued reading). The code currently expects that reads and writes on the socket will complete with a single read/write call. We "should" account for the case where that does not happen, but I'm not sure of why that would ever indeed not happen. (These are all very short requests on a unix domain socket whose internal buffer cannot be anywhere near full.) So, I assume it is some quirk of the networking code in netbsd that doesn't seem to appear anywhere else. We should be handling this, but as far as I can tell this only happens on netbsd, because I'm not aware of any other instance of this appearing. An easy way to fix this would be to use SOCK_SEQPACKET sockets, but I'm not sure if those are really portable (as in, if they actually work correctly on all platforms over unix domain sockets). What would be more robust is just a simple read/write loop on both ends with some small timeout, which isn't difficult. So that's what I'll do whenever I can. Tracy has a quickfix patch (just to see if the analysis above was vaguely correct), but I need to implement it a bit more "correctly" for an actual release. -- Andrew Deason adea...@sinenomine.net _______________________________________________ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info