> There was a change made to the delegation monitor for setattr that went
> in snv_76. Previously, setattr only caused a recall when the size,
> group, owner, or permissions of the file changed. Now, any setattr will
> cause a recall.

Hmm, yes, using a "chmod 777" instead of "touch" on the nfs4 client gets
the same 3 minute timeout... And this probably explains why the issue
isn't reproducible on the sparc box nfs4 server, which is running snv_72.

> If the client in your example had a write delegation, the recall would
> not happen and you wouldn't see the "hang".

The client is actually running an imap server, which apparently is trying
to find out the format of the user's mailbox by opening the mailbox file
O_RDONLY, reading and analysing the first few bytes and closing the
mailbox file. After imapd has figured out the mailbox's format it
apparently tries to restore the mailbox access time using utimes(2).

These mailbox files were located on an nfs4 server on a remote machine,
and the imap daemon was very slow opening the mailbox files from the nfs4
filesystem.

I think my procedure with "date", "cat" and "touch" reproduces the issue
that I was observing with the imapd.

> The hang is caused by the
> client not returning the delegation and the monitor holding the setattr
> operation until the delegation is returned or, in this case, revoked.
> After waiting a lease period (90 seconds), the server sends another
> cb_recall. When the delegation still isn't returned after another lease
> period, the server revokes the delegation and the setattr operation
> completes.
>
> Since the client doesn't return the delegation until the setattr
> completes, I would speculate that the client can't perform the
> delegation return over the wire until the previous over the wire
> operation (setattr) completes.
>
> I'm not sure if there is a dependency that needs to be fixed, or if the
> client should just return the delegation before performing the setattr
> operation. However, there was another change to the monitors which got
> putback to build snv_80 which may solve the problem (though I need to
> test that).
> The monitors have been changed to return EAGAIN when the
> caller doesn't want to block while waiting for the delegation to be
> returned. The changes were put in for the NFSv3 and NFSv2 servers, so
> I'm not sure it will fix the problem for v4. If it doesn't, it is just
> a simple code change to make it work (adding a flag to caller context
> and have the server check the return from VOP_SETATTR).

Ahh, the "cc_flags" member in the caller_context_t was added on
December 5th with the putback for "PSARC 2007/632 Caller context flags"?

Isn't the new ct.cc_flags member uninitialized in nfs4_srv.c, in function
do_rfs4_op_setattr()? And in rfs4_op_read(), rfs4_op_write(),
rfs4_createfile(), rfs4_do_open(), ... ?

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/nfs/nfs4_srv.c#4891

   4891         /* Check stateid only if size has been set */
   4892         if (sarg.vap->va_mask & AT_SIZE) {
   4893                 trunc = (sarg.vap->va_size == 0);
   4894                 status = rfs4_check_stateid(FWRITE, cs->vp, stateid,
   4895                     trunc, &cs->deleg, sarg.vap->va_mask & AT_SIZE, &ct);
   4896                 if (status != NFS4_OK)
   4897                         goto done;
   4898         } else {
   4899                 ct.cc_sysid = 0;
   4900                 ct.cc_pid = 0;
   4901                 ct.cc_caller_id = nfs4_srv_caller_id;
                        <<<<<<<<<<< ct.cc_flags ????
   4902         }

Maybe in those cases where I couldn't reliably reproduce the 3 min "hang"
for the touch command, the uninitialized ct.cc_flags on the server had
the low bit set, so that recall_all_delegations() didn't wait and
returned NFS4ERR_DELAY, the monitor returned that as EAGAIN, ...

This message posted from opensolaris.org
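
For anyone who wants to try this without an imap server, here is a minimal
sketch of the access pattern described above (open O_RDONLY, read a few
bytes, close, restore the times with utimes(2)). It is my reconstruction,
not code from any imapd: probe_mailbox() is a hypothetical name, and it
restores the times at one-second resolution only.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from any imapd source): peek at the
 * first bytes of a mailbox file, then put the access and modification
 * times back the way they were. Returns the number of bytes read, or
 * -1 on error.
 */
static ssize_t
probe_mailbox(const char *path, char *buf, size_t len)
{
	struct stat st;
	struct timeval tv[2];
	ssize_t n;
	int fd;

	/* Remember the current times before touching the file. */
	if (stat(path, &st) != 0)
		return (-1);

	/* Read-only open; on an NFSv4 mount this may earn a read delegation. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);
	n = read(fd, buf, len);
	(void) close(fd);
	if (n < 0)
		return (-1);

	/* Restore atime and mtime; over NFS this goes out as a SETATTR. */
	tv[0].tv_sec = st.st_atime;
	tv[0].tv_usec = 0;
	tv[1].tv_sec = st.st_mtime;
	tv[1].tv_usec = 0;
	if (utimes(path, tv) != 0)
		return (-1);

	return (n);
}
```

Calling probe_mailbox() on a file under the nfs4 mount should behave like
the "cat" followed by "touch" sequence; whether the utimes() call stalls
presumably depends on the server build, as discussed above.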