Hi,

What exactly was updated to OpenSolaris 2009.06 SU7: the NFS server, the
NFS clients, or both? Which version was running before the update to
2009.06 SU7? You also mention that this is seen on Solaris 10u8. Can you
confirm that it was not seen on s10u7?

It is not obvious if this is a bug on the client or on the server. The 
snoop and the nfs diagnostic messages from syslog do not show enough to 
get closer to the root cause.
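
If a longer capture is available, it may help to see how often each reply
status occurs. Below is a rough Python sketch, not a finished tool, that
tallies NFSv4 reply status codes from snoop's one-line summary text. It
assumes the summary format seen in the quoted trace ("NFS R 4 (op) STATUS
..."); the function name tally_replies and the regular expression are my
own illustrative choices. Only the first physical line of each reply is
needed, so mail-wrapped output works as well.

```python
import re
from collections import Counter

# Assumed format of a snoop NFSv4 reply summary line, for example:
#   sun -> local-s11    NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH NFS4_OK
# Group 1 is the operation label in parentheses, group 2 the overall status.
REPLY_RE = re.compile(r"NFS R 4 \((\w+)\s*\)\s+(NFS4\w+)")


def tally_replies(lines):
    """Count (operation, status) pairs over NFSv4 reply summary lines.

    Hypothetical helper: call lines, TCP lines and anything else that does
    not look like an NFSv4 reply summary are ignored.
    """
    counts = Counter()
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def format_tally(counts):
    """Render the tally as sorted text lines, one per (operation, status)."""
    return [
        f"{op:<10} {status:<26} {n}"
        for (op, status), n in sorted(counts.items())
    ]
```

Usage would be along the lines of
format_tally(tally_replies(open("snoop.txt"))), where snoop.txt is a
placeholder name for the saved text output of snoop.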

In s10u8, patch 141734-02 made many changes to both the NFS server and
the NFS client, so one possibility is that the problem is related to
that patch. On the other hand, only a few fixes from 141734-02 were
integrated into the OpenSolaris 2009.06 updates.

Thanks,
Pavel

On 02/08/10 13:56, Udo Grabowski wrote:
> Hello,
>
> We have the same problem here after updating to OpenSolaris 2009.06 SU7
> (+IDR30+35) on a production system with 40 clients.
> About every second day one of our servers loses state and gives the
> snoop output below (numerous reopen attempts failing with an
> NFS4ERR_NO_GRACE error). All clients hang, since they don't get their
> data; only a restart of the nfs/server process cures this problem.
> This seems to be a serious bug in the NFS module, introduced recently
> in both the OpenSolaris support line and Solaris 10 u8.
>
>    sun -> local-s11    NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>    local-s11 -> sun   NFS C 4 (reopen      ) PUTFH FH=B5F3 OPEN OT=NC SQ=14571 CT=P DT=N AC=W DN=N OO=0955 GETFH GETATTR 10011a b0a23a
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE
>            ? -> (multicast)  ETHER Type=0001 (LLC/802.3), size=52 bytes
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0A4C GETFH GETATTR 10011a b0a23a
>    local-s11 -> sun    TCP D=2049 S=1023 Ack=293438085 Seq=4069541650 Len=0 Win=64074 Options=<nop,nop,tstamp 25776951 188898831>
>    local-s11 -> sun    TCP D=2049 S=1023 Ack=293438153 Seq=4069541650 Len=0 Win=64074 Options=<nop,nop,tstamp 25776951 188898907>
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=AA7B OPEN imksuns11:1.log OT=NC SQ=14572 CT=N AC=W DN=N OO=0955 GETFH GETATTR 10011a b0a...
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N OO=0A4C GETFH GETATTR 10011a b0a23a
>    sun -> local-s11    TCP D=1023 S=2049 Ack=4069542150 Seq=293438289 Len=0 Win=64074 Options=<nop,nop,tstamp 188898915 25776951>
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=1A6F:7271 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0A0F GETFH GETATTR 10011a b0a23a
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N OO=0A0F GETFH GETATTR 10011a b0a23a
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=1B35:7271 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0AEE GETFH GETATTR 10011a b0a23a
>    sun -> local-s11    NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE
>    local-s11 -> sun    NFS C 4 (reopen      ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N O
>
> NFS4ERR_NO_GRACE
>       A reclaim of client state was attempted in circumstances in 
>       which the server cannot guarantee that conflicting state has 
>       not been provided to another client.  This can occur because 
>       the reclaim has been done outside of the grace period of the
>       server, after the client has done a RECLAIM_COMPLETE operation,
>       or because previous operations have created a situation in which
>       the server is not able to determine that a reclaim-interfering
>       edge condition does not exist.
>   
