Bug#632074: linux-2.6: NFS4 client sends NULL calls in the TCP session

2011-11-24 Thread franck . eyraud

On 24/11/2011 06:31, Jonathan Nieder wrote:

Hi Franck,

Franck Eyraud wrote:

On 29/06/2011 15:38, Bastian Blank wrote:

linux-...@vger.kernel.org.

I will ask them also.

Did you get in touch with linux-nfs@?  If so, do you have the
date and subject or message-id of a message so we can track
the discussion and conclusion?

Thanks,
Jonathan


Hi Jonathan,

Thank you for your message.
I'm sorry I didn't follow up sooner. The answer from the linux-nfs 
list was more or less that the problem is on the NetApp OnTap side, which 
may be logging too many messages.


Here is the thread where I got in touch with Thomas Haynes and Trond 
Myklebust from NetApp.

http://article.gmane.org/gmane.linux.nfs/41786

They kindly offered to analyze the traces I had already sent to 
NetApp Technical Support.


Here is what Thomas Haynes wrote to me (off-list):
-
Recapping what I see in the case notes:

From the case notes, it appears the filer is objecting to the v4 NULL probes
because the GSS context is no longer valid.

My guess is that we don't even know it is a NULL probe at this point and
are kicked out at a higher level. Even if I am wrong, we probably need to
process the context in order to construct a reply.

The client changed behaviour from Debian 4 to both Debian 5 and Debian 6.
I have no clue what changed in there.

It appears that the customer doesn't mind this occurring and would be happy
if we could dial down the number of messages logged? I.e., it appears too
chatty?

I don't see in the notes that a request was made to reduce the log messages.
I.e., customer support probably focused on v4 NULL probes.

If reducing the cadence of the messages would help, I can file a bug
on this. Note, we do occasionally want to see these as there may
be other times when this legitimately occurs.
--

We recently tested with a Linux (Ubuntu) client running kernel 3.0 and 
the problem still appears. In the meantime, the OnTap server software 
has been upgraded to version 7.3.6P1.


So we are basically still not sure where this comes from (client or 
server?), and we are still experiencing the problem (it seems the bug 
they opened on the NetApp side isn't solved yet).
I don't have the expertise to analyze the problem further, but I can 
provide traces if someone wants to look into it. It could also be a 
Kerberos 5 issue...


Hope that helps,

Franck Eyraud




--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4ece05d9.6080...@jrc.ec.europa.eu



Bug#632074: linux-2.6: NFS4 client sends NULL calls in the TCP session

2011-06-29 Thread Franck Eyraud
Package: linux-2.6
Severity: important

After a kernel upgrade, our Debian machines connected through NFSv4 to a NetApp 
filer cause the NAS to log warning messages:
Client 1XX.1XX.2XX.73 has an authentication error 2
Client 1XX.1XX.2XX.73 is sending bad rpc requests with error: RPC version mismatch or authentication error(73)

This does not prevent the use of the file system, but seems to be a violation 
of the NFSv4 protocol (see below).

This is known to happen with the following versions of linux-image:

2.6.32-5-amd64
2.6.32-5-686-bigmem
2.6.29-xs5.5.0.17

and is known not to happen with the following versions:
2.6.18.xs4.0.1.900.5799
2.6.26-2-686

Some of the systems listed above are virtual machines on XenServer 5.5 or 5.6 
hosts, but some are physical machines, so I rule out a problem specific to 
the Xen version of the kernel.

We filed a ticket with NetApp support, and after analysing the TCP trace 
they attributed the problem to the client side.

Below is NetApp's analysis of this issue, with a description of the 
protocol exchange that triggers it.

---
Tue May 24 10:05:59 MEST [vcid@s-jrciprna004p: nfsd.rpc.request.bad:warning]: 
Client 1XX.1XX.2XX.73 is sending bad rpc requests with error: RPC version mismatch or authentication error(73)
 
We looked in the code and the (73) has no significance here and is simply the 
error code number for "RPC version mismatch or authentication error".
 
What we see is that the following occurs at the time of these errors:
- The client has an established TCP session on which it does NFSv4.
- The NFSv4 calls use Kerberos.
- On that TCP session, the client occasionally does a NULL call.
- The filer rejects it with an authentication error (auth state 2, client must 
begin new session)
- The client does a new NULL call on a separate TCP session without a GSS 
context.
- The filer responds and a new context is established.
- The client continues on the original TCP session with the new context.
 
This explains why no side effect is seen: the client simply establishes a new 
context and continues as if nothing had happened.
We have checked through the trace for vlan 240 and the pattern is the same 
throughout: the error always happens for NULL calls only (occasionally two 
replies may be sent in the same TCP payload, but the error is then always on 
the NULL reply).
We know that some Debian kernels do not exhibit this problem at any time, but 
others do. This (along with the problem being tied to NULL calls only) 
suggests to us that this is due to client side behaviour.
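
For readers following along, the "authentication error 2" in the filer log and the "auth state 2, client must begin new session" above can be read against the auth_stat values of ONC RPC. The sketch below is only an illustration: the numeric values are a subset of those listed in RFC 5531, while the assumption that the filer's "2" is this auth_stat value comes from the description in the analysis, not from the filer's code. The helper name describe_auth_stat is made up for this example.

```python
from enum import IntEnum


class AuthStat(IntEnum):
    """Subset of the auth_stat values listed in RFC 5531 (ONC RPC v2)."""
    AUTH_OK = 0
    AUTH_BADCRED = 1
    AUTH_REJECTEDCRED = 2
    AUTH_BADVERF = 3
    AUTH_REJECTEDVERF = 4
    AUTH_TOOWEAK = 5
    AUTH_INVALIDRESP = 6
    AUTH_FAILED = 7
    RPCSEC_GSS_CREDPROBLEM = 13
    RPCSEC_GSS_CTXPROBLEM = 14


# Short descriptions for the values relevant to this report.
_MEANING = {
    AuthStat.AUTH_OK: "success",
    AuthStat.AUTH_BADCRED: "bad credential",
    AuthStat.AUTH_REJECTEDCRED: "client must begin new session",
    AuthStat.RPCSEC_GSS_CREDPROBLEM: "no credentials for user",
    AuthStat.RPCSEC_GSS_CTXPROBLEM: "problem with context",
}


def describe_auth_stat(code: int) -> str:
    """Hypothetical helper: turn a numeric auth_stat into a readable string."""
    try:
        stat = AuthStat(code)
    except ValueError:
        return f"unknown auth_stat {code}"
    return f"{stat.name} ({_MEANING.get(stat, 'no description')})"


print(describe_auth_stat(2))
```

Under that assumption, the value 2 logged by the filer corresponds to AUTH_REJECTEDCRED, which matches the "client must begin new session" wording used in the analysis.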

Anyway, we tried to check for the first occurrence of the error, which warrants 
some chronology. We'll do references per clock second for ease.
- The first client call is at 10:04:16 in an established NFS mount.
- The initial part of the trace, the client only uses TCP port 1006.
- The client uses the same GSS context, with the exception of a SETCLIENTID and 
a SETCLIENTID_CONFIRM call.
- At 10:05:00 the client tears down four GSS sessions (used for Kerberos) using 
RPCSEC_GSS_DESTROY in an NFSv4 NULL call. This is done from TCP port 1006 but 
for four different contexts. None of these have been used in the trace at that 
point.
- The client continues with more calls on port 1006 using the same GSS 
context.
- Still at 10:05:00 (frame 1982792), the client uses an NFSv4 call to do a 
RPCSEC_GSS_INIT to establish a new GSS context.
- The client continues using the new GSS context and does not reuse the old 
context.
- The sequence described above on the NULL calls starts.
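
The chronology above amounts to tracking each GSS context through the trace and noting which ones are destroyed without ever having carried a data call. A minimal sketch of that bookkeeping follows. Everything in it is hypothetical: the GssEvent record, the context labels and the frame numbers are invented for illustration and are not taken from the actual capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GssEvent:
    frame: int      # hypothetical frame number, used only for ordering
    port: int       # client TCP source port
    gss_proc: str   # "DATA", "INIT" or "DESTROY"
    context: str    # made-up label standing in for a GSS context handle


def destroyed_without_use(events):
    """Return labels of contexts destroyed before any DATA call used them."""
    used = set()
    unused_destroyed = []
    for ev in sorted(events, key=lambda e: e.frame):
        if ev.gss_proc == "DATA":
            used.add(ev.context)
        elif ev.gss_proc == "DESTROY" and ev.context not in used:
            unused_destroyed.append(ev.context)
    return unused_destroyed


# Invented events mirroring the shape of the chronology, not the real trace.
events = [
    GssEvent(1, 1006, "DATA", "ctx-A"),      # established mount, context in use
    GssEvent(2, 1006, "DESTROY", "ctx-B"),   # four teardowns of unused contexts
    GssEvent(3, 1006, "DESTROY", "ctx-C"),
    GssEvent(4, 1006, "DESTROY", "ctx-D"),
    GssEvent(5, 1006, "DESTROY", "ctx-E"),
    GssEvent(6, 1006, "DATA", "ctx-A"),      # more calls on the same context
    GssEvent(7, 1006, "INIT", "ctx-F"),      # new context established
    GssEvent(8, 1006, "DATA", "ctx-F"),      # client switches to the new context
]

print(destroyed_without_use(events))
```

With these invented events, the four contexts torn down at the start are reported, while the context that was in use (ctx-A here) is never explicitly destroyed, which is the pattern the analysis points out.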
 
Looking closer at these steps, we notice something important in the NULL calls.
Above, the client destroyed four GSS contexts that were not used during the 
trace. However, it did not destroy the GSS context it was using for a while 
there.
 
However, we now note the client actually does a RPCSEC_GSS_DESTROY in each of 
the NFSv4 NULL calls where we respond with an authentication error. As the 
error indicates that the client has to begin a new session, this seems like a 
reasonable response to the call.
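
To make the distinction between the different NULL calls in this analysis concrete: RPCSEC_GSS control messages travel in NULL procedure calls and are told apart by the gss_proc field of the credential. The sketch below uses the gss_proc values from RFC 2203 and the RPCSEC_GSS flavor number 6. The function classify_null_call and its return strings are made up for this example, and it does not parse real packets.

```python
from enum import IntEnum

RPCSEC_GSS_FLAVOR = 6  # auth flavor number assigned to RPCSEC_GSS


class GssProc(IntEnum):
    """gss_proc values from RFC 2203."""
    RPCSEC_GSS_DATA = 0
    RPCSEC_GSS_INIT = 1
    RPCSEC_GSS_CONTINUE_INIT = 2
    RPCSEC_GSS_DESTROY = 3


def classify_null_call(flavor, gss_proc=None):
    """Hypothetical helper: describe what a NULL call is being used for."""
    if flavor != RPCSEC_GSS_FLAVOR:
        # e.g. the NULL call on a separate TCP session without a GSS context
        return "plain NULL call (no GSS context)"
    proc = GssProc(gss_proc)
    if proc in (GssProc.RPCSEC_GSS_INIT, GssProc.RPCSEC_GSS_CONTINUE_INIT):
        return "context creation"
    if proc is GssProc.RPCSEC_GSS_DESTROY:
        return "context destruction"
    return "NULL call protected by an existing context"


print(classify_null_call(RPCSEC_GSS_FLAVOR, 3))
```

In these terms, the calls that the filer answers with an authentication error are the "context destruction" ones, and the call that establishes the new context is a "context creation" one.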
 
So to summarise:
- The filer logs these errors when the client destroys a GSS context.
- The error message is a logical response.
The decision to tear down the GSS context is with the client. So this would 
seem to be a client side issue after all, which just happens to get logged on 
the filer.
--



-- System Information:
Debian Release: 5.0.8
  APT prefers oldstable
  APT policy: (500, 'oldstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-5-686-bigmem (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash



-- 
Archive: 
http://lists.debian.org/20110629114800.2859.34873.report...@s-jrciprcid73v.cidsn.jrc.it



Bug#632074: linux-2.6: NFS4 client sends NULL calls in the TCP session

2011-06-29 Thread Franck Eyraud


On 29/06/2011 15:38, Bastian Blank wrote:

According to this, a response to a NULL request simply must not return
an authentication error. Also no errata exist for this part.

As I'm not an NFS expert, if someone can prove that the problem is on 
the server side, despite the analysis from NetApp, then I'll send the 
case back to them.




You saw that the versions of the ones with the behaviour are strictly
higher than the others?

Yes I did, that's why I included this information. I guess some 
behaviour changed in some version upgrade.

Please fix your MUA, it produces too long lines, and ask on
linux-...@vger.kernel.org.

I will ask them also.

Thanks



--
Archive: http://lists.debian.org/4e0b33c6.8040...@jrc.ec.europa.eu