Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-04 Thread Moritz Mühlenhoff
Jeffrey B. Green wrote:
> (I've delayed a few weeks in reporting this bug and so unsure whether other
> work on it has already occurred. However I could not find anything
> relevant.)

A very similar behaviour can be reproduced with i386 as well (with the Etch 
versions of the Linux kernel and nfs-kernel-server):
We have a setup where a block device is replicated with drbd. On this device 
an ext3 partition has been created, which is exported over NFS. Reading from 
the NFS share works fairly well; however, concurrent writes to the share lead 
to lockups. The client processes copying data to the share stall, and 
sometimes the system locks up, requiring a hard reboot.

This is reproducible with both NFS over UDP and NFS over TCP.

I originally assumed this was only triggerable with the block dev on a drbd 
device, but per Jeffrey's mail it seems to be a more widespread problem.

Cheers,
Moritz


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-04 Thread Jeffrey B. Green
Actually, the incident is primarily on i386 machines. Sorry about the 
report being a bit deceptive by submitting from my standard work machine 
which is a powerpc. The powerpc kernel may not exhibit this behavior.


-jeff

Moritz Mühlenhoff wrote:
> Jeffrey B. Green wrote:
> > (I've delayed a few weeks in reporting this bug and so unsure whether other
> > work on it has already occurred. However I could not find anything
> > relevant.)
>
> A very similar behaviour can be reproduced with i386 as well (with the Etch
> versions of the Linux kernel and nfs-kernel-server):
> We have a setup where a block device is replicated with drbd. On this device
> an ext3 partition has been created, which is exported over NFS. Reading from
> the NFS share works fairly well; however, concurrent writes to the share lead
> to lockups. The client processes copying data to the share stall, and
> sometimes the system locks up, requiring a hard reboot.
>
> This is reproducible with both NFS over UDP and NFS over TCP.
>
> I originally assumed this was only triggerable with the block dev on a drbd
> device, but per Jeffrey's mail it seems to be a more widespread problem.
>
> Cheers,
> Moritz




Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-07 Thread Moritz Mühlenhoff
severity 440676 important
thanks

Jeffrey B. Green:
> > A very similar behaviour can be reproduced with i386 as well (with the
> > Etch versions of the Linux kernel and nfs-kernel-server):
> > We have a setup where a block device is replicated with drbd. On this
> > device an ext3 partition has been created, which is exported over NFS.
> > Reading from the NFS share works fairly well; however, concurrent writes
> > to the share lead to lockups. The client processes copying data to the
> > share stall, and sometimes the system locks up, requiring a hard
> > reboot.
> >
> > This is reproducible with both NFS over UDP and NFS over TCP.

> Actually, the incident is primarily on i386 machines. Sorry about the
> report being a bit deceptive by submitting from my standard work machine
> which is a powerpc. The powerpc kernel may not exhibit this behavior.

Could you please test the attached patch from RHEL5 and report if it resolves
the problem for you?

Instructions on how to compile a modified kernel can be found at
http://wiki.debian.org/DebianKernelCustomCompilation

Cheers,
Moritz
-- 
Moritz Muehlenhoff [EMAIL PROTECTED]            fon: +49 421 22 232- 0
Development        Linux for Your Business     fax: +49 421 22 232-99
Univention GmbH    http://www.univention.de/   mobil: +49 175 22 999 23

From: Steve Dickson <[EMAIL PROTECTED]>
Subject: [RHEL5][PATCH] NFS: system stall on NFS stress under high memory pressure
Date: Fri, 15 Dec 2006 08:41:22 -0500
Bugzilla: 213137
Message-Id: <[EMAIL PROTECTED]>
Changelog: NFS: system stall on NFS stress under high memory pressure


The following 3 attached patches solve an NFS hang that a number
of upstream and RHEL5 users saw... The hang, which was introduced
in the 2.6.11 kernel, was caused by an RPC task continuously
getting ignored due to a certain combination of task states.
This state combination only seems to happen when there
is memory pressure.

The upstream email thread is at:
http://sourceforge.net/mailarchive/message.php?msg_id=37252769

The bz is:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213137

These patches have been tested by a number of upstream people,
as well as locally on a RHEL5 B2 kernel...

steved.

The sunrpc scheduler contains a race condition that can let an RPC
task end up being neither running nor on any wait queue. The race takes
place between rpc_make_runnable (called from rpc_wake_up_task) and
__rpc_execute under the following condition:
First __rpc_execute calls tk_action which puts the task on some wait
queue. The task is dequeued by another process before __rpc_execute
continues its execution. While executing rpc_make_runnable exactly after
setting the task `running' bit and before clearing the `queued' bit
__rpc_execute picks up execution, clears `running' and subsequently
both functions fall through, both under the false assumption somebody
else took the job.

Swapping rpc_test_and_set_running with rpc_clear_queued in
rpc_make_runnable fixes that hole. This introduces another possible
race condition that can be handled by checking for `queued' after
setting the `running' bit.

Bug noticed on a 4-way x86_64 system under XEN with an NFSv4 server
on the same physical machine, apparently one of the few ways to hit
this race condition at all.
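
The interleaving described above can be replayed by hand in a small
user-space model. The sketch below is not kernel code: the toy_ names, the
plain unsigned state word and the simplified executor steps are assumptions
made for illustration. Only the order of the two bit operations in the waker
follows the description and the patch. It replays one fixed interleaving and
counts how many parties still believe they are responsible for the task.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical names): two state bits per task. */
enum { TOY_RUNNING = 1u << 0, TOY_QUEUED = 1u << 1 };

struct toy_task {
	unsigned state;	/* plain integer: the interleaving is replayed by hand */
};

static bool toy_test_and_set_running(struct toy_task *t)
{
	bool was_set = (t->state & TOY_RUNNING) != 0;

	t->state |= TOY_RUNNING;
	return was_set;
}

static void toy_clear_running(struct toy_task *t) { t->state &= ~TOY_RUNNING; }
static void toy_clear_queued(struct toy_task *t)  { t->state &= ~TOY_QUEUED; }
static bool toy_is_queued(const struct toy_task *t)
{
	return (t->state & TOY_QUEUED) != 0;
}

/*
 * Replay one interleaving of an "executor" (the loop that put the task to
 * sleep) and a "waker" (the make-runnable path). Returns how many of the two
 * end up owning the task: 0 means the task is lost, 1 is the healthy case.
 */
int toy_replay(bool fixed_order)
{
	struct toy_task t = { TOY_RUNNING | TOY_QUEUED };
	bool was_running = false;
	int owners = 0;

	/* Executor: lockless check finds the task queued. */
	bool saw_queued = toy_is_queued(&t);

	/* Waker, first half: the two orderings differ here. */
	if (fixed_order)
		toy_clear_queued(&t);
	else
		was_running = toy_test_and_set_running(&t);

	/* Executor, remainder: give up only if the task still looks queued. */
	if (!saw_queued) {
		owners++;
	} else {
		toy_clear_running(&t);
		if (!toy_is_queued(&t) && !toy_test_and_set_running(&t))
			owners++;
	}

	/* Waker, second half. */
	if (fixed_order) {
		was_running = toy_test_and_set_running(&t);
		if (!was_running) {
			if (toy_is_queued(&t))
				toy_clear_running(&t);	/* raced: task was re-queued */
			else
				owners++;
		}
	} else {
		toy_clear_queued(&t);
		if (!was_running)
			owners++;
	}
	return owners;
}

/* Self-check for the two orderings under this one interleaving. */
void toy_self_check(void)
{
	assert(toy_replay(false) == 0);	/* old order: nobody owns the task */
	assert(toy_replay(true) == 1);	/* new order: exactly one owner */
}
```

In this model, toy_replay(false) ends with neither side owning the task,
which corresponds to the stalled RPC task, while toy_replay(true) ends with
exactly one owner. Other interleavings are not covered by this sketch.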

Cc: Trond Myklebust <[EMAIL PROTECTED]>
Cc: J. Bruce Fields <[EMAIL PROTECTED]>
Signed-off-by: Christophe Saout <[EMAIL PROTECTED]>

--- linux-2.6.18.i686/net/sunrpc/sched.c.orig	2006-09-19 23:42:06.0 -0400
+++ linux-2.6.18.i686/net/sunrpc/sched.c	2006-12-05 07:28:24.251992000 -0500
@@ -302,13 +302,15 @@ EXPORT_SYMBOL(__rpc_wait_for_completion_
  */
 static void rpc_make_runnable(struct rpc_task *task)
 {
-	int do_ret;
-
 	BUG_ON(task->tk_timeout_fn);
-	do_ret = rpc_test_and_set_running(task);
 	rpc_clear_queued(task);
-	if (do_ret)
+	if (rpc_test_and_set_running(task))
+		return;
+	/* We might have raced */
+	if (RPC_IS_QUEUED(task)) {
+		rpc_clear_running(task);
 		return;
+	}
 	if (RPC_IS_ASYNC(task)) {
 		int status;
 

Fix a second potential rpc_wakeup race...

Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>
---

--- linux-2.6.18.i686/fs/nfs/nfs4proc.c.orig	2006-12-06 10:00:51.31643 -0500
+++ linux-2.6.18.i686/fs/nfs/nfs4proc.c	2006-12-12 12:07:21.607308000 -0500
@@ -636,7 +636,7 @@ static int _nfs4_proc_open_confirm(struc
 		smp_wmb();
 	} else
 		status = data->rpc_status;
-	rpc_release_task(task);
+	rpc_put_task(task);
 	return status;
 }
 
@@ -742,7 +742,7 @@ static int _nfs4_proc_open(struct nfs4_o
 		smp_wmb();
 	} else
 		status = data->rpc_status;
-	rpc_release_task(task);
+	rpc_put_task(task);
 	if (status != 0)
 		return status;
 
@@ -3059,7 +3059,7 @@ static int _nfs4_proc_delegreturn(struct
 		if (status == 0)
 			nfs_post_op_update_inode(inode, &data->fattr);
 	}
-	rpc_release_task(task);
+	rpc_put_task(task);
 	return status;
 }
 
@@ -3306,7 +3306,7 @@ static int nfs4_proc_unlck(struct nfs4_s

Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-09 Thread Jeffrey B. Green

Moritz,

Moritz Mühlenhoff wrote:
> severity 440676 important
> thanks
>
> Jeffrey B. Green:
> > > A very similar behaviour can be reproduced with i386 as well (with the
> > > Etch versions of the Linux kernel and nfs-kernel-server):
> > > We have a setup where a block device is replicated with drbd. On this
> > > device an ext3 partition has been created, which is exported over NFS.
> > > Reading from the NFS share works fairly well; however, concurrent writes
> > > to the share lead to lockups. The client processes copying data to the
> > > share stall, and sometimes the system locks up, requiring a hard
> > > reboot.
> > >
> > > This is reproducible with both NFS over UDP and NFS over TCP.
>
> > Actually, the incident is primarily on i386 machines. Sorry about the
> > report being a bit deceptive by submitting from my standard work machine
> > which is a powerpc. The powerpc kernel may not exhibit this behavior.
>
> Could you please test the attached patch from RHEL5 and report if it resolves
> the problem for you?
>
> Instructions on how to compile a modified kernel can be found at
> http://wiki.debian.org/DebianKernelCustomCompilation
>
> Cheers,
> Moritz


(Resending to document in the bug trail.)

The patch interacts with a previous patch. When I explicitly 
patch the kernel source first and then try to build, I get:


tail make.out
(+) OK    bugfix/2.6.16.30
(+) OK    bugfix/2.6.16.31
(+) OK    bugfix/2.6.16.32
(+) OK    bugfix/2.6.16.33
(+) OK    bugfix/2.6.16.34
(+) OK    bugfix/2.6.16.35
(+) FAIL  bugfix/2.6.16.37
make[1]: *** [debian/stamps/source] Error 1
make[1]: Leaving directory `/home/jeff/linux-2.6-2.6.18.dfsg.1'
make: *** [binary-arch-i386-none-686-real] Error 2

-jeff




Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-03 Thread Jeffrey B. Green
Package: nfs-kernel-server
Version: 1:1.0.10-6+etch.1
Severity: normal


The setup is having nfs mounts on various machines and backups to "non-local"
mounts. Having cron jobs starting on more than one machine and backing up to
the other while the other is backing up to it causes various service freezes
which need a system reboot the next morning to fix. Oftentimes the reboot is
a manual "button" reboot since the reboot command will not follow through at
that point.

My workaround is to just be sure to schedule the backups so that they do not
overlap and that seems to "fix" the problem. However I do occasionally get
bitten when complications create an overlap.

(I've delayed a few weeks in reporting this bug and so unsure whether other
work on it has already occurred. However I could not find anything relevant.)

-- System Information:
Debian Release: 4.0
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: powerpc (ppc)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-5-powerpc
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages nfs-kernel-server depends on:
ii  libc6    2.3.6.ds1-13etch2               GNU C Library: Shared libraries
ii  libcomer 1.39+1.40-WIP-2006.11.14+dfsg-2 common error description library
ii  libgssap 0.10-4                          A mechanism-switch gssapi library
ii  libkrb53 1.4.4-7etch2                    MIT Kerberos runtime libraries
ii  libnfsid 0.18-0                          An nfs idmapping library
ii  librpcse 0.14-2                          allows secure rpc communication us
ii  libwrap0 7.6.dbs-13                      Wietse Venema's TCP wrappers libra
ii  lsb-base 3.1-23.2etch1                   Linux Standard Base 3.1 init scrip
ii  nfs-comm 1:1.0.10-6+etch.1               NFS support files common to client
ii  ucf      2.0020                          Update Configuration File: preserv

nfs-kernel-server recommends no packages.

-- no debconf information





Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-09-03 Thread Steinar H. Gunderson
reassign 440676 linux-image-2.6.18-5-powerpc
thanks

On Mon, Sep 03, 2007 at 11:38:31AM -0400, Jeffrey B. Green wrote:
> The setup is having nfs mounts on various machines and backups to
> "non-local" mounts. Having cron jobs starting on more than one machine and
> backing up to the other while the other is backing up to it causes various
> service freezes which need a system reboot the next morning to fix.
> Oftentimes the reboot is a manual "button" reboot since the reboot command
> will not follow through at that point.

It sounds to me like this is a kernel bug (nfs-kernel-server only contains
the userspace part required to set up the connection; it's all the kernel
from there). Reassigning.

/* Steinar */
-- 
Homepage: http://www.sesse.net/





Bug#440676: nfs-kernel-server: Simultaneous transfer on mirrored nfs mounts causes service freezes

2007-10-04 Thread Jens Seidel
On Fri, Sep 07, 2007 at 03:23:05PM +0200, Moritz Mühlenhoff wrote:
> Jeffrey B. Green:
> > > A very similar behaviour can be reproduced with i386 as well (with the
> > > Etch versions of the Linux kernel and nfs-kernel-server):

This patch also fixed an NFS freeze for me in my XEN domU. I just build
a software project in both dom0 and domU in parallel (but in different
build directories), and it freezes within the first 5 minutes. (Hardware:
a quad-core Intel CPU, dom0 openSUSE 10.2, domU Debian etch.)

It's now also much faster (by a factor of 5--10 for a configure script). It
is still not as fast as expected, so you may extend the patch :-)

> Could you please test the attached patch from RHEL5 and report if it resolves
> the problem for you?

It does, at least for me, so please apply the patch.

> Instructions on how to compile a modified kernel can be found at
> http://wiki.debian.org/DebianKernelCustomCompilation

I just copied the patch into patches/ of the Xen 3.1 source, added it to
patches/series and did a normal build of Xen. Fairly simple (at least if
one does not forget to update the modules in the domU)!

Thanks!

PS: Where does the patch come from? Are there more?

Jens