I just started seeing this bug on an NFS server (nfs-kernel-server
1:1.1.2-2ubuntu2.2) on 8.04LTS (2.6.24-27-server #1 SMP Fri Mar 12
01:23:09 UTC 2010 x86_64 GNU/Linux / Ubuntu 2.6.24-27.68-server) with
four NFS exports from two iSCSI initiated volumes (open-iscsi
2.0.865-1ubuntu3.3).  This NFS server is a virtual machine (VMware ESXi
3.5.0 build 169697) that was set up in October 2009.  The NFS server is
strictly serving an interim need of offloading data from an old
(OpenSuSE 10.0) Samba server's overflowing hard drives.

Up until last week the machine ran without any trouble.

Last week we added the second of the two iSCSI volumes and added an NFS
share to the space on that volume.  (All volumes, local disk and iSCSI,
are ext3.)  We mounted the new NFS volume from the Samba machine and
moved about 100GB of data off the old Samba server's local drives via
rsync.  No problem doing that.  We then deleted the data from the Samba
server and created symlinks in place of each moved folder, pointing to
the respective folder on the new NFS volume.
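
For anyone wanting to reproduce the layout, the "delete and symlink"
step can be sketched roughly as below.  This is only an illustration,
not the exact commands we ran: the function name and both paths are
hypothetical, it assumes Python 3, and it assumes the rsync copy has
already completed (it does not verify the copy).

```python
import os
import shutil


def replace_with_symlink(local_dir, nfs_dir):
    """Replace local_dir with a symlink pointing at nfs_dir.

    Hypothetical helper: nfs_dir is assumed to already hold a complete
    copy of the data. Nothing here checks that the copy is complete.
    """
    if not os.path.isdir(nfs_dir):
        raise FileNotFoundError("target folder is missing: %s" % nfs_dir)
    if os.path.islink(local_dir):
        return  # already replaced on an earlier run
    shutil.rmtree(local_dir)        # delete the old local copy
    os.symlink(nfs_dir, local_dir)  # leave a symlink in its place
```

Pass local_dir without a trailing slash, since the symlink is created
under exactly that name.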

The 100GB of data we moved is the backup area that the Windows users
write to with robocopy.exe.  This setup had worked fine for years.  But
now, nearly every time a robocopy job runs, the NFS server's kernel
hangs with the "soft lockup ... stuck for 11s" error being discussed on
this thread.  When this happens, the virtual machine is totally
unresponsive, and we have to do a hard reset.  The other virtual
machines on the VMware server do not seem to be affected in any way.

When we manually drag-and-drop 4GB of data (a typical amount being
robocopied by the users) we do not have the problem.  This is the first
NFS folder which has to handle data being copied (through the Samba
server, remember) using robocopy.

I'm no Linux kernel developer, but my two cents are that, under a heavy
write load, the iSCSI initiator stops responding for a few seconds and
the kernel's soft lockup check trips on that stall.  After doing some
research into this, we are going to try increasing
/proc/sys/kernel/softlockup_thresh from 10 to 60 seconds (the maximum
allowed value short of turning off the threshold check) for now and see
if that changes anything.  If my hypothesis is correct, it likely will.
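
A minimal sketch of the threshold change described above, in case it
helps anyone trying the same thing.  It assumes Python 3 and root
privileges; the function name is hypothetical, the upper bound of 60
comes from the limit mentioned above, and the lower bound of 1 is my
assumption.

```python
SOFTLOCKUP_THRESH = "/proc/sys/kernel/softlockup_thresh"


def set_softlockup_thresh(seconds, path=SOFTLOCKUP_THRESH):
    """Write a new soft lockup threshold and return the previous value.

    The path parameter exists only so the function can be tried against
    an ordinary file instead of the real /proc entry.
    """
    if not 1 <= seconds <= 60:  # lower bound is an assumption
        raise ValueError("threshold must be between 1 and 60 seconds")
    with open(path) as fh:
        previous = int(fh.read().strip())
    with open(path, "w") as fh:
        fh.write("%d\n" % seconds)
    return previous
```

Calling set_softlockup_thresh(60) as root would raise the threshold
until the next reboot.  To make it persistent, the same setting should
be expressible as kernel.softlockup_thresh in /etc/sysctl.conf.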

Perhaps these observations will be of some value to the community and
developers in piecing this puzzle together...

-- 
Server 8.04 LTS: soft lockup - CPU#1 stuck for 11s! [bond1:3795] - bond - bond0
https://bugs.launchpad.net/bugs/245779
You received this bug notification because you are a member of Ubuntu
Bugs, which is a direct subscriber.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
