Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Fri, Nov 02, 2007 at 08:56:55PM +0800, Fengguang Wu wrote: > > > --- > > > fs/reiserfs/stree.c |3 --- > > > 1 file changed, 3 deletions(-) > > > > > > --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c > > > +++ linux-2.6.24-git17/fs/reiserfs/stree.c > > > @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p > > > } > > > bh = next; > > > } while (bh != head); > > > - if (PAGE_SIZE == bh->b_size) { > > > - cancel_dirty_page(page, PAGE_CACHE_SIZE); > > > - } > > > } > > > } > > > } > > > > ... and with the above patch applied. > > > > Copying 300 MB from root (ext3) to the new file system did not trigger > > the pdflush condition. But then I did a > >cd $MOUNTPOINT && find . -exec md5sum {} \; > > and that brought one cpu to 75% iowait. > > Immediately? Do you have the debug printk messages this time(with the > above patch)? No, but I will add them this afternoon. > > I have attached my .config, if it helps. > > It's really curious - I tried your .config and commands, and still > could not trigger the high iowait. I'm running 64bit Intel Core 2, > and kernel 2.6.24-rc1-git6 with the above patch. Curious but 100% reproducible, at least on my box. What I'm going to try is booting into the kernel with your patch and just doing the find / md5sum. It would be really interesting if the read-only access triggers it. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Fri, Nov 02, 2007 at 08:56:55PM +0800, Fengguang Wu wrote: --- fs/reiserfs/stree.c |3 --- 1 file changed, 3 deletions(-) --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c +++ linux-2.6.24-git17/fs/reiserfs/stree.c @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p } bh = next; } while (bh != head); - if (PAGE_SIZE == bh-b_size) { - cancel_dirty_page(page, PAGE_CACHE_SIZE); - } } } } ... and with the above patch applied. Copying 300 MB from root (ext3) to the new file system did not trigger the pdflush condition. But then I did a cd $MOUNTPOINT find . -exec md5sum {} \; and that brought one cpu to 75% iowait. Immediately? Do you have the debug printk messages this time(with the above patch)? No, but I will add them this afternoon. I have attached my .config, if it helps. It's really curious - I tried your .config and commands, and still could not trigger the high iowait. I'm running 64bit Intel Core 2, and kernel 2.6.24-rc1-git6 with the above patch. Curious but 100% reproducible, at least on my box. What I'm going to try is booting into the kernel with your patch and just doing the find / md5sum. It would be really interesting if the read-only access triggers it. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Fri, Nov 02, 2007 at 09:33:21AM +0800, Fengguang Wu wrote: > > I will try that with a USB disk - I hope that won't make a difference. > > Thank you. I guess a reiserfs on loop file would also be OK. > > > > btw, what's the exact kernel version you are running? > > > > I noticed it with the kernel in the $SUBJECT, as reported by 'git > > describe'. I have pulled in new changesets since then. > > And with the following patch applied? > > --- > fs/reiserfs/stree.c |3 --- > 1 file changed, 3 deletions(-) > > --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c > +++ linux-2.6.24-git17/fs/reiserfs/stree.c > @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p > } > bh = next; > } while (bh != head); > - if (PAGE_SIZE == bh->b_size) { > - cancel_dirty_page(page, PAGE_CACHE_SIZE); > - } > } > } > } ... and with the above patch applied. Copying 300 MB from root (ext3) to the new file system did not trigger the pdflush condition. But then I did a cd $MOUNTPOINT && find . -exec md5sum {} \; and that brought one cpu to 75% iowait. I have attached my .config, if it helps. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24-rc1-7-fw2 # Wed Oct 31 07:27:14 2007 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y # CONFIG_IKCONFIG_PROC is not set CONFIG_LOG_BUF_SHIFT=19 CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set CONFIG_CGROUP_NS=y CONFIG_CGROUP_CPUACCT=y # CONFIG_CPUSETS is not set CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y # CONFIG_FAIR_CGROUP_SCHED is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set CONFIG_BLK_DEV_BSG=y CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=m CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y # CONFIG_NUMA is not set CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Thu, Nov 01, 2007 at 09:03:33PM +0800, Fengguang Wu wrote: > Or will the system or fs size/age make any difference? If you happen > to have a spare/swap partition, could you make a new reiserfs and > mount it and copy several less-than-4KB files into it and wait for 30s > and see what happen to pdflush? I will try that with a USB disk - I hope that won't make a difference. > btw, what's the exact kernel version you are running? I noticed it with the kernel in the $SUBJECT, as reported by 'git describe'. I have pulled in new changesets since then. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Thu, Nov 01, 2007 at 03:15:32PM +0800, Fengguang Wu wrote: > On Wed, Oct 31, 2007 at 12:53:18PM -0500, Florin Iucha wrote: > > This patch does not fix anything for me. Even such light use of the > > reiserfs filesystem as pulling the linux-2.6 git tree updates caused > > one CPU to go to 75% iowait. > > Thank you, Florin. Could you provide more details about sda7, such as > the mount option and output of `reiserfstune /dev/sda7`? I'll try to > reproduce it before asking for your help. Fengguang, [EMAIL PROTECTED]:~# mount | grep sda7 /dev/sda7 on /scratch type reiserfs (rw,noatime) [EMAIL PROTECTED]:~# df -h /scratch/ FilesystemSize Used Avail Use% Mounted on /dev/sda7 38G 32G 5.7G 85% /scratch [EMAIL PROTECTED]:~# umount /dev/sda7 [EMAIL PROTECTED]:~# reiserfstune /dev/sda7 reiserfstune: Journal device has not been specified. Assuming journal is on the main device (/dev/sda7). Current parameters: Filesystem state: consistent /scratch: Reiserfs super block in block 16 on 0x807 of format 3.6 with standard journal Count of blocks on the device: 9765504 Number of bitmaps: 299 Blocksize: 4096 Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 1471399 Root block: 2359332 Filesystem is clean Tree height: 5 Hash function used to sort names: "r5" Objectid map size 916, max 972 Journal parameters: Device [0x0] Magic [0x31037e64] Size 8193 blocks (including 1 for journal header) (first block 18) Max transaction length 1024 blocks Max batch size 900 blocks Max commit age 30 Blocks reserved by journal: 0 Fs state field: 0x0: sb_version: 2 inode generation number: 5856766 UUID: a0191e80-be6e-47f6-8fd0-047e2d763a4a LABEL: /scratch Set flags in SB: ATTRIBUTES CLEAN And for bonus points: ### reiserfsck --check started at Thu Nov 1 07:09:56 2007 ### Replaying journal.. Reiserfs journal '/dev/sda7' in blocks [18..8211]: 0 transactions replayed Checking internal tree..finished Comparing bitmaps..finished Checking Semantic tree: finished No corruptions found There are on the filesystem: Leaves 247231 Internal nodes 1570 Directories 11330 Other files 722878 Data block pointers 8040675 (3880 of them are zero) Safe links 0 ### reiserfsck finished at Thu Nov 1 07:18:43 2007 ### Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Thu, Nov 01, 2007 at 03:15:32PM +0800, Fengguang Wu wrote: On Wed, Oct 31, 2007 at 12:53:18PM -0500, Florin Iucha wrote: This patch does not fix anything for me. Even such light use of the reiserfs filesystem as pulling the linux-2.6 git tree updates caused one CPU to go to 75% iowait. Thank you, Florin. Could you provide more details about sda7, such as the mount option and output of `reiserfstune /dev/sda7`? I'll try to reproduce it before asking for your help. Fengguang, [EMAIL PROTECTED]:~# mount | grep sda7 /dev/sda7 on /scratch type reiserfs (rw,noatime) [EMAIL PROTECTED]:~# df -h /scratch/ FilesystemSize Used Avail Use% Mounted on /dev/sda7 38G 32G 5.7G 85% /scratch [EMAIL PROTECTED]:~# umount /dev/sda7 [EMAIL PROTECTED]:~# reiserfstune /dev/sda7 reiserfstune: Journal device has not been specified. Assuming journal is on the main device (/dev/sda7). Current parameters: Filesystem state: consistent /scratch: Reiserfs super block in block 16 on 0x807 of format 3.6 with standard journal Count of blocks on the device: 9765504 Number of bitmaps: 299 Blocksize: 4096 Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 1471399 Root block: 2359332 Filesystem is clean Tree height: 5 Hash function used to sort names: r5 Objectid map size 916, max 972 Journal parameters: Device [0x0] Magic [0x31037e64] Size 8193 blocks (including 1 for journal header) (first block 18) Max transaction length 1024 blocks Max batch size 900 blocks Max commit age 30 Blocks reserved by journal: 0 Fs state field: 0x0: sb_version: 2 inode generation number: 5856766 UUID: a0191e80-be6e-47f6-8fd0-047e2d763a4a LABEL: /scratch Set flags in SB: ATTRIBUTES CLEAN And for bonus points: ### reiserfsck --check started at Thu Nov 1 07:09:56 2007 ### Replaying journal.. Reiserfs journal '/dev/sda7' in blocks [18..8211]: 0 transactions replayed Checking internal tree..finished Comparing bitmaps..finished Checking Semantic tree: finished No corruptions found There are on the filesystem: Leaves 247231 Internal nodes 1570 Directories 11330 Other files 722878 Data block pointers 8040675 (3880 of them are zero) Safe links 0 ### reiserfsck finished at Thu Nov 1 07:18:43 2007 ### Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Thu, Nov 01, 2007 at 09:03:33PM +0800, Fengguang Wu wrote: Or will the system or fs size/age make any difference? If you happen to have a spare/swap partition, could you make a new reiserfs and mount it and copy several less-than-4KB files into it and wait for 30s and see what happen to pdflush? I will try that with a USB disk - I hope that won't make a difference. btw, what's the exact kernel version you are running? I noticed it with the kernel in the $SUBJECT, as reported by 'git describe'. I have pulled in new changesets since then. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Fri, Nov 02, 2007 at 09:33:21AM +0800, Fengguang Wu wrote: I will try that with a USB disk - I hope that won't make a difference. Thank you. I guess a reiserfs on loop file would also be OK. btw, what's the exact kernel version you are running? I noticed it with the kernel in the $SUBJECT, as reported by 'git describe'. I have pulled in new changesets since then. And with the following patch applied? --- fs/reiserfs/stree.c |3 --- 1 file changed, 3 deletions(-) --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c +++ linux-2.6.24-git17/fs/reiserfs/stree.c @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p } bh = next; } while (bh != head); - if (PAGE_SIZE == bh-b_size) { - cancel_dirty_page(page, PAGE_CACHE_SIZE); - } } } } ... and with the above patch applied. Copying 300 MB from root (ext3) to the new file system did not trigger the pdflush condition. But then I did a cd $MOUNTPOINT find . -exec md5sum {} \; and that brought one cpu to 75% iowait. I have attached my .config, if it helps. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24-rc1-7-fw2 # Wed Oct 31 07:27:14 2007 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION= # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y # CONFIG_IKCONFIG_PROC is not set CONFIG_LOG_BUF_SHIFT=19 CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set CONFIG_CGROUP_NS=y CONFIG_CGROUP_CPUACCT=y # CONFIG_CPUSETS is not set CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y # CONFIG_FAIR_CGROUP_SCHED is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set CONFIG_BLK_DEV_BSG=y CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED=cfq # # Processor type and features # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=m CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y # CONFIG_NUMA is not set CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set #
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Wed, Oct 31, 2007 at 07:16:06AM -0500, Florin Iucha wrote: > On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote: > > On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote: > > > On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: > > > > I have added the patches and started a linux kernel compilation, and > > > > something really interesting happens. I run the build with the > > > > equivalent of "make -j3" and in a separate console I am watching the > > > > build with 'top'. The build consumes 98% of both CPUs. If I stop the > > > > output in the build console with "Ctrl-S", one core goes to idle, > > > > while the other is in 50% waiting, then goes to 75% waiting. When I > > > > resume the build with "Ctrl-Q", the build starts to use both CPUs at > > > > 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but > > > > just logged on the console. Also, the CPU that is in 75% waiting > > > > state changes occasionally. 'Top' shows pdflush in D state, using > > > > 5-6% of CPU. > > > > > > I forgot the traces: > > > > > >http://iucha.net/2.6.24-rc1/fw.1.gz > > >http://iucha.net/2.6.24-rc1/fw.2.gz > > >http://iucha.net/2.6.24-rc1/fw.3.gz > > > > Sorry for the delay - I've been fixing our server today. > > > > [ 263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 > > global 4593 0 0 wc _M tw 1024 sk 0 > > [ 263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) > > [ 263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) > > [ 263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) > > [ 263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) > > [ 263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) > > [ 263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) > > [ 263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) > > [ 263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) > > [ 263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) > > [ 263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 > > global 4593 0 0 wc _M tw 1024 sk 0 > > [ 263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) > > [ 263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) > > [ 263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) > > [ 263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) > > [ 263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) > > [ 263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) > > [ 263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) > > [ 263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) > > [ 263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) > > [ 263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 > > global 4593 0 0 wc _M tw 1024 sk 0 > > > > It's about sda7, not NFSv4. > > > > Is it a Reiserfs? We have a fresh fix for it: > > http://lkml.org/lkml/2007/10/23/93 > > Yes, it is a Reiserfs. Incidentally it is the partition that holds > the kernel sources and build directory. The message states that the > same bug exists in 2.6.23 but I do not see the same behavior in > 2.6.23. Anyway, I will apply the patch and see what I get. Fengguang, This patch does not fix anything for me. Even such light use of the reiserfs filesystem as pulling the linux-2.6 git tree updates caused one CPU to go to 75% iowait. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote: > On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote: > > On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: > > > I have added the patches and started a linux kernel compilation, and > > > something really interesting happens. I run the build with the > > > equivalent of "make -j3" and in a separate console I am watching the > > > build with 'top'. The build consumes 98% of both CPUs. If I stop the > > > output in the build console with "Ctrl-S", one core goes to idle, > > > while the other is in 50% waiting, then goes to 75% waiting. When I > > > resume the build with "Ctrl-Q", the build starts to use both CPUs at > > > 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but > > > just logged on the console. Also, the CPU that is in 75% waiting > > > state changes occasionally. 'Top' shows pdflush in D state, using > > > 5-6% of CPU. > > > > I forgot the traces: > > > >http://iucha.net/2.6.24-rc1/fw.1.gz > >http://iucha.net/2.6.24-rc1/fw.2.gz > >http://iucha.net/2.6.24-rc1/fw.3.gz > > Sorry for the delay - I've been fixing our server today. > > [ 263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global > 4593 0 0 wc _M tw 1024 sk 0 > [ 263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) > [ 263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) > [ 263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) > [ 263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) > [ 263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) > [ 263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) > [ 263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) > [ 263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) > [ 263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) > [ 263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global > 4593 0 0 wc _M tw 1024 sk 0 > [ 263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) > [ 263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) > [ 263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) > [ 263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) > [ 263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) > [ 263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) > [ 263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) > [ 263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) > [ 263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) > [ 263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global > 4593 0 0 wc _M tw 1024 sk 0 > > It's about sda7, not NFSv4. > > Is it a Reiserfs? We have a fresh fix for it: > http://lkml.org/lkml/2007/10/23/93 Yes, it is a Reiserfs. Incidentally it is the partition that holds the kernel sources and build directory. The message states that the same bug exists in 2.6.23 but I do not see the same behavior in 2.6.23. Anyway, I will apply the patch and see what I get. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote: On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote: On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: I have added the patches and started a linux kernel compilation, and something really interesting happens. I run the build with the equivalent of make -j3 and in a separate console I am watching the build with 'top'. The build consumes 98% of both CPUs. If I stop the output in the build console with Ctrl-S, one core goes to idle, while the other is in 50% waiting, then goes to 75% waiting. When I resume the build with Ctrl-Q, the build starts to use both CPUs at 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but just logged on the console. Also, the CPU that is in 75% waiting state changes occasionally. 'Top' shows pdflush in D state, using 5-6% of CPU. I forgot the traces: http://iucha.net/2.6.24-rc1/fw.1.gz http://iucha.net/2.6.24-rc1/fw.2.gz http://iucha.net/2.6.24-rc1/fw.3.gz Sorry for the delay - I've been fixing our server today. [ 263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 [ 263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) [ 263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) [ 263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) [ 263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) [ 263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) [ 263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) [ 263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) [ 263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) [ 263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) [ 263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 [ 263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) [ 263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) [ 263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) [ 263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) [ 263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) [ 263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) [ 263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) [ 263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) [ 263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) [ 263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 It's about sda7, not NFSv4. Is it a Reiserfs? We have a fresh fix for it: http://lkml.org/lkml/2007/10/23/93 Yes, it is a Reiserfs. Incidentally it is the partition that holds the kernel sources and build directory. The message states that the same bug exists in 2.6.23 but I do not see the same behavior in 2.6.23. Anyway, I will apply the patch and see what I get. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Wed, Oct 31, 2007 at 07:16:06AM -0500, Florin Iucha wrote: On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote: On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote: On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: I have added the patches and started a linux kernel compilation, and something really interesting happens. I run the build with the equivalent of make -j3 and in a separate console I am watching the build with 'top'. The build consumes 98% of both CPUs. If I stop the output in the build console with Ctrl-S, one core goes to idle, while the other is in 50% waiting, then goes to 75% waiting. When I resume the build with Ctrl-Q, the build starts to use both CPUs at 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but just logged on the console. Also, the CPU that is in 75% waiting state changes occasionally. 'Top' shows pdflush in D state, using 5-6% of CPU. I forgot the traces: http://iucha.net/2.6.24-rc1/fw.1.gz http://iucha.net/2.6.24-rc1/fw.2.gz http://iucha.net/2.6.24-rc1/fw.3.gz Sorry for the delay - I've been fixing our server today. [ 263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 [ 263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) [ 263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) [ 263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) [ 263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) [ 263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) [ 263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) [ 263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) [ 263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) [ 263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) [ 263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 [ 263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7) [ 263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7) [ 263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7) [ 263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7) [ 263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7) [ 263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7) [ 263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7) [ 263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7) [ 263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7) [ 263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 4593 0 0 wc _M tw 1024 sk 0 It's about sda7, not NFSv4. Is it a Reiserfs? We have a fresh fix for it: http://lkml.org/lkml/2007/10/23/93 Yes, it is a Reiserfs. Incidentally it is the partition that holds the kernel sources and build directory. The message states that the same bug exists in 2.6.23 but I do not see the same behavior in 2.6.23. Anyway, I will apply the patch and see what I get. Fengguang, This patch does not fix anything for me. Even such light use of the reiserfs filesystem as pulling the linux-2.6 git tree updates caused one CPU to go to 75% iowait. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: > I have added the patches and started a linux kernel compilation, and > something really interesting happens. I run the build with the > equivalent of "make -j3" and in a separate console I am watching the > build with 'top'. The build consumes 98% of both CPUs. If I stop the > output in the build console with "Ctrl-S", one core goes to idle, > while the other is in 50% waiting, then goes to 75% waiting. When I > resume the build with "Ctrl-Q", the build starts to use both CPUs at > 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but > just logged on the console. Also, the CPU that is in 75% waiting > state changes occasionally. 'Top' shows pdflush in D state, using > 5-6% of CPU. I forgot the traces: http://iucha.net/2.6.24-rc1/fw.1.gz http://iucha.net/2.6.24-rc1/fw.2.gz http://iucha.net/2.6.24-rc1/fw.3.gz florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote: > > > It could be triggered by the more aggressive writeback behavior - the > > > new code will keep on retrying as long as there are dirty inodes pending. > > > > > > Florin, would you try the attached patches against 2.6.24-git? > > > They may generate big traffic of printk messages, but will help > > > debug the problem. > > > > I have updated to v2.6.24-rc1-334-g82798a1. After using my computer > > for two hours, I left the computer idle overnight. This morning, > > pdflushd is again consuming 25% of a CPU. I will try Fengguang's > > patches today. > > Thank you. > > Be sure to stop klogd ;-) > A `dmesg` will be sufficient for debugging. I have added the patches and started a linux kernel compilation, and something really interesting happens. I run the build with the equivalent of "make -j3" and in a separate console I am watching the build with 'top'. The build consumes 98% of both CPUs. If I stop the output in the build console with "Ctrl-S", one core goes to idle, while the other is in 50% waiting, then goes to 75% waiting. When I resume the build with "Ctrl-Q", the build starts to use both CPUs at 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but just logged on the console. Also, the CPU that is in 75% waiting state changes occasionally. 'Top' shows pdflush in D state, using 5-6% of CPU. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote: > On Tue, Oct 30, 2007 at 06:42:50AM -0500, Florin Iucha wrote: > > On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote: > > > It could be triggered by the more aggressive writeback behavior - the > > > new code will keep on retrying as long as there are dirty inodes pending. > > > > > > Florin, would you try the attached patches against 2.6.24-git? > > > They may generate big traffic of printk messages, but will help > > > debug the problem. > > > > I have updated to v2.6.24-rc1-334-g82798a1. After using my computer > > for two hours, I left the computer idle overnight. This morning, > > pdflushd is again consuming 25% of a CPU. I will try Fengguang's > > patches today. > > Thank you. > > Be sure to stop klogd ;-) > A `dmesg` will be sufficient for debugging. > > btw, I cannot reproduce it with simple NFSv3 tests. This is using nfsv4. And being idle in a GNOME session with home mounted from the server is hardly a stressful experience. But what do I know ;) florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote: > On Sun, Oct 28, 2007 at 10:24:29AM -0500, Florin Iucha wrote: > [...] > >[ 3687.824468] > >[ 3687.824470] pdflush D 805787c0 0 248 2 > >[ 3687.824473] 810006001d90 0046 > > 0286 > >[ 3687.824476] 8100057fc770 810003062000 8100057fc978 > > 000106001da0 > >[ 3687.824480] 0003 8023b1b2 > > > >[ 3687.824483] Call Trace: > >[ 3687.824488] [] __mod_timer+0xb8/0xca > >[ 3687.824492] [] schedule_timeout+0x8d/0xb4 > >[ 3687.824496] [] process_timeout+0x0/0xb > >[ 3687.824499] [] io_schedule_timeout+0x28/0x33 > >[ 3687.824503] [] congestion_wait+0x6b/0x87 > >[ 3687.824506] [] autoremove_wake_function+0x0/0x38 > >[ 3687.824510] [] writeback_inodes+0xcd/0xd5 > >[ 3687.824514] [] wb_kupdate+0xbb/0x10d > >[ 3687.824518] [] pdflush+0x0/0x1c3 > >[ 3687.824520] [] pdflush+0x118/0x1c3 > >[ 3687.824523] [] wb_kupdate+0x0/0x10d > >[ 3687.824527] [] kthread+0x49/0x77 > >[ 3687.824530] [] child_rip+0xa/0x12 > >[ 3687.824535] [] kthread+0x0/0x77 > >[ 3687.824538] [] child_rip+0x0/0x12 > >[ 3687.824540] > > > > What could cause this? I use NFS4 to automount the home directories > > from a Solaris10 server, and this box found a few bugs in the NFS4 > > code (fixed in the 2.6.22 kernel). > > > > I'll try running with 2.6.23 again for a few days, to see if I get the > > pdflush stuck. Any other ideas? > > It could be triggered by the more aggressive writeback behavior - the > new code will keep on retrying as long as there are dirty inodes pending. > > Florin, would you try the attached patches against 2.6.24-git? > They may generate big traffic of printk messages, but will help > debug the problem. I have updated to v2.6.24-rc1-334-g82798a1. After using my computer for two hours, I left the computer idle overnight. This morning, pdflushd is again consuming 25% of a CPU. I will try Fengguang's patches today. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote: On Tue, Oct 30, 2007 at 06:42:50AM -0500, Florin Iucha wrote: On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote: It could be triggered by the more aggressive writeback behavior - the new code will keep on retrying as long as there are dirty inodes pending. Florin, would you try the attached patches against 2.6.24-git? They may generate big traffic of printk messages, but will help debug the problem. I have updated to v2.6.24-rc1-334-g82798a1. After using my computer for two hours, I left the computer idle overnight. This morning, pdflushd is again consuming 25% of a CPU. I will try Fengguang's patches today. Thank you. Be sure to stop klogd ;-) A `dmesg` will be sufficient for debugging. btw, I cannot reproduce it with simple NFSv3 tests. This is using nfsv4. And being idle in a GNOME session with home mounted from the server is hardly a stressful experience. But what do I know ;) florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote: On Sun, Oct 28, 2007 at 10:24:29AM -0500, Florin Iucha wrote: [...] [ 3687.824468] [ 3687.824470] pdflush D 805787c0 0 248 2 [ 3687.824473] 810006001d90 0046 0286 [ 3687.824476] 8100057fc770 810003062000 8100057fc978 000106001da0 [ 3687.824480] 0003 8023b1b2 [ 3687.824483] Call Trace: [ 3687.824488] [8023b1b2] __mod_timer+0xb8/0xca [ 3687.824492] [8055c87a] schedule_timeout+0x8d/0xb4 [ 3687.824496] [8023ad6c] process_timeout+0x0/0xb [ 3687.824499] [8055c79a] io_schedule_timeout+0x28/0x33 [ 3687.824503] [8026bb24] congestion_wait+0x6b/0x87 [ 3687.824506] [80245983] autoremove_wake_function+0x0/0x38 [ 3687.824510] [8029e684] writeback_inodes+0xcd/0xd5 [ 3687.824514] [80266dc4] wb_kupdate+0xbb/0x10d [ 3687.824518] [80267165] pdflush+0x0/0x1c3 [ 3687.824520] [8026727d] pdflush+0x118/0x1c3 [ 3687.824523] [80266d09] wb_kupdate+0x0/0x10d [ 3687.824527] [80245876] kthread+0x49/0x77 [ 3687.824530] [8020c598] child_rip+0xa/0x12 [ 3687.824535] [8024582d] kthread+0x0/0x77 [ 3687.824538] [8020c58e] child_rip+0x0/0x12 [ 3687.824540] What could cause this? I use NFS4 to automount the home directories from a Solaris10 server, and this box found a few bugs in the NFS4 code (fixed in the 2.6.22 kernel). I'll try running with 2.6.23 again for a few days, to see if I get the pdflush stuck. Any other ideas? It could be triggered by the more aggressive writeback behavior - the new code will keep on retrying as long as there are dirty inodes pending. Florin, would you try the attached patches against 2.6.24-git? They may generate big traffic of printk messages, but will help debug the problem. I have updated to v2.6.24-rc1-334-g82798a1. After using my computer for two hours, I left the computer idle overnight. This morning, pdflushd is again consuming 25% of a CPU. I will try Fengguang's patches today. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote: It could be triggered by the more aggressive writeback behavior - the new code will keep on retrying as long as there are dirty inodes pending. Florin, would you try the attached patches against 2.6.24-git? They may generate big traffic of printk messages, but will help debug the problem. I have updated to v2.6.24-rc1-334-g82798a1. After using my computer for two hours, I left the computer idle overnight. This morning, pdflushd is again consuming 25% of a CPU. I will try Fengguang's patches today. Thank you. Be sure to stop klogd ;-) A `dmesg` will be sufficient for debugging. I have added the patches and started a linux kernel compilation, and something really interesting happens. I run the build with the equivalent of make -j3 and in a separate console I am watching the build with 'top'. The build consumes 98% of both CPUs. If I stop the output in the build console with Ctrl-S, one core goes to idle, while the other is in 50% waiting, then goes to 75% waiting. When I resume the build with Ctrl-Q, the build starts to use both CPUs at 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but just logged on the console. Also, the CPU that is in 75% waiting state changes occasionally. 'Top' shows pdflush in D state, using 5-6% of CPU. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote: I have added the patches and started a linux kernel compilation, and something really interesting happens. I run the build with the equivalent of make -j3 and in a separate console I am watching the build with 'top'. The build consumes 98% of both CPUs. If I stop the output in the build console with Ctrl-S, one core goes to idle, while the other is in 50% waiting, then goes to 75% waiting. When I resume the build with Ctrl-Q, the build starts to use both CPUs at 98-99%. The NFS4 use was minimal, as I did not login with Gnome, but just logged on the console. Also, the CPU that is in 75% waiting state changes occasionally. 'Top' shows pdflush in D state, using 5-6% of CPU. I forgot the traces: http://iucha.net/2.6.24-rc1/fw.1.gz http://iucha.net/2.6.24-rc1/fw.2.gz http://iucha.net/2.6.24-rc1/fw.3.gz florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Mon, Oct 29, 2007 at 02:43:32PM -0400, Trond Myklebust wrote: > > On Mon, 2007-10-29 at 10:01 -0500, Florin Iucha wrote: > > On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote: > > > > What could cause this? I use NFS4 to automount the home directories > > > > from a Solaris10 server, and this box found a few bugs in the NFS4 > > > > code (fixed in the 2.6.22 kernel). > > > > > > > > I'll try running with 2.6.23 again for a few days, to see if I get the > > > > pdflush stuck. Any other ideas? > > > > > > One of them appears to be waiting for i/o congestion to clear up. If the > > > filesystem is NFS, then that means that some other thread is busy > > > writing data out to the server. You'll need to look at the rest of the > > > thread dump to figure out which thread is writing the data out, and > > > where it is getting stuck. > > > > Trond, > > > > The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz > > I can't see any evidence of NFS traffic at all in those traces, but > there is a fuse process that is sleeping in :fuse:fuse_dev_read(). Could > that perhaps be relevant. That might be the overzealous Ubuntu trying to make the NTFS partition available. I will try to disable it and see if I can reproduce the hang. BTW: With 2.6.24-rc1+ it happens after a couple of hours. With 2.6.23 it did not happen after 6 hours or so. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote: > > What could cause this? I use NFS4 to automount the home directories > > from a Solaris10 server, and this box found a few bugs in the NFS4 > > code (fixed in the 2.6.22 kernel). > > > > I'll try running with 2.6.23 again for a few days, to see if I get the > > pdflush stuck. Any other ideas? > > One of them appears to be waiting for i/o congestion to clear up. If the > filesystem is NFS, then that means that some other thread is busy > writing data out to the server. You'll need to look at the rest of the > thread dump to figure out which thread is writing the data out, and > where it is getting stuck. Trond, The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote: What could cause this? I use NFS4 to automount the home directories from a Solaris10 server, and this box found a few bugs in the NFS4 code (fixed in the 2.6.22 kernel). I'll try running with 2.6.23 again for a few days, to see if I get the pdflush stuck. Any other ideas? One of them appears to be waiting for i/o congestion to clear up. If the filesystem is NFS, then that means that some other thread is busy writing data out to the server. You'll need to look at the rest of the thread dump to figure out which thread is writing the data out, and where it is getting stuck. Trond, The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
On Mon, Oct 29, 2007 at 02:43:32PM -0400, Trond Myklebust wrote: On Mon, 2007-10-29 at 10:01 -0500, Florin Iucha wrote: On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote: What could cause this? I use NFS4 to automount the home directories from a Solaris10 server, and this box found a few bugs in the NFS4 code (fixed in the 2.6.22 kernel). I'll try running with 2.6.23 again for a few days, to see if I get the pdflush stuck. Any other ideas? One of them appears to be waiting for i/o congestion to clear up. If the filesystem is NFS, then that means that some other thread is busy writing data out to the server. You'll need to look at the rest of the thread dump to figure out which thread is writing the data out, and where it is getting stuck. Trond, The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz I can't see any evidence of NFS traffic at all in those traces, but there is a fuse process that is sleeping in :fuse:fuse_dev_read(). Could that perhaps be relevant. That might be the overzealous Ubuntu trying to make the NTFS partition available. I will try to disable it and see if I can reproduce the hang. BTW: With 2.6.24-rc1+ it happens after a couple of hours. With 2.6.23 it did not happen after 6 hours or so. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
Hello, For a week or two I started noticing that some time after I'm logged in, my keyboard input becomes a bit staggering, there is a small delay between the keypress and the actual character appearing in the terminal. This is on a AMD Athlon x2 4200+ with 2 GB RAM and just a gnome-terminal open. The machine is as idle as possible - monitored via the system monitor applet. I could not get any hard data on it, until now. After I logged off from GNOME, I switched to the text console and ran top, with the option of showing one CPU stats line for each CPU. Lo and behold, one core is 100% idle, and the other one is 25% idle and 75% waiting. Periodically, a pdflush process in 'D' state raises to the top. I did a 'echo t > /proc/sysrequest-trigger' and this is what is says for the two pdflush processes: [ 3687.824424] pdflush S 8100057ffef8 0 247 2 [ 3687.824427] 8100057ffed0 0046 8100057ffe70 8022a96c [ 3687.824431] 8100057fc000 810003040770 8100057fc208 0297 [ 3687.824434] 8100057ffe90 810002c1ba10 8100057ffed0 8022b9d2 [ 3687.824438] Call Trace: [ 3687.824440] [] enqueue_task_fair+0x21/0x34 [ 3687.82] [] set_user_nice+0x110/0x12c [ 3687.824448] [] pdflush+0x0/0x1c3 [ 3687.824451] [] pdflush+0xcf/0x1c3 [ 3687.824455] [] kthread+0x49/0x77 [ 3687.824458] [] child_rip+0xa/0x12 [ 3687.824463] [] kthread+0x0/0x77 [ 3687.824466] [] child_rip+0x0/0x12 [ 3687.824468] [ 3687.824470] pdflush D 805787c0 0 248 2 [ 3687.824473] 810006001d90 0046 0286 [ 3687.824476] 8100057fc770 810003062000 8100057fc978 000106001da0 [ 3687.824480] 0003 8023b1b2 [ 3687.824483] Call Trace: [ 3687.824488] [] __mod_timer+0xb8/0xca [ 3687.824492] [] schedule_timeout+0x8d/0xb4 [ 3687.824496] [] process_timeout+0x0/0xb [ 3687.824499] [] io_schedule_timeout+0x28/0x33 [ 3687.824503] [] congestion_wait+0x6b/0x87 [ 3687.824506] [] autoremove_wake_function+0x0/0x38 [ 3687.824510] [] writeback_inodes+0xcd/0xd5 [ 3687.824514] [] wb_kupdate+0xbb/0x10d [ 3687.824518] [] pdflush+0x0/0x1c3 [ 3687.824520] [] pdflush+0x118/0x1c3 [ 3687.824523] [] wb_kupdate+0x0/0x10d [ 3687.824527] [] kthread+0x49/0x77 [ 3687.824530] [] child_rip+0xa/0x12 [ 3687.824535] [] kthread+0x0/0x77 [ 3687.824538] [] child_rip+0x0/0x12 [ 3687.824540] What could cause this? I use NFS4 to automount the home directories from a Solaris10 server, and this box found a few bugs in the NFS4 code (fixed in the 2.6.22 kernel). I'll try running with 2.6.23 again for a few days, to see if I get the pdflush stuck. Any other ideas? florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
pdflush stuck in D state with v2.6.24-rc1-192-gef49c32
Hello, For a week or two I started noticing that some time after I'm logged in, my keyboard input becomes a bit staggering, there is a small delay between the keypress and the actual character appearing in the terminal. This is on a AMD Athlon x2 4200+ with 2 GB RAM and just a gnome-terminal open. The machine is as idle as possible - monitored via the system monitor applet. I could not get any hard data on it, until now. After I logged off from GNOME, I switched to the text console and ran top, with the option of showing one CPU stats line for each CPU. Lo and behold, one core is 100% idle, and the other one is 25% idle and 75% waiting. Periodically, a pdflush process in 'D' state raises to the top. I did a 'echo t /proc/sysrequest-trigger' and this is what is says for the two pdflush processes: [ 3687.824424] pdflush S 8100057ffef8 0 247 2 [ 3687.824427] 8100057ffed0 0046 8100057ffe70 8022a96c [ 3687.824431] 8100057fc000 810003040770 8100057fc208 0297 [ 3687.824434] 8100057ffe90 810002c1ba10 8100057ffed0 8022b9d2 [ 3687.824438] Call Trace: [ 3687.824440] [8022a96c] enqueue_task_fair+0x21/0x34 [ 3687.82] [8022b9d2] set_user_nice+0x110/0x12c [ 3687.824448] [80267165] pdflush+0x0/0x1c3 [ 3687.824451] [80267234] pdflush+0xcf/0x1c3 [ 3687.824455] [80245876] kthread+0x49/0x77 [ 3687.824458] [8020c598] child_rip+0xa/0x12 [ 3687.824463] [8024582d] kthread+0x0/0x77 [ 3687.824466] [8020c58e] child_rip+0x0/0x12 [ 3687.824468] [ 3687.824470] pdflush D 805787c0 0 248 2 [ 3687.824473] 810006001d90 0046 0286 [ 3687.824476] 8100057fc770 810003062000 8100057fc978 000106001da0 [ 3687.824480] 0003 8023b1b2 [ 3687.824483] Call Trace: [ 3687.824488] [8023b1b2] __mod_timer+0xb8/0xca [ 3687.824492] [8055c87a] schedule_timeout+0x8d/0xb4 [ 3687.824496] [8023ad6c] process_timeout+0x0/0xb [ 3687.824499] [8055c79a] io_schedule_timeout+0x28/0x33 [ 3687.824503] [8026bb24] congestion_wait+0x6b/0x87 [ 3687.824506] [80245983] autoremove_wake_function+0x0/0x38 [ 3687.824510] [8029e684] writeback_inodes+0xcd/0xd5 [ 3687.824514] [80266dc4] wb_kupdate+0xbb/0x10d [ 3687.824518] [80267165] pdflush+0x0/0x1c3 [ 3687.824520] [8026727d] pdflush+0x118/0x1c3 [ 3687.824523] [80266d09] wb_kupdate+0x0/0x10d [ 3687.824527] [80245876] kthread+0x49/0x77 [ 3687.824530] [8020c598] child_rip+0xa/0x12 [ 3687.824535] [8024582d] kthread+0x0/0x77 [ 3687.824538] [8020c58e] child_rip+0x0/0x12 [ 3687.824540] What could cause this? I use NFS4 to automount the home directories from a Solaris10 server, and this box found a few bugs in the NFS4 code (fixed in the 2.6.22 kernel). I'll try running with 2.6.23 again for a few days, to see if I get the pdflush stuck. Any other ideas? florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91
On Tue, Oct 23, 2007 at 07:46:37AM -0500, Florin Iucha wrote: > [ 60.656136] Unable to handle kernel NULL pointer dereference at > RIP: > [ 60.656143] [] blk_rq_map_sg+0x10d/0x17c > [ 60.656151] PGD 4640067 PUD 46d4067 PMD 0 > [ 60.656154] Oops: [1] SMP > [ 60.656157] CPU 1 > ... There was a DVD in the drive. After the OOPS, I cannot eject it via the button, and the "eject" command is stuck in "D" state: [ 436.308282] eject D 80571760 0 5336 5324 [ 436.308285] 810007c35d08 0082 810007c35ca8 [ 436.308288] 810006fb15f0 810003062000 810006fb17f8 0001 [ 436.308292] 0003 8100057e1070 [ 436.308295] Call Trace: [ 436.308301] [] __mutex_lock_slowpath+0x133/0x23c [ 436.308306] [] mutex_lock+0x19/0x1d [ 436.308309] [] do_open+0x74/0x2d1 [ 436.308313] [] blkdev_open+0x0/0x69 [ 436.308315] [] blkdev_open+0x37/0x69 [ 436.308319] [] __dentry_open+0xe6/0x1bd [ 436.308323] [] nameidata_to_filp+0x2d/0x3f [ 436.308326] [] do_filp_open+0x39/0x4b [ 436.308330] [] _spin_unlock+0x9/0xb [ 436.308333] [] get_unused_fd_flags+0x113/0x121 [ 436.308337] [] do_sys_open+0x51/0xd9 [ 436.308341] [] sys_open+0x1b/0x1d [ 436.308343] [] system_call+0x7e/0x83 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91
Jens, This is freshly after booting into this morning's kernel: [ 60.656136] Unable to handle kernel NULL pointer dereference at RIP: [ 60.656143] [] blk_rq_map_sg+0x10d/0x17c [ 60.656151] PGD 4640067 PUD 46d4067 PMD 0 [ 60.656154] Oops: [1] SMP [ 60.656157] CPU 1 [ 60.656159] Modules linked in: sbp2 lp dvb_pll lgdt330x cx88_dvb cx88_vp3054_i2c videobuf_dvb tuner tea5767 td a8290 tuner_simple mt20xx cx88_alsa cx8802 cx8800 cx88xx ir_common tveeprom videobuf_dma_sg videobuf_core btcx_ri sc i2c_nforce2 evdev rtc forcedeth ehci_hcd fuse [ 60.656176] Pid: 4250, comm: hald-probe-stor Not tainted 2.6.24-rc0-5 #1 [ 60.656178] RIP: 0010:[] [] blk_rq_map_sg+0x10d/0x17c [ 60.656182] RSP: 0018:810004791930 EFLAGS: 00010246 [ 60.656184] RAX: 0403b000 RBX: 1000 RCX: 6db6db6db6db6db7 [ 60.656187] RDX: RSI: 81000100 RDI: 05701000 [ 60.656189] RBP: 810004791968 R08: 0570 R09: 8100044aa060 [ 60.656191] R10: R11: 8100050dea00 R12: 2000 [ 60.656193] R13: 8100060d2700 R14: R15: 807f [ 60.656196] FS: 2b5da088e6e0() GS:810003011500() knlGS: [ 60.656198] CS: 0010 DS: ES: CR0: 8005003b [ 60.656200] CR2: CR3: 04568000 CR4: 06e0 [ 60.656202] DR0: DR1: DR2: [ 60.656204] DR3: DR6: 0ff0 DR7: 0400 [ 60.656207] Process hald-probe-stor (pid: 4250, threadinfo 81000479, task 810006312000) [ 60.656208] Stack: 81000607a000 00010001 8100040fa120 807fe2c0 [ 60.656213] 81000607a000 81000607a000 807fe2c0 8100047919a8 [ 60.656217] 8041bb58 8100047919a8 8100040fa120 807fe2c0 [ 60.656220] Call Trace: [ 60.656226] [] ide_map_sg+0x38/0xb0 [ 60.656231] [] cdrom_start_read_continuation+0x0/0xb5 [ 60.656234] [] ide_build_sglist+0x38/0x88 [ 60.656238] [] ide_build_dmatable+0x2f/0x172 [ 60.656241] [] ide_dma_setup+0x34/0xaa [ 60.656245] [] cdrom_start_packet_command+0x5a/0x177 [ 60.656249] [] cfq_dispatch_insert+0x38/0x50 [ 60.656253] [] ide_do_rw_cdrom+0x423/0x57c [ 60.656257] [] ide_do_request+0x7a7/0xa74 [ 60.656263] [] del_timer+0x52/0x5d [ 60.656267] [] sync_page+0x0/0x45 [ 60.656269] [] do_ide_request+0x1b/0x1d [ 60.656273] [] __generic_unplug_device+0x28/0x2c [ 60.656276] [] generic_unplug_device+0x20/0x31 [ 60.656279] [] blk_backing_dev_unplug+0x16/0x18 [ 60.656283] [] block_sync_page+0x42/0x44 [ 60.656285] [] sync_page+0x3c/0x45 [ 60.656290] [] __wait_on_bit_lock+0x42/0x79 [ 60.656294] [] __lock_page+0x64/0x6b [ 60.656298] [] wake_bit_function+0x0/0x2a [ 60.656301] [] do_generic_mapping_read+0x1da/0x383 [ 60.656304] [] file_read_actor+0x0/0x137 [ 60.656309] [] generic_file_aio_read+0x11e/0x15d [ 60.656315] [] do_sync_read+0xe2/0x126 [ 60.656318] [] handle_mm_fault+0x62e/0x65e [ 60.656324] [] __up_read+0x8f/0x97 [ 60.656327] [] autoremove_wake_function+0x0/0x38 [ 60.656331] [] __mutex_lock_slowpath+0x22f/0x23c [ 60.656337] [] vfs_read+0xab/0x134 [ 60.656341] [] sys_read+0x47/0x6f [ 60.656345] [] system_call+0x7e/0x83 [ 60.656349] [ 60.656350] [ 60.656350] Code: 49 8b 02 41 c7 42 18 00 00 00 00 49 c7 42 10 00 00 00 00 83 [ 60.656359] RIP [] blk_rq_map_sg+0x10d/0x17c [ 60.656362] RSP [ 60.656363] CR2: Platform is AMD64 and the userspace is Ubuntu 7/10 Gutsy Gibbon. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91
Jens, This is freshly after booting into this morning's kernel: [ 60.656136] Unable to handle kernel NULL pointer dereference at RIP: [ 60.656143] [80375553] blk_rq_map_sg+0x10d/0x17c [ 60.656151] PGD 4640067 PUD 46d4067 PMD 0 [ 60.656154] Oops: [1] SMP [ 60.656157] CPU 1 [ 60.656159] Modules linked in: sbp2 lp dvb_pll lgdt330x cx88_dvb cx88_vp3054_i2c videobuf_dvb tuner tea5767 td a8290 tuner_simple mt20xx cx88_alsa cx8802 cx8800 cx88xx ir_common tveeprom videobuf_dma_sg videobuf_core btcx_ri sc i2c_nforce2 evdev rtc forcedeth ehci_hcd fuse [ 60.656176] Pid: 4250, comm: hald-probe-stor Not tainted 2.6.24-rc0-5 #1 [ 60.656178] RIP: 0010:[80375553] [80375553] blk_rq_map_sg+0x10d/0x17c [ 60.656182] RSP: 0018:810004791930 EFLAGS: 00010246 [ 60.656184] RAX: 0403b000 RBX: 1000 RCX: 6db6db6db6db6db7 [ 60.656187] RDX: RSI: 81000100 RDI: 05701000 [ 60.656189] RBP: 810004791968 R08: 0570 R09: 8100044aa060 [ 60.656191] R10: R11: 8100050dea00 R12: 2000 [ 60.656193] R13: 8100060d2700 R14: R15: 807f [ 60.656196] FS: 2b5da088e6e0() GS:810003011500() knlGS: [ 60.656198] CS: 0010 DS: ES: CR0: 8005003b [ 60.656200] CR2: CR3: 04568000 CR4: 06e0 [ 60.656202] DR0: DR1: DR2: [ 60.656204] DR3: DR6: 0ff0 DR7: 0400 [ 60.656207] Process hald-probe-stor (pid: 4250, threadinfo 81000479, task 810006312000) [ 60.656208] Stack: 81000607a000 00010001 8100040fa120 807fe2c0 [ 60.656213] 81000607a000 81000607a000 807fe2c0 8100047919a8 [ 60.656217] 8041bb58 8100047919a8 8100040fa120 807fe2c0 [ 60.656220] Call Trace: [ 60.656226] [8041bb58] ide_map_sg+0x38/0xb0 [ 60.656231] [8042952b] cdrom_start_read_continuation+0x0/0xb5 [ 60.656234] [80423806] ide_build_sglist+0x38/0x88 [ 60.656238] [80423885] ide_build_dmatable+0x2f/0x172 [ 60.656241] [804239fc] ide_dma_setup+0x34/0xaa [ 60.656245] [804277e5] cdrom_start_packet_command+0x5a/0x177 [ 60.656249] [8037fac4] cfq_dispatch_insert+0x38/0x50 [ 60.656253] [80428339] ide_do_rw_cdrom+0x423/0x57c [ 60.656257] [8041c56c] ide_do_request+0x7a7/0xa74 [ 60.656263] [8023c097] del_timer+0x52/0x5d [ 60.656267] [8025d343] sync_page+0x0/0x45 [ 60.656269] [8041cba0] do_ide_request+0x1b/0x1d [ 60.656273] [803778a7] __generic_unplug_device+0x28/0x2c [ 60.656276] [80377c6e] generic_unplug_device+0x20/0x31 [ 60.656279] [803751b1] blk_backing_dev_unplug+0x16/0x18 [ 60.656283] [8029decc] block_sync_page+0x42/0x44 [ 60.656285] [8025d37f] sync_page+0x3c/0x45 [ 60.656290] [805589b8] __wait_on_bit_lock+0x42/0x79 [ 60.656294] [8025d32f] __lock_page+0x64/0x6b [ 60.656298] [8024664b] wake_bit_function+0x0/0x2a [ 60.656301] [8025da95] do_generic_mapping_read+0x1da/0x383 [ 60.656304] [8025d08d] file_read_actor+0x0/0x137 [ 60.656309] [8025f1af] generic_file_aio_read+0x11e/0x15d [ 60.656315] [8027ee59] do_sync_read+0xe2/0x126 [ 60.656318] [8026b15a] handle_mm_fault+0x62e/0x65e [ 60.656324] [80386fcc] __up_read+0x8f/0x97 [ 60.656327] [80246613] autoremove_wake_function+0x0/0x38 [ 60.656331] [80559233] __mutex_lock_slowpath+0x22f/0x23c [ 60.656337] [8027f5f0] vfs_read+0xab/0x134 [ 60.656341] [8027f9b5] sys_read+0x47/0x6f [ 60.656345] [8020b77e] system_call+0x7e/0x83 [ 60.656349] [ 60.656350] [ 60.656350] Code: 49 8b 02 41 c7 42 18 00 00 00 00 49 c7 42 10 00 00 00 00 83 [ 60.656359] RIP [80375553] blk_rq_map_sg+0x10d/0x17c [ 60.656362] RSP 810004791930 [ 60.656363] CR2: Platform is AMD64 and the userspace is Ubuntu 7/10 Gutsy Gibbon. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91
On Tue, Oct 23, 2007 at 07:46:37AM -0500, Florin Iucha wrote: [ 60.656136] Unable to handle kernel NULL pointer dereference at RIP: [ 60.656143] [80375553] blk_rq_map_sg+0x10d/0x17c [ 60.656151] PGD 4640067 PUD 46d4067 PMD 0 [ 60.656154] Oops: [1] SMP [ 60.656157] CPU 1 ... There was a DVD in the drive. After the OOPS, I cannot eject it via the button, and the eject command is stuck in D state: [ 436.308282] eject D 80571760 0 5336 5324 [ 436.308285] 810007c35d08 0082 810007c35ca8 [ 436.308288] 810006fb15f0 810003062000 810006fb17f8 0001 [ 436.308292] 0003 8100057e1070 [ 436.308295] Call Trace: [ 436.308301] [80559137] __mutex_lock_slowpath+0x133/0x23c [ 436.308306] [80559259] mutex_lock+0x19/0x1d [ 436.308309] [802a35a0] do_open+0x74/0x2d1 [ 436.308313] [802a3a02] blkdev_open+0x0/0x69 [ 436.308315] [802a3a39] blkdev_open+0x37/0x69 [ 436.308319] [8027d68e] __dentry_open+0xe6/0x1bd [ 436.308323] [8027d7fd] nameidata_to_filp+0x2d/0x3f [ 436.308326] [8027d848] do_filp_open+0x39/0x4b [ 436.308330] [8055a16d] _spin_unlock+0x9/0xb [ 436.308333] [8027d58d] get_unused_fd_flags+0x113/0x121 [ 436.308337] [8027d8ab] do_sys_open+0x51/0xd9 [ 436.308341] [8027d95c] sys_open+0x1b/0x1d [ 436.308343] [8020b77e] system_call+0x7e/0x83 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Thu, Aug 30, 2007 at 03:18:37PM -0700, Bret Towe wrote: > > uptime of 3 hours and keyboard is still working fine > > I'll hopefully get to test this on the mini tomorrow for at least 3 hours > > also > > got 45min on mini before I had to go elsewhere > the amd64 shutdown fine and has been up for more than 3 hours > I'd say the patch does it Yup. Same here. Many startups, shuthdowns and minutes of uptime, with no observations. Check it in! Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Thu, Aug 30, 2007 at 03:18:37PM -0700, Bret Towe wrote: uptime of 3 hours and keyboard is still working fine I'll hopefully get to test this on the mini tomorrow for at least 3 hours also got 45min on mini before I had to go elsewhere the amd64 shutdown fine and has been up for more than 3 hours I'd say the patch does it Yup. Same here. Many startups, shuthdowns and minutes of uptime, with no observations. Check it in! Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Tue, Aug 28, 2007 at 09:28:43AM -0400, Trond Myklebust wrote: > Doh! I see the problem: cancel_delayed_work_sync() shouldn't ever be > called recursively. > > The following patch should be correct. Please just discard the previous > one... So far so good. This patch got one hour uptime... I'll stay with this kernel for a few days, to keep an eye on it. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Tue, Aug 28, 2007 at 09:28:43AM -0400, Trond Myklebust wrote: Doh! I see the problem: cancel_delayed_work_sync() shouldn't ever be called recursively. The following patch should be correct. Please just discard the previous one... So far so good. This patch got one hour uptime... I'll stay with this kernel for a few days, to keep an eye on it. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Mon, Aug 27, 2007 at 06:19:29PM -0700, Bret Towe wrote: > On 8/27/07, Trond Myklebust <[EMAIL PROTECTED]> wrote: > > > > this sounds alot like the post i did yesterday titled 'nfs4 hang > > > > regression' > > > > i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6 > > > > > > Yes, it certainly does -- all the symptoms match! > > > > Could you and Bret please check if the attached patch fixes the hang? > > no good for me still hangs after ~30minutes I just booted into the new kernel (3d39c691ff486142dd9aaeac12f553f4476b7a6 + Trond's patch) and it hangs in 10-15 minutes. Process traces available at http://iucha.net/nfs/23-rc2-nfs-fix-1/kernel.log.gz Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again
On Mon, Aug 27, 2007 at 06:19:29PM -0700, Bret Towe wrote: On 8/27/07, Trond Myklebust [EMAIL PROTECTED] wrote: this sounds alot like the post i did yesterday titled 'nfs4 hang regression' i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6 Yes, it certainly does -- all the symptoms match! Could you and Bret please check if the attached patch fixes the hang? no good for me still hangs after ~30minutes I just booted into the new kernel (3d39c691ff486142dd9aaeac12f553f4476b7a6 + Trond's patch) and it hangs in 10-15 minutes. Process traces available at http://iucha.net/nfs/23-rc2-nfs-fix-1/kernel.log.gz Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Thu, Aug 23, 2007 at 10:14:38AM -0700, Bret Towe wrote: > this sounds alot like the post i did yesterday titled 'nfs4 hang regression' > i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6 Yes, it certainly does -- all the symptoms match! I'm not [alone in] seeing dead keyboards! Now, if only somebody could clarify to me the connection between the bad NFS4 shooting the keyboard but not the mouse, that would be wonderful. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
Trond, Fess up... I'm closing in: http://iucha.net/2.6.23-rc3/2.6.23-rc-bisect.png [Dropping Jiri and linux-usb-devel from future postings. You are included now just for communicating the conclusion of this thread.] On Wed, Aug 22, 2007 at 08:22:00AM -0500, Florin Iucha wrote: > On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote: > > On Tue, 21 Aug 2007, Florin Iucha wrote: > > > > > There is another interesting angle to this: in the past, every time I > > > had keyboard problems, it used to be caused by the VFS and/or NFS... > > > after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, > > > Alan!). Now, after the keyboard "locked up", I used the mouse to close > > > the gnome session, then I logged-in remotely to reboot. The reboot > > > process locked up and I need to use the reset button! The second time > > > the keyboard "locked up" I listed my processes, and I noticed that I had > > > a couple of bash processes and a ssh process in "D" state. Something is > > > fishy again in the VFS ;) > > > > Yes, there were some NFS updates in between -rc2 and > > 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious > > what are you going to find by bisect, please let us know. > > > > I added Trond to CC, full thread to be found at > > http://lkml.org/lkml/2007/8/21/151 for reference. > > > > Florin, it also might be useful to capture the states of stuck processess > > via alt-sysrq-T (or better by echo t > /proc/sysrq-trigger), so that we > > know better where are they stuck. > > This morning it took a bit longer to hang, but it happened. The > backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz . > > I'll try a bisect session this weekend. [EMAIL PROTECTED] $ git bisect bad Bisecting: 5 revisions left to test after this [EMAIL PROTECTED] $ git bisect log git-bisect start # good: [d4ac2477fad0f2680e84ec12e387ce67682c5c13] Linux 2.6.23-rc2 git-bisect good d4ac2477fad0f2680e84ec12e387ce67682c5c13 # bad: [28e8351ac22de25034e048c680014ad824323c65] Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes git-bisect bad 28e8351ac22de25034e048c680014ad824323c65 # bad: [8f2ea1fd3f97ab7a809e939b5b9005a16f862439] [POWERPC] Fix initialization and usage of dma_mask git-bisect bad 8f2ea1fd3f97ab7a809e939b5b9005a16f862439 # good: [ff95f3df54609d9d4b9572f8a67d09922a645043] sched: remove the 'u64 now' parameter from pick_next_task() git-bisect good ff95f3df54609d9d4b9572f8a67d09922a645043 # good: [be12014dd7750648fde33e1e45cac24dc9a8be6d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched git-bisect good be12014dd7750648fde33e1e45cac24dc9a8be6d # good: [6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7] hexdump: use const notation git-bisect good 6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7 # bad: [6adb31c90c47262c8a25bf5097de9b3426caf3ae] remove dubious legal statment from uio-howto git-bisect bad 6adb31c90c47262c8a25bf5097de9b3426caf3ae florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
Trond, Fess up... I'm closing in: http://iucha.net/2.6.23-rc3/2.6.23-rc-bisect.png [Dropping Jiri and linux-usb-devel from future postings. You are included now just for communicating the conclusion of this thread.] On Wed, Aug 22, 2007 at 08:22:00AM -0500, Florin Iucha wrote: On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote: On Tue, 21 Aug 2007, Florin Iucha wrote: There is another interesting angle to this: in the past, every time I had keyboard problems, it used to be caused by the VFS and/or NFS... after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, Alan!). Now, after the keyboard locked up, I used the mouse to close the gnome session, then I logged-in remotely to reboot. The reboot process locked up and I need to use the reset button! The second time the keyboard locked up I listed my processes, and I noticed that I had a couple of bash processes and a ssh process in D state. Something is fishy again in the VFS ;) Yes, there were some NFS updates in between -rc2 and 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious what are you going to find by bisect, please let us know. I added Trond to CC, full thread to be found at http://lkml.org/lkml/2007/8/21/151 for reference. Florin, it also might be useful to capture the states of stuck processess via alt-sysrq-T (or better by echo t /proc/sysrq-trigger), so that we know better where are they stuck. This morning it took a bit longer to hang, but it happened. The backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz . I'll try a bisect session this weekend. [EMAIL PROTECTED] $ git bisect bad Bisecting: 5 revisions left to test after this [EMAIL PROTECTED] $ git bisect log git-bisect start # good: [d4ac2477fad0f2680e84ec12e387ce67682c5c13] Linux 2.6.23-rc2 git-bisect good d4ac2477fad0f2680e84ec12e387ce67682c5c13 # bad: [28e8351ac22de25034e048c680014ad824323c65] Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes git-bisect bad 28e8351ac22de25034e048c680014ad824323c65 # bad: [8f2ea1fd3f97ab7a809e939b5b9005a16f862439] [POWERPC] Fix initialization and usage of dma_mask git-bisect bad 8f2ea1fd3f97ab7a809e939b5b9005a16f862439 # good: [ff95f3df54609d9d4b9572f8a67d09922a645043] sched: remove the 'u64 now' parameter from pick_next_task() git-bisect good ff95f3df54609d9d4b9572f8a67d09922a645043 # good: [be12014dd7750648fde33e1e45cac24dc9a8be6d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched git-bisect good be12014dd7750648fde33e1e45cac24dc9a8be6d # good: [6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7] hexdump: use const notation git-bisect good 6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7 # bad: [6adb31c90c47262c8a25bf5097de9b3426caf3ae] remove dubious legal statment from uio-howto git-bisect bad 6adb31c90c47262c8a25bf5097de9b3426caf3ae florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Thu, Aug 23, 2007 at 10:14:38AM -0700, Bret Towe wrote: this sounds alot like the post i did yesterday titled 'nfs4 hang regression' i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6 Yes, it certainly does -- all the symptoms match! I'm not [alone in] seeing dead keyboards! Now, if only somebody could clarify to me the connection between the bad NFS4 shooting the keyboard but not the mouse, that would be wonderful. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote: > On Tue, 21 Aug 2007, Florin Iucha wrote: > > > There is another interesting angle to this: in the past, every time I > > had keyboard problems, it used to be caused by the VFS and/or NFS... > > after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, > > Alan!). Now, after the keyboard "locked up", I used the mouse to close > > the gnome session, then I logged-in remotely to reboot. The reboot > > process locked up and I need to use the reset button! The second time > > the keyboard "locked up" I listed my processes, and I noticed that I had > > a couple of bash processes and a ssh process in "D" state. Something is > > fishy again in the VFS ;) > > Yes, there were some NFS updates in between -rc2 and > 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious > what are you going to find by bisect, please let us know. > > I added Trond to CC, full thread to be found at > http://lkml.org/lkml/2007/8/21/151 for reference. > > Florin, it also might be useful to capture the states of stuck processess > via alt-sysrq-T (or better by echo t > /proc/sysrq-trigger), so that we > know better where are they stuck. This morning it took a bit longer to hang, but it happened. The backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz . I'll try a bisect session this weekend. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote: On Tue, 21 Aug 2007, Florin Iucha wrote: There is another interesting angle to this: in the past, every time I had keyboard problems, it used to be caused by the VFS and/or NFS... after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, Alan!). Now, after the keyboard locked up, I used the mouse to close the gnome session, then I logged-in remotely to reboot. The reboot process locked up and I need to use the reset button! The second time the keyboard locked up I listed my processes, and I noticed that I had a couple of bash processes and a ssh process in D state. Something is fishy again in the VFS ;) Yes, there were some NFS updates in between -rc2 and 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious what are you going to find by bisect, please let us know. I added Trond to CC, full thread to be found at http://lkml.org/lkml/2007/8/21/151 for reference. Florin, it also might be useful to capture the states of stuck processess via alt-sysrq-T (or better by echo t /proc/sysrq-trigger), so that we know better where are they stuck. This morning it took a bit longer to hang, but it happened. The backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz . I'll try a bisect session this weekend. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 08:17:59AM -0500, Florin Iucha wrote: > On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote: > > > I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and > > > 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if > > > the keyboard/usb behaves or not. > > > > Thanks. If this doesn't give us any hint, it would be useful if you could > > do git-bisect between rc2 and rc3, I really can't immediately see anything > > in the list of commits that might directly cause the behavior you are > > seeing (most importantly because there were no USB and no HID updates in > > this window). > > The keyboard still locked up. There is absolutely nothing in the > kernel log. > > > There are approximately 290 commits, so it shouldn't require more than 9 > > reboots plus the time needed to check whether the bug triggers or not. > > The top commit is not v2.6.23-rc3 but > >commit 28e8351ac22de25034e048c680014ad824323c65 >Merge: 3b993e8... d18c4d6... >Author: Linus Torvalds <[EMAIL PROTECTED]> >Date: Tue Aug 14 10:00:29 2007 -0700 > >Merge > git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes > > I'll try to make time to bisect it... There is another interesting angle to this: in the past, every time I had keyboard problems, it used to be caused by the VFS and/or NFS... after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, Alan!). Now, after the keyboard "locked up", I used the mouse to close the gnome session, then I logged-in remotely to reboot. The reboot process locked up and I need to use the reset button! The second time the keyboard "locked up" I listed my processes, and I noticed that I had a couple of bash processes and a ssh process in "D" state. Something is fishy again in the VFS ;) florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote: > > I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and > > 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if > > the keyboard/usb behaves or not. > > Thanks. If this doesn't give us any hint, it would be useful if you could > do git-bisect between rc2 and rc3, I really can't immediately see anything > in the list of commits that might directly cause the behavior you are > seeing (most importantly because there were no USB and no HID updates in > this window). The keyboard still locked up. There is absolutely nothing in the kernel log. > There are approximately 290 commits, so it shouldn't require more than 9 > reboots plus the time needed to check whether the bug triggers or not. The top commit is not v2.6.23-rc3 but commit 28e8351ac22de25034e048c680014ad824323c65 Merge: 3b993e8... d18c4d6... Author: Linus Torvalds <[EMAIL PROTECTED]> Date: Tue Aug 14 10:00:29 2007 -0700 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes I'll try to make time to bisect it... Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 06:51:15AM -0500, Florin Iucha wrote: > On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote: > > On Wed, 15 Aug 2007, Florin Iucha wrote: > > > > > [See my message to Alan]: It happened twice, within 15 minutes of > > > boot+login, with 2.6.23-rc3-$whatever . I does not happen with > > > 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and > > > it is working for an hour now. > > > > It is not immediately clear what might be causing this, 2.6.23-rc3 didn't > > get any USB nor HID updates at all compared to 2.6.23-rc2. > > > > Could you please enable USB and HID debugging to see whether we can see > > anything spurious in the logs at the time the keyboard gets stuck? > > Jiri, > > I have enabled USB debugging and I see a bunch (=46) of these messages: > >[ $timestamp] usb 1-9: usb auto-suspend >[ $timestamp] usb 1-9: usb auto-resume >[ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER > sig=se0 PE CONNECT >[ $timestamp] usb 1-9: finish resume > > The messages continued to be logged, even after the keyboard has > become unresponsive. [snip] > Do you need me to build a -rc2 with USB debug enabled to compare and > contrast? With 2.6.23-rc2 and USB_DEBUG enabled, I see the same messages but no keyboard "dissapearance". I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if the keyboard/usb behaves or not. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 02:04:26PM +0200, Jiri Kosina wrote: > > I have enabled USB debugging and I see a bunch (=46) of these messages: > > >[ $timestamp] usb 1-9: usb auto-suspend > >[ $timestamp] usb 1-9: usb auto-resume > >[ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 > > POWER sig=se0 PE CONNECT > >[ $timestamp] usb 1-9: finish resume > > The messages continued to be logged, even after the keyboard has > > become unresponsive. > > I guess that this is the card reader being suspended and resumed > afterwards. Do you by any chance see any improvement when you > > - rmmod ehci_hcd It's built-in. Should I build it as a module? This machine has only usb 2.0 ports. If I rmmod it, will my USB keyboard still work? [The card reader is one of those that fit into a 3.5" bay, connected straight to the motherboard controller, so it's a bit of a pain to disconnect.] > - disable USB_AUTOSUSPEND You mean CONFIG_USB_SUSPEND? florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote: > On Wed, 15 Aug 2007, Florin Iucha wrote: > > > [See my message to Alan]: It happened twice, within 15 minutes of > > boot+login, with 2.6.23-rc3-$whatever . I does not happen with > > 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and > > it is working for an hour now. > > It is not immediately clear what might be causing this, 2.6.23-rc3 didn't > get any USB nor HID updates at all compared to 2.6.23-rc2. > > Could you please enable USB and HID debugging to see whether we can see > anything spurious in the logs at the time the keyboard gets stuck? Jiri, I have enabled USB debugging and I see a bunch (=46) of these messages: [ $timestamp] usb 1-9: usb auto-suspend [ $timestamp] usb 1-9: usb auto-resume [ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER sig=se0 PE CONNECT [ $timestamp] usb 1-9: finish resume The messages continued to be logged, even after the keyboard has become unresponsive. The entire kernel log is at http://iucha.net/usb/log-2.6.23-rc3-2 . The dump of /proc/bus/usb/devices is at http://iucha.net/usb/devices . The output of 'lsusb -t' is at http://iucha.net/usb/lsusb-t . Plain lsusb is not working. The version of usbutils is '0.72-7ubuntu2' . Do you need me to build a -rc2 with USB debug enabled to compare and contrast? Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote: On Wed, 15 Aug 2007, Florin Iucha wrote: [See my message to Alan]: It happened twice, within 15 minutes of boot+login, with 2.6.23-rc3-$whatever . I does not happen with 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and it is working for an hour now. It is not immediately clear what might be causing this, 2.6.23-rc3 didn't get any USB nor HID updates at all compared to 2.6.23-rc2. Could you please enable USB and HID debugging to see whether we can see anything spurious in the logs at the time the keyboard gets stuck? Jiri, I have enabled USB debugging and I see a bunch (=46) of these messages: [ $timestamp] usb 1-9: usb auto-suspend [ $timestamp] usb 1-9: usb auto-resume [ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER sig=se0 PE CONNECT [ $timestamp] usb 1-9: finish resume The messages continued to be logged, even after the keyboard has become unresponsive. The entire kernel log is at http://iucha.net/usb/log-2.6.23-rc3-2 . The dump of /proc/bus/usb/devices is at http://iucha.net/usb/devices . The output of 'lsusb -t' is at http://iucha.net/usb/lsusb-t . Plain lsusb is not working. The version of usbutils is '0.72-7ubuntu2' . Do you need me to build a -rc2 with USB debug enabled to compare and contrast? Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 06:51:15AM -0500, Florin Iucha wrote: On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote: On Wed, 15 Aug 2007, Florin Iucha wrote: [See my message to Alan]: It happened twice, within 15 minutes of boot+login, with 2.6.23-rc3-$whatever . I does not happen with 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and it is working for an hour now. It is not immediately clear what might be causing this, 2.6.23-rc3 didn't get any USB nor HID updates at all compared to 2.6.23-rc2. Could you please enable USB and HID debugging to see whether we can see anything spurious in the logs at the time the keyboard gets stuck? Jiri, I have enabled USB debugging and I see a bunch (=46) of these messages: [ $timestamp] usb 1-9: usb auto-suspend [ $timestamp] usb 1-9: usb auto-resume [ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER sig=se0 PE CONNECT [ $timestamp] usb 1-9: finish resume The messages continued to be logged, even after the keyboard has become unresponsive. [snip] Do you need me to build a -rc2 with USB debug enabled to compare and contrast? With 2.6.23-rc2 and USB_DEBUG enabled, I see the same messages but no keyboard dissapearance. I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if the keyboard/usb behaves or not. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 02:04:26PM +0200, Jiri Kosina wrote: I have enabled USB debugging and I see a bunch (=46) of these messages: [ $timestamp] usb 1-9: usb auto-suspend [ $timestamp] usb 1-9: usb auto-resume [ $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER sig=se0 PE CONNECT [ $timestamp] usb 1-9: finish resume The messages continued to be logged, even after the keyboard has become unresponsive. I guess that this is the card reader being suspended and resumed afterwards. Do you by any chance see any improvement when you - rmmod ehci_hcd It's built-in. Should I build it as a module? This machine has only usb 2.0 ports. If I rmmod it, will my USB keyboard still work? [The card reader is one of those that fit into a 3.5 bay, connected straight to the motherboard controller, so it's a bit of a pain to disconnect.] - disable USB_AUTOSUSPEND You mean CONFIG_USB_SUSPEND? florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote: I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if the keyboard/usb behaves or not. Thanks. If this doesn't give us any hint, it would be useful if you could do git-bisect between rc2 and rc3, I really can't immediately see anything in the list of commits that might directly cause the behavior you are seeing (most importantly because there were no USB and no HID updates in this window). The keyboard still locked up. There is absolutely nothing in the kernel log. There are approximately 290 commits, so it shouldn't require more than 9 reboots plus the time needed to check whether the bug triggers or not. The top commit is not v2.6.23-rc3 but commit 28e8351ac22de25034e048c680014ad824323c65 Merge: 3b993e8... d18c4d6... Author: Linus Torvalds [EMAIL PROTECTED] Date: Tue Aug 14 10:00:29 2007 -0700 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes I'll try to make time to bisect it... Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Tue, Aug 21, 2007 at 08:17:59AM -0500, Florin Iucha wrote: On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote: I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if the keyboard/usb behaves or not. Thanks. If this doesn't give us any hint, it would be useful if you could do git-bisect between rc2 and rc3, I really can't immediately see anything in the list of commits that might directly cause the behavior you are seeing (most importantly because there were no USB and no HID updates in this window). The keyboard still locked up. There is absolutely nothing in the kernel log. There are approximately 290 commits, so it shouldn't require more than 9 reboots plus the time needed to check whether the bug triggers or not. The top commit is not v2.6.23-rc3 but commit 28e8351ac22de25034e048c680014ad824323c65 Merge: 3b993e8... d18c4d6... Author: Linus Torvalds [EMAIL PROTECTED] Date: Tue Aug 14 10:00:29 2007 -0700 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes I'll try to make time to bisect it... There is another interesting angle to this: in the past, every time I had keyboard problems, it used to be caused by the VFS and/or NFS... after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, Alan!). Now, after the keyboard locked up, I used the mouse to close the gnome session, then I logged-in remotely to reboot. The reboot process locked up and I need to use the reset button! The second time the keyboard locked up I listed my processes, and I noticed that I had a couple of bash processes and a ssh process in D state. Something is fishy again in the VFS ;) florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 04:49:02PM +0200, Jiri Kosina wrote: > On Wed, 15 Aug 2007, Florin Iucha wrote: > > > Today my USB keyboard stopped working in the middle of composing and > > e-mail. I unplugged it and plugged it back, with no success. I > > logged in remotely and found this lovely message: > > The error message seems unrelated to your keyboard becoming dead. Yes, it was related to me unplugging it in the hopes that a re-plug will make it work again ;) > > I am testing each rcX kernel, and I did not see this problem so far. > > Smells like a new regression. > > Is that reproducible, or did it happen just once? Any error message > present in log prior to that sysfs dump please? [See my message to Alan]: It happened twice, within 15 minutes of boot+login, with 2.6.23-rc3-$whatever . I does not happen with 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and it is working for an hour now. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 10:38:54AM -0400, Alan Stern wrote: > This patch will get rid of the annoying error messages. It won't do > anything about your keyboard's tendency to spontaneously stop working, > alas. My keyboard works fine for days, with kernels up to and including 2.6.23-rc2 . I have booted into 2.6.23-rc3-$whatever this morning, and after 10-15 minutes the keyboard stopped working. The mice which were plugged in the keyboard's built-in hub were fine though. The first time it happened, I removed the keyboard and got the oops that started this thread. The second time, I just logged-in remotely and rebooted, and the reboot process stopped at "KILLING all processes" step. I simply reset the box and rebooted into 2.6.23-rc2 and it is fine since (over an hour ago). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
Today my USB keyboard stopped working in the middle of composing and e-mail. I unplugged it and plugged it back, with no success. I logged in remotely and found this lovely message: [ 1301.567351] usb 1-4: USB disconnect, address 3 [ 1301.567356] usb 1-4.2: USB disconnect, address 5 [ 1301.567599] sysfs_remove_bin_file: bad dentry or inode or no such file: "descriptors" [ 1301.567604] [ 1301.567605] Call Trace: [ 1301.567614] [] sysfs_remove_bin_file+0x39/0x3d [ 1301.567619] [] device_remove_bin_file+0x15/0x17 [ 1301.567623] [] usb_remove_sysfs_dev_files+0x89/0x9d [ 1301.567627] [] generic_disconnect+0x2e/0x32 [ 1301.567630] [] usb_unbind_device+0x15/0x19 [ 1301.567634] [] __device_release_driver+0x93/0xb3 [ 1301.567637] [] device_release_driver+0x31/0x49 [ 1301.567640] [] bus_remove_device+0x76/0x87 [ 1301.567644] [] device_del+0x216/0x297 [ 1301.567648] [] usb_disconnect+0xc8/0x151 [ 1301.567651] [] usb_disconnect+0x8f/0x151 [ 1301.567655] [] hub_thread+0x442/0xc47 [ 1301.567659] [] _spin_unlock_irq+0x9/0xc [ 1301.567664] [] autoremove_wake_function+0x0/0x38 [ 1301.567668] [] hub_thread+0x0/0xc47 [ 1301.567671] [] kthread+0x49/0x76 [ 1301.567674] [] child_rip+0xa/0x12 [ 1301.567679] [] kthread+0x0/0x76 [ 1301.567682] [] child_rip+0x0/0x12 [ 1301.567684] I have rebooted, and while composing this message, I thought useful to include the output from 'lsusb'. Funny enough, lsusb does not list any devices. A 'cat /proc/bus/usb/devices' yields the following: T: Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#= 1 Spd=12 MxCh=10 B: Alloc= 25/900 us ( 3%), #Int= 5, #Iso= 0 D: Ver= 1.10 Cls=09(hub ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor= ProdID= Rev= 2.06 S: Manufacturer=Linux 2.6.23-rc3-1 ohci_hcd S: Product=OHCI Host Controller S: SerialNumber=:00:02.0 C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr= 0mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 2 Ivl=255ms T: Bus=02 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#= 2 Spd=12 MxCh= 4 D: Ver= 1.10 Cls=09(hub ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05f3 ProdID=0081 Rev= 3.10 S: Manufacturer=PI Engineering S: Product=Kinesis Keyboard Hub C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 50mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=255ms T: Bus=02 Lev=02 Prnt=02 Port=01 Cnt=01 Dev#= 3 Spd=12 MxCh= 0 D: Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05f3 ProdID=0007 Rev= 3.10 C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr= 64mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=01 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 8 Ivl=8ms I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid E: Ad=82(I) Atr=03(Int.) MxPS= 4 Ivl=8ms T: Bus=02 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#= 4 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05bc ProdID=0102 Rev= 2.00 S: Manufacturer=Forward S: Product=USB Optical Mouse C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=10ms T: Bus=02 Lev=02 Prnt=02 Port=03 Cnt=03 Dev#= 5 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=045e ProdID=0039 Rev= 1.21 S: Manufacturer=Microsoft S: Product=Microsoft IntelliMouse® Optical C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=10ms T: Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#= 1 Spd=480 MxCh=10 B: Alloc= 0/800 us ( 0%), #Int= 1, #Iso= 0 D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor= ProdID= Rev= 2.06 S: Manufacturer=Linux 2.6.23-rc3-1 ehci_hcd S: Product=EHCI Host Controller S: SerialNumber=:00:02.1 C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr= 0mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=256ms T: Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 3 Spd=480 MxCh= 4 D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=0409 ProdID=0058 Rev= 1.00 S: Manufacturer=NEC Corporation S: Product=USB2.0 Hub Controller C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=256ms T: Bus=01 Lev=02 Prnt=03 Port=01 Cnt=01 Dev#= 5 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=0fe9 ProdID=9010 Rev= 1.00 S: Manufacturer=DVICO S: Product=DVICO USB HID Remocon V1.00 C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 2 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 3 Ivl=32ms E: Ad=02(O) Atr=03(Int.) MxPS= 1 Ivl=32ms T: Bus=01 Lev=01 Prnt=01 Port=08 Cnt=02 Dev#= 4 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=07cc ProdID=0501
USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
Today my USB keyboard stopped working in the middle of composing and e-mail. I unplugged it and plugged it back, with no success. I logged in remotely and found this lovely message: [ 1301.567351] usb 1-4: USB disconnect, address 3 [ 1301.567356] usb 1-4.2: USB disconnect, address 5 [ 1301.567599] sysfs_remove_bin_file: bad dentry or inode or no such file: descriptors [ 1301.567604] [ 1301.567605] Call Trace: [ 1301.567614] [802b89a6] sysfs_remove_bin_file+0x39/0x3d [ 1301.567619] [803f1d24] device_remove_bin_file+0x15/0x17 [ 1301.567623] [8045ea0d] usb_remove_sysfs_dev_files+0x89/0x9d [ 1301.567627] [804625cc] generic_disconnect+0x2e/0x32 [ 1301.567630] [8045b58d] usb_unbind_device+0x15/0x19 [ 1301.567634] [803f4161] __device_release_driver+0x93/0xb3 [ 1301.567637] [803f45af] device_release_driver+0x31/0x49 [ 1301.567640] [803f39f1] bus_remove_device+0x76/0x87 [ 1301.567644] [803f2014] device_del+0x216/0x297 [ 1301.567648] [80455c8f] usb_disconnect+0xc8/0x151 [ 1301.567651] [80455c56] usb_disconnect+0x8f/0x151 [ 1301.567655] [804564c8] hub_thread+0x442/0xc47 [ 1301.567659] [80553cdb] _spin_unlock_irq+0x9/0xc [ 1301.567664] [80244ad7] autoremove_wake_function+0x0/0x38 [ 1301.567668] [80456086] hub_thread+0x0/0xc47 [ 1301.567671] [802449cb] kthread+0x49/0x76 [ 1301.567674] [8020c618] child_rip+0xa/0x12 [ 1301.567679] [80244982] kthread+0x0/0x76 [ 1301.567682] [8020c60e] child_rip+0x0/0x12 [ 1301.567684] I have rebooted, and while composing this message, I thought useful to include the output from 'lsusb'. Funny enough, lsusb does not list any devices. A 'cat /proc/bus/usb/devices' yields the following: T: Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#= 1 Spd=12 MxCh=10 B: Alloc= 25/900 us ( 3%), #Int= 5, #Iso= 0 D: Ver= 1.10 Cls=09(hub ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor= ProdID= Rev= 2.06 S: Manufacturer=Linux 2.6.23-rc3-1 ohci_hcd S: Product=OHCI Host Controller S: SerialNumber=:00:02.0 C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr= 0mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 2 Ivl=255ms T: Bus=02 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#= 2 Spd=12 MxCh= 4 D: Ver= 1.10 Cls=09(hub ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05f3 ProdID=0081 Rev= 3.10 S: Manufacturer=PI Engineering S: Product=Kinesis Keyboard Hub C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 50mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=255ms T: Bus=02 Lev=02 Prnt=02 Port=01 Cnt=01 Dev#= 3 Spd=12 MxCh= 0 D: Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05f3 ProdID=0007 Rev= 3.10 C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr= 64mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=01 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 8 Ivl=8ms I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=00 Prot=00 Driver=usbhid E: Ad=82(I) Atr=03(Int.) MxPS= 4 Ivl=8ms T: Bus=02 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#= 4 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=05bc ProdID=0102 Rev= 2.00 S: Manufacturer=Forward S: Product=USB Optical Mouse C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=10ms T: Bus=02 Lev=02 Prnt=02 Port=03 Cnt=03 Dev#= 5 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=045e ProdID=0039 Rev= 1.21 S: Manufacturer=Microsoft S: Product=Microsoft IntelliMouse® Optical C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=usbhid E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=10ms T: Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#= 1 Spd=480 MxCh=10 B: Alloc= 0/800 us ( 0%), #Int= 1, #Iso= 0 D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor= ProdID= Rev= 2.06 S: Manufacturer=Linux 2.6.23-rc3-1 ehci_hcd S: Product=EHCI Host Controller S: SerialNumber=:00:02.1 C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr= 0mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 4 Ivl=256ms T: Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 3 Spd=480 MxCh= 4 D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=0409 ProdID=0058 Rev= 1.00 S: Manufacturer=NEC Corporation S: Product=USB2.0 Hub Controller C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=256ms T: Bus=01 Lev=02 Prnt=03 Port=01 Cnt=01 Dev#= 5 Spd=1.5 MxCh= 0 D: Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=0fe9 ProdID=9010 Rev= 1.00 S: Manufacturer=DVICO S: Product=DVICO USB HID Remocon V1.00 C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA I:* If#= 0 Alt= 0
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 10:38:54AM -0400, Alan Stern wrote: This patch will get rid of the annoying error messages. It won't do anything about your keyboard's tendency to spontaneously stop working, alas. My keyboard works fine for days, with kernels up to and including 2.6.23-rc2 . I have booted into 2.6.23-rc3-$whatever this morning, and after 10-15 minutes the keyboard stopped working. The mice which were plugged in the keyboard's built-in hub were fine though. The first time it happened, I removed the keyboard and got the oops that started this thread. The second time, I just logged-in remotely and rebooted, and the reboot process stopped at KILLING all processes step. I simply reset the box and rebooted into 2.6.23-rc2 and it is fine since (over an hour ago). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351
On Wed, Aug 15, 2007 at 04:49:02PM +0200, Jiri Kosina wrote: On Wed, 15 Aug 2007, Florin Iucha wrote: Today my USB keyboard stopped working in the middle of composing and e-mail. I unplugged it and plugged it back, with no success. I logged in remotely and found this lovely message: The error message seems unrelated to your keyboard becoming dead. Yes, it was related to me unplugging it in the hopes that a re-plug will make it work again ;) I am testing each rcX kernel, and I did not see this problem so far. Smells like a new regression. Is that reproducible, or did it happen just once? Any error message present in log prior to that sysfs dump please? [See my message to Alan]: It happened twice, within 15 minutes of boot+login, with 2.6.23-rc3-$whatever . I does not happen with 2.6.2[123](-rc*)? After the two incidents, I rebooted in 2.6.23-rc2 and it is working for an hour now. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
"upping" a semaphore from interrupt context?
Hello, I am writing a USB driver for some custom hardware, and I need to synchronize between the user-space and the USB subsystem. Can I create a semaphore and "down" it in the reader then "up" it in the completion handler? I know the completion handler runs in interrupt context so you are not allowed to acquire any semaphores: but can you release them? Will the waiting tasks wake up after the handler and its caller returned - IOW will the waking up task run in interrupt context as well? This is with Linux 2.4 (if it makes a difference). Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
upping a semaphore from interrupt context?
Hello, I am writing a USB driver for some custom hardware, and I need to synchronize between the user-space and the USB subsystem. Can I create a semaphore and down it in the reader then up it in the completion handler? I know the completion handler runs in interrupt context so you are not allowed to acquire any semaphores: but can you release them? Will the waiting tasks wake up after the handler and its caller returned - IOW will the waking up task run in interrupt context as well? This is with Linux 2.4 (if it makes a difference). Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: spurious completions during NCQ?
On Fri, Jun 08, 2007 at 09:55:58PM +0900, Tejun Heo wrote: > Florin Iucha wrote: > >> It means the drive reported command tags were completed that were not > >> outstanding. What kind of drive is this? > > > > [ 29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors > > = 156 > > 301488 > > [ 29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max > > UDMA/100 > > [ 29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32) > > [ 29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors > > = 156 > > 301488 > > Please report the result of 'hdparm -I /dev/sda'. We'll probably have > to blacklist it for NCQ. Here it is: I only jumbled the serial number a bit. /dev/sda: ATA device, with non-removable media Model Number: HITACHI HTS541680J9SA00 Serial Number: SB**I57L4A Firmware Revision: SB2IC7EP Standards: Used: ATA/ATAPI-7 T13 1532D revision 1 Supported: 7 6 5 4 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBAuser addressable sectors: 156301488 LBA48 user addressable sectors: 156301488 device size with M = 1024*1024: 76319 MBytes device size with M = 1000*1000: 80026 MBytes (80 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Vendor, no device specific minimum R/W multiple sector transfer: Max = 16 Current = 16 Advanced power management level: 128 (0x80) Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: *SMART feature set Security Mode feature set *Power Management feature set *Write cache *Look-ahead *Host Protected Area feature set *WRITE_BUFFER command *READ_BUFFER command *NOP cmd *DOWNLOAD_MICROCODE *Advanced Power Management feature set SET_MAX security extension Automatic Acoustic Management feature set *48-bit Address feature set *Device Configuration Overlay feature set *Mandatory FLUSH_CACHE *FLUSH_CACHE_EXT *SMART error logging *SMART self-test *General Purpose Logging feature set *WRITE_{DMA|MULTIPLE}_FUA_EXT *64-bit World wide name *IDLE_IMMEDIATE with UNLOAD *SATA-I signaling speed (1.5Gb/s) *Native Command Queueing (NCQ) *Host-initiated interface power management *Phy event counters Non-Zero buffer offsets in DMA Setup FIS DMA Setup Auto-Activate optimization Device-initiated interface power management In-order data delivery *Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked frozen not expired: security count not supported: enhanced erase 42min for SECURITY ERASE UNIT. Checksum: correct florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: spurious completions during NCQ?
On Fri, Jun 08, 2007 at 09:55:58PM +0900, Tejun Heo wrote: Florin Iucha wrote: It means the drive reported command tags were completed that were not outstanding. What kind of drive is this? [ 29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 [ 29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max UDMA/100 [ 29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32) [ 29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 Please report the result of 'hdparm -I /dev/sda'. We'll probably have to blacklist it for NCQ. Here it is: I only jumbled the serial number a bit. /dev/sda: ATA device, with non-removable media Model Number: HITACHI HTS541680J9SA00 Serial Number: SB**I57L4A Firmware Revision: SB2IC7EP Standards: Used: ATA/ATAPI-7 T13 1532D revision 1 Supported: 7 6 5 4 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBAuser addressable sectors: 156301488 LBA48 user addressable sectors: 156301488 device size with M = 1024*1024: 76319 MBytes device size with M = 1000*1000: 80026 MBytes (80 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Vendor, no device specific minimum R/W multiple sector transfer: Max = 16 Current = 16 Advanced power management level: 128 (0x80) Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: *SMART feature set Security Mode feature set *Power Management feature set *Write cache *Look-ahead *Host Protected Area feature set *WRITE_BUFFER command *READ_BUFFER command *NOP cmd *DOWNLOAD_MICROCODE *Advanced Power Management feature set SET_MAX security extension Automatic Acoustic Management feature set *48-bit Address feature set *Device Configuration Overlay feature set *Mandatory FLUSH_CACHE *FLUSH_CACHE_EXT *SMART error logging *SMART self-test *General Purpose Logging feature set *WRITE_{DMA|MULTIPLE}_FUA_EXT *64-bit World wide name *IDLE_IMMEDIATE with UNLOAD *SATA-I signaling speed (1.5Gb/s) *Native Command Queueing (NCQ) *Host-initiated interface power management *Phy event counters Non-Zero buffer offsets in DMA Setup FIS DMA Setup Auto-Activate optimization Device-initiated interface power management In-order data delivery *Software settings preservation Security: Master password revision code = 65534 supported not enabled not locked frozen not expired: security count not supported: enhanced erase 42min for SECURITY ERASE UNIT. Checksum: correct florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: spurious completions during NCQ?
On Wed, Jun 06, 2007 at 08:28:07AM -0600, Robert Hancock wrote: > >This is on a Thinkpad T60 with 2 GB RAM, running Ubuntu 7.04 (kernel > >2.6.20-16-generic). No proprietary drivers (ok, maybe the Intel > >Wi-Fi - but that should not count). > > > >The laptop came with Windows but I blew that away - did I mess some > >thing up regarding HPA and its ilk? > > It means the drive reported command tags were completed that were not > outstanding. What kind of drive is this? [ 29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 [ 29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max UDMA/100 [ 29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32) [ 29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: spurious completions during NCQ?
On Wed, Jun 06, 2007 at 08:28:07AM -0600, Robert Hancock wrote: This is on a Thinkpad T60 with 2 GB RAM, running Ubuntu 7.04 (kernel 2.6.20-16-generic). No proprietary drivers (ok, maybe the Intel Wi-Fi - but that should not count). The laptop came with Windows but I blew that away - did I mess some thing up regarding HPA and its ilk? It means the drive reported command tags were completed that were not outstanding. What kind of drive is this? [ 29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 [ 29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max UDMA/100 [ 29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32) [ 29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156 301488 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
spurious completions during NCQ?
Hello, I was working on a I/O heavy workload (parsing 100K spam messages to extract certain structures) when I got this in the kernel log: [ 2320.132893] ata1.00: exception Emask 0x2 SAct 0x701f SErr 0x0 action 0x2 frozen [ 2320.132899] ata1.00: (spurious completions during NCQ issue=0x0 SAct=0x701f FIS=005040a1:0800) [ 2320.132905] ata1.00: cmd 61/10:00:59:fc:d0/00:00:07:00:00/40 tag 0 cdb 0x0 data 8192 out [ 2320.132906] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132911] ata1.00: cmd 61/10:08:69:fc:d0/00:00:07:00:00/40 tag 1 cdb 0x0 data 8192 out [ 2320.132913] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132918] ata1.00: cmd 61/08:10:19:4c:d1/00:00:07:00:00/40 tag 2 cdb 0x0 data 4096 out [ 2320.132919] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132924] ata1.00: cmd 61/01:18:fb:27:0f/00:00:08:00:00/40 tag 3 cdb 0x0 data 512 out [ 2320.132925] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132930] ata1.00: cmd 61/08:20:11:28:0f/00:00:08:00:00/40 tag 4 cdb 0x0 data 4096 out [ 2320.132932] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132937] ata1.00: cmd 61/08:28:91:92:17/00:00:08:00:00/40 tag 5 cdb 0x0 data 4096 out [ 2320.132938] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132943] ata1.00: cmd 61/08:30:99:b3:17/00:00:08:00:00/40 tag 6 cdb 0x0 data 4096 out [ 2320.132944] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132949] ata1.00: cmd 61/01:38:5b:56:4b/00:00:05:00:00/40 tag 7 cdb 0x0 data 512 out [ 2320.132950] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132956] ata1.00: cmd 61/08:40:71:56:4b/00:00:05:00:00/40 tag 8 cdb 0x0 data 4096 out [ 2320.132957] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132962] ata1.00: cmd 61/08:48:09:cf:5a/00:00:05:00:00/40 tag 9 cdb 0x0 data 4096 out [ 2320.132963] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132968] ata1.00: cmd 61/01:50:3b:af:8b/00:00:05:00:00/40 tag 10 cdb 0x0 data 512 out [ 2320.132969] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132974] ata1.00: cmd 61/08:58:51:af:8b/00:00:05:00:00/40 tag 11 cdb 0x0 data 4096 out [ 2320.132976] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132981] ata1.00: cmd 61/08:60:d9:f1:8d/00:00:05:00:00/40 tag 12 cdb 0x0 data 4096 out [ 2320.132982] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132987] ata1.00: cmd 61/08:68:49:bd:8e/00:00:05:00:00/40 tag 13 cdb 0x0 data 4096 out [ 2320.132988] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132993] ata1.00: cmd 61/08:70:d9:63:d1/00:00:05:00:00/40 tag 14 cdb 0x0 data 4096 out [ 2320.132995] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133000] ata1.00: cmd 61/08:78:91:39:0f/00:00:06:00:00/40 tag 15 cdb 0x0 data 4096 out [ 2320.133001] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133006] ata1.00: cmd 61/08:80:49:30:97/00:00:06:00:00/40 tag 16 cdb 0x0 data 4096 out [ 2320.133007] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133013] ata1.00: cmd 61/08:88:41:c0:d0/00:00:06:00:00/40 tag 17 cdb 0x0 data 4096 out [ 2320.133014] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133019] ata1.00: cmd 61/08:90:b9:8c:d1/00:00:06:00:00/40 tag 18 cdb 0x0 data 4096 out [ 2320.133020] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133025] ata1.00: cmd 61/01:98:1b:cf:ce/00:00:07:00:00/40 tag 19 cdb 0x0 data 512 out [ 2320.133027] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133032] ata1.00: cmd 61/08:a0:31:cf:ce/00:00:07:00:00/40 tag 20 cdb 0x0 data 4096 out [ 2320.133033] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133038] ata1.00: cmd 61/10:e0:41:af:8b/00:00:05:00:00/40 tag 28 cdb 0x0 data 8192 out [ 2320.133039] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133044] ata1.00: cmd 61/01:e8:ba:12:8d/00:00:06:00:00/40 tag 29 cdb 0x0 data 512 out [ 2320.133046] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133051] ata1.00: cmd 61/10:f0:c1:12:8d/00:00:06:00:00/40 tag 30 cdb 0x0 data 8192 out [ 2320.133052] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.378387] ata1: soft resetting port [ 2320.442169] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 2320.460012] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156301488 [ 2320.461395] ata1.00: ata_hpa_resize
spurious completions during NCQ?
Hello, I was working on a I/O heavy workload (parsing 100K spam messages to extract certain structures) when I got this in the kernel log: [ 2320.132893] ata1.00: exception Emask 0x2 SAct 0x701f SErr 0x0 action 0x2 frozen [ 2320.132899] ata1.00: (spurious completions during NCQ issue=0x0 SAct=0x701f FIS=005040a1:0800) [ 2320.132905] ata1.00: cmd 61/10:00:59:fc:d0/00:00:07:00:00/40 tag 0 cdb 0x0 data 8192 out [ 2320.132906] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132911] ata1.00: cmd 61/10:08:69:fc:d0/00:00:07:00:00/40 tag 1 cdb 0x0 data 8192 out [ 2320.132913] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132918] ata1.00: cmd 61/08:10:19:4c:d1/00:00:07:00:00/40 tag 2 cdb 0x0 data 4096 out [ 2320.132919] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132924] ata1.00: cmd 61/01:18:fb:27:0f/00:00:08:00:00/40 tag 3 cdb 0x0 data 512 out [ 2320.132925] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132930] ata1.00: cmd 61/08:20:11:28:0f/00:00:08:00:00/40 tag 4 cdb 0x0 data 4096 out [ 2320.132932] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132937] ata1.00: cmd 61/08:28:91:92:17/00:00:08:00:00/40 tag 5 cdb 0x0 data 4096 out [ 2320.132938] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132943] ata1.00: cmd 61/08:30:99:b3:17/00:00:08:00:00/40 tag 6 cdb 0x0 data 4096 out [ 2320.132944] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132949] ata1.00: cmd 61/01:38:5b:56:4b/00:00:05:00:00/40 tag 7 cdb 0x0 data 512 out [ 2320.132950] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132956] ata1.00: cmd 61/08:40:71:56:4b/00:00:05:00:00/40 tag 8 cdb 0x0 data 4096 out [ 2320.132957] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132962] ata1.00: cmd 61/08:48:09:cf:5a/00:00:05:00:00/40 tag 9 cdb 0x0 data 4096 out [ 2320.132963] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132968] ata1.00: cmd 61/01:50:3b:af:8b/00:00:05:00:00/40 tag 10 cdb 0x0 data 512 out [ 2320.132969] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132974] ata1.00: cmd 61/08:58:51:af:8b/00:00:05:00:00/40 tag 11 cdb 0x0 data 4096 out [ 2320.132976] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132981] ata1.00: cmd 61/08:60:d9:f1:8d/00:00:05:00:00/40 tag 12 cdb 0x0 data 4096 out [ 2320.132982] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132987] ata1.00: cmd 61/08:68:49:bd:8e/00:00:05:00:00/40 tag 13 cdb 0x0 data 4096 out [ 2320.132988] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.132993] ata1.00: cmd 61/08:70:d9:63:d1/00:00:05:00:00/40 tag 14 cdb 0x0 data 4096 out [ 2320.132995] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133000] ata1.00: cmd 61/08:78:91:39:0f/00:00:06:00:00/40 tag 15 cdb 0x0 data 4096 out [ 2320.133001] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133006] ata1.00: cmd 61/08:80:49:30:97/00:00:06:00:00/40 tag 16 cdb 0x0 data 4096 out [ 2320.133007] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133013] ata1.00: cmd 61/08:88:41:c0:d0/00:00:06:00:00/40 tag 17 cdb 0x0 data 4096 out [ 2320.133014] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133019] ata1.00: cmd 61/08:90:b9:8c:d1/00:00:06:00:00/40 tag 18 cdb 0x0 data 4096 out [ 2320.133020] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133025] ata1.00: cmd 61/01:98:1b:cf:ce/00:00:07:00:00/40 tag 19 cdb 0x0 data 512 out [ 2320.133027] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133032] ata1.00: cmd 61/08:a0:31:cf:ce/00:00:07:00:00/40 tag 20 cdb 0x0 data 4096 out [ 2320.133033] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133038] ata1.00: cmd 61/10:e0:41:af:8b/00:00:05:00:00/40 tag 28 cdb 0x0 data 8192 out [ 2320.133039] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133044] ata1.00: cmd 61/01:e8:ba:12:8d/00:00:06:00:00/40 tag 29 cdb 0x0 data 512 out [ 2320.133046] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.133051] ata1.00: cmd 61/10:f0:c1:12:8d/00:00:06:00:00/40 tag 30 cdb 0x0 data 8192 out [ 2320.133052] res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM violation) [ 2320.378387] ata1: soft resetting port [ 2320.442169] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 2320.460012] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156301488 [ 2320.461395] ata1.00: ata_hpa_resize
Re: [1/4] 2.6.22-rc3: known regressions
On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote: > On Tue 29-05-07 14:52:53, Michal Piotrowski wrote: > > Here is a list of some known regressions in 2.6.22-rc3. > > > > Subject: Oops in dentry_iput with 2.6.22-rc2 on AMD64 > > References : http://lkml.org/lkml/2007/5/22/4 > > Submitter : Florin Iucha <[EMAIL PROTECTED]> > > Status : Unknown > Actually, the bug seems to be unreproducible and it has probably been a > 1-bit flip. So I'd be reluctant to call it a regression... I agree with this statement. I'll ping Michal and Jan if the oops resurfaces. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [1/4] 2.6.22-rc3: known regressions
On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote: On Tue 29-05-07 14:52:53, Michal Piotrowski wrote: Here is a list of some known regressions in 2.6.22-rc3. Subject: Oops in dentry_iput with 2.6.22-rc2 on AMD64 References : http://lkml.org/lkml/2007/5/22/4 Submitter : Florin Iucha [EMAIL PROTECTED] Status : Unknown Actually, the bug seems to be unreproducible and it has probably been a 1-bit flip. So I'd be reluctant to call it a regression... I agree with this statement. I'll ping Michal and Jan if the oops resurfaces. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Oops in dentry_iput with 2.6.22-rc2 on AMD64
On Tue, May 22, 2007 at 11:57:11AM +0200, Jan Kara wrote: > > I was running a multithreaded perl application that leaks some memory > > so it gets to eat up a significant chunk of my 2 GB and even push a > > bit into swap. I left it running before going out for a walk. > Hmm, what seems suspitious is, that in R12 (which probably contains > the address dereferenced later) is address 9100... while all other > addresses start with 8100. So it seems to me it could be a 1-bit > flip. Care to check your memory with memtest? I let it run overnight: 8 passes and no surprises. This is a system that has been quite stable for a year and a half, except finding the occasional kernel bug ;) > Also this is a code all other people use all the time so I guess we > would see more reports if this was some general bug... Haven't got any more oopses like that, either. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Oops in dentry_iput with 2.6.22-rc2 on AMD64
On Tue, May 22, 2007 at 11:57:11AM +0200, Jan Kara wrote: I was running a multithreaded perl application that leaks some memory so it gets to eat up a significant chunk of my 2 GB and even push a bit into swap. I left it running before going out for a walk. Hmm, what seems suspitious is, that in R12 (which probably contains the address dereferenced later) is address 9100... while all other addresses start with 8100. So it seems to me it could be a 1-bit flip. Care to check your memory with memtest? I let it run overnight: 8 passes and no surprises. This is a system that has been quite stable for a year and a half, except finding the occasional kernel bug ;) Also this is a code all other people use all the time so I guess we would see more reports if this was some general bug... Haven't got any more oopses like that, either. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Oops in dentry_iput with 2.6.22-rc2 on AMD64
I was running a multithreaded perl application that leaks some memory so it gets to eat up a significant chunk of my 2 GB and even push a bit into swap. I left it running before going out for a walk. When I got back, I found this in the log: [28818.103829] Unable to handle kernel paging request at 910025c9a8b4 RIP: [28818.103836] [] dentry_iput+0x58/0xbb [28818.103845] PGD 0 [28818.103848] Oops: [1] SMP [28818.103851] CPU 0 [28818.103853] Modules linked in: radeon ntfs sbp2 lp lgdt330x cx88_dvb cx88_vp3054_i2c dvb_pll video_buf_dvb tuner cx8802 cx88_alsa cx8800 cx88xx ir_common rtc tveeprom video_buf btcx_risc i2c_nforce2 evdev forcedeth [28818.103868] Pid: 253, comm: kswapd0 Not tainted 2.6.22-rc2 #1 [28818.103871] RIP: 0010:[] [] dentry_iput+0x58/0xbb [28818.103877] RSP: :81007de75d40 EFLAGS: 00010246 [28818.103880] RAX: RBX: 810075662e00 RCX: 810025c9a898 [28818.103883] RDX: 810025c9a898 RSI: 0001 RDI: 0282 [28818.103886] RBP: 81007de75d50 R08: 534d R09: 0064 [28818.103890] R10: 81007de75d80 R11: 000a R12: 910025c9a868 [28818.103893] R13: 810075662e08 R14: R15: 005e [28818.103897] FS: 42804940() GS:806c8000() knlGS: [28818.103900] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [28818.103903] CR2: 910025c9a8b4 CR3: 254bf000 CR4: 06e0 [28818.103907] Process kswapd0 (pid: 253, threadinfo 81007de74000, task 81007de706d0) [28818.103909] Stack: 810075662e00 0001 81007de75d70 80288aaf [28818.103916] 81007e606070 0001 81007de75d90 80289a35 [28818.103921] 81007e606070 810075662e00 81007de75dd0 80289d3d [28818.103926] Call Trace: [28818.103931] [] d_kill+0x38/0x58 [28818.103935] [] prune_one_dentry+0x3c/0x10f [28818.103939] [] prune_dcache+0x137/0x1a0 [28818.103944] [] shrink_dcache_memory+0x1c/0x35 [28818.103948] [] shrink_slab+0xe6/0x162 [28818.103953] [] kswapd+0x329/0x4ca [28818.103958] [] autoremove_wake_function+0x0/0x38 [28818.103963] [] kswapd+0x0/0x4ca [28818.103967] [] kthread+0x49/0x76 [28818.103971] [] child_rip+0xa/0x12 [28818.103977] [] kthread+0x0/0x76 [28818.103980] [] child_rip+0x0/0x12 [28818.103982] [28818.103984] [28818.103985] Code: 41 83 7c 24 4c 00 75 1c 4c 89 e7 45 31 c0 31 c9 31 d2 be 00 [28818.103994] RIP [] dentry_iput+0x58/0xbb [28818.103999] RSP [28818.104001] CR2: 910025c9a8b4 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Oops in dentry_iput with 2.6.22-rc2 on AMD64
I was running a multithreaded perl application that leaks some memory so it gets to eat up a significant chunk of my 2 GB and even push a bit into swap. I left it running before going out for a walk. When I got back, I found this in the log: [28818.103829] Unable to handle kernel paging request at 910025c9a8b4 RIP: [28818.103836] [80288998] dentry_iput+0x58/0xbb [28818.103845] PGD 0 [28818.103848] Oops: [1] SMP [28818.103851] CPU 0 [28818.103853] Modules linked in: radeon ntfs sbp2 lp lgdt330x cx88_dvb cx88_vp3054_i2c dvb_pll video_buf_dvb tuner cx8802 cx88_alsa cx8800 cx88xx ir_common rtc tveeprom video_buf btcx_risc i2c_nforce2 evdev forcedeth [28818.103868] Pid: 253, comm: kswapd0 Not tainted 2.6.22-rc2 #1 [28818.103871] RIP: 0010:[80288998] [80288998] dentry_iput+0x58/0xbb [28818.103877] RSP: :81007de75d40 EFLAGS: 00010246 [28818.103880] RAX: RBX: 810075662e00 RCX: 810025c9a898 [28818.103883] RDX: 810025c9a898 RSI: 0001 RDI: 0282 [28818.103886] RBP: 81007de75d50 R08: 534d R09: 0064 [28818.103890] R10: 81007de75d80 R11: 000a R12: 910025c9a868 [28818.103893] R13: 810075662e08 R14: R15: 005e [28818.103897] FS: 42804940() GS:806c8000() knlGS: [28818.103900] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [28818.103903] CR2: 910025c9a8b4 CR3: 254bf000 CR4: 06e0 [28818.103907] Process kswapd0 (pid: 253, threadinfo 81007de74000, task 81007de706d0) [28818.103909] Stack: 810075662e00 0001 81007de75d70 80288aaf [28818.103916] 81007e606070 0001 81007de75d90 80289a35 [28818.103921] 81007e606070 810075662e00 81007de75dd0 80289d3d [28818.103926] Call Trace: [28818.103931] [80288aaf] d_kill+0x38/0x58 [28818.103935] [80289a35] prune_one_dentry+0x3c/0x10f [28818.103939] [80289d3d] prune_dcache+0x137/0x1a0 [28818.103944] [80289dc2] shrink_dcache_memory+0x1c/0x35 [28818.103948] [80261178] shrink_slab+0xe6/0x162 [28818.103953] [802615ef] kswapd+0x329/0x4ca [28818.103958] [8024458b] autoremove_wake_function+0x0/0x38 [28818.103963] [802612c6] kswapd+0x0/0x4ca [28818.103967] [8024447f] kthread+0x49/0x76 [28818.103971] [8020a578] child_rip+0xa/0x12 [28818.103977] [80244436] kthread+0x0/0x76 [28818.103980] [8020a56e] child_rip+0x0/0x12 [28818.103982] [28818.103984] [28818.103985] Code: 41 83 7c 24 4c 00 75 1c 4c 89 e7 45 31 c0 31 c9 31 d2 be 00 [28818.103994] RIP [80288998] dentry_iput+0x58/0xbb [28818.103999] RSP 81007de75d40 [28818.104001] CR2: 910025c9a8b4 florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
BUG with today's git
Kernel: v2.6.22-rc1-g0479ea0 Got this in the logs: [ 8314.672340] BUG: at fs/inotify.c:172 set_dentry_child_flags() [ 8314.672345] [ 8314.672346] Call Trace: [ 8314.672353] [] _spin_lock+0x9/0xb [ 8314.672361] [] set_dentry_child_flags+0x6d/0x14f [ 8314.672366] [] remove_watch_no_event+0x68/0x78 [ 8314.672370] [] inotify_remove_watch_locked+0x18/0x3a [ 8314.672374] [] inotify_rm_wd+0x87/0xad [ 8314.672378] [] sys_inotify_rm_watch+0x4e/0x6d [ 8314.672383] [] system_call+0x7e/0x83 [ 8314.672387] Machine was mostly idle. A movie dvd was in the DVD-ROM, but not being played. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
BUG with today's git
Kernel: v2.6.22-rc1-g0479ea0 Got this in the logs: [ 8314.672340] BUG: at fs/inotify.c:172 set_dentry_child_flags() [ 8314.672345] [ 8314.672346] Call Trace: [ 8314.672353] [8054a44c] _spin_lock+0x9/0xb [ 8314.672361] [8029fc0f] set_dentry_child_flags+0x6d/0x14f [ 8314.672366] [8029fd59] remove_watch_no_event+0x68/0x78 [ 8314.672370] [8029fe5d] inotify_remove_watch_locked+0x18/0x3a [ 8314.672374] [802a0164] inotify_rm_wd+0x87/0xad [ 8314.672378] [802a07b3] sys_inotify_rm_watch+0x4e/0x6d [ 8314.672383] [8020975e] system_call+0x7e/0x83 [ 8314.672387] Machine was mostly idle. A movie dvd was in the DVD-ROM, but not being played. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Current git fails to compile on AMD64 for three days in a row
I just pulled the current git (de5603748af8bf7deac403e6ba92887f8d18e812) and tried to compile it on my AMD64 box, to test Chuck's RPC fix. It fails: arch/x86_64/kernel/head64.c: In function ‘x86_64_start_kernel’: arch/x86_64/kernel/head64.c:70: error: size of array ‘type name’ is negative It is failing in the same way since Monday. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Current git fails to compile on AMD64 for three days in a row
I just pulled the current git (de5603748af8bf7deac403e6ba92887f8d18e812) and tried to compile it on my AMD64 box, to test Chuck's RPC fix. It fails: arch/x86_64/kernel/head64.c: In function ‘x86_64_start_kernel’: arch/x86_64/kernel/head64.c:70: error: size of array ‘type name’ is negative It is failing in the same way since Monday. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
REGRESSION: cannot change file access permissions on files NFS-mounted
Hello, This morning I updated the kernel on my workstation to the current git tree (62ea6d80211ecc88ef516927ecebf64cb505be3f). Upon reboot, I cannot change file access permissions of files in a directory that is nfs mounted (using NFS4): $ chmod 0600 $path chmod: Changing permissions of `$path': Input/output error In the kernel log I am getting these gems: [ 275.990233] decode_attr_group: reply buffer overflowed in line 2678.<4>decode _attr_group: reply buffer overflowed in line 2678.<4>decode_attr_group: reply bu ffer overflowed in line 2678.<4>decode_attr_group: reply buffer overflowed in li ne 2678.<4>decode_attr_group: reply buffer overflowed in line 2678.<4>decode_att r_group: reply buffer overflowed in line 2678. The server is Linux running Debian testing with the kernel updated to 2.6.20-1-686 from (http://ftp.us.debian.org/debian/pool/main/l/linux-2.6/linux-image-2.6.20-1-686_2.6.20-3_i386.deb) Kernel 2.6.21 works fine on the client. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
REGRESSION: cannot change file access permissions on files NFS-mounted
Hello, This morning I updated the kernel on my workstation to the current git tree (62ea6d80211ecc88ef516927ecebf64cb505be3f). Upon reboot, I cannot change file access permissions of files in a directory that is nfs mounted (using NFS4): $ chmod 0600 $path chmod: Changing permissions of `$path': Input/output error In the kernel log I am getting these gems: [ 275.990233] decode_attr_group: reply buffer overflowed in line 2678.4decode _attr_group: reply buffer overflowed in line 2678.4decode_attr_group: reply bu ffer overflowed in line 2678.4decode_attr_group: reply buffer overflowed in li ne 2678.4decode_attr_group: reply buffer overflowed in line 2678.4decode_att r_group: reply buffer overflowed in line 2678. The server is Linux running Debian testing with the kernel updated to 2.6.20-1-686 from (http://ftp.us.debian.org/debian/pool/main/l/linux-2.6/linux-image-2.6.20-1-686_2.6.20-3_i386.deb) Kernel 2.6.21 works fine on the client. Cheers, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/5] 2.6.21-rc7 NFS writes: fix a series of issues
On Fri, Apr 20, 2007 at 04:03:58PM -0400, Trond Myklebust wrote: > I've split the issues introduced by the 2.6.21-rcX write code up into 4 > subproblems. [snip] > My thanks to the various patient victim^Wpeople who helped with extensive > testing. I've pulled the tree this morning (0f851021c0f91e5073fa89f26b5ac68e23df8e11) and ran big-copy from a gnome session. All is well. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/5] 2.6.21-rc7 NFS writes: fix a series of issues
On Fri, Apr 20, 2007 at 04:03:58PM -0400, Trond Myklebust wrote: I've split the issues introduced by the 2.6.21-rcX write code up into 4 subproblems. [snip] My thanks to the various patient victim^Wpeople who helped with extensive testing. I've pulled the tree this morning (0f851021c0f91e5073fa89f26b5ac68e23df8e11) and ran big-copy from a gnome session. All is well. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Fri, Apr 20, 2007 at 09:37:30AM -0400, Trond Myklebust wrote: > Thanks! Did you ever find out what had happened to the test that hung > last night? Nope. I could not ssh into it and the machine was needed for some windows duty before I got home ;) I'll try again this coming week-end and let you know if I see any problems. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 04:49:31PM -0500, Florin Iucha wrote: > On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote: > > > I'm far from the machine right now, so I will do some more tests > > > tonight, but right now, the new patchset is not good. What is the > > > difference between reverting the patch you sent yesterday and your > > > current fifth patch? I assume the other four are identical, right? > > > > The only difference is the way in which we handle retries of an NFSv4 > > request: the new patch disconnects if and only if a timeout has > > occurred, or the server sends us garbage. > > I have to mention that I rebased to the head of the tree > (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your > patches, in order to test what I expect the official tree to be. > > Tonight I'll test this kernel once more, then go back to 21-rc7 and > apply your 5 patches and re-test. It passed big-copy, and the copy run from the gnome-session while I did my morning light browsing, email reading, etc. kernel: 895e1fc7226e6732bc77138955b6c7dfa279f57a patches: linux-2.6.21-001-cleanup_unstable_write.dif linux-2.6.21-002-defer_clearing_pg_writeback.dif linux-2.6.21-003-fix_desynchronised_ncommit.dif linux-2.6.21-004-fix_nfs_set_page_dirty.dif linux-2.6.21-005-fix_nfsv4_resend.dif Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 04:49:31PM -0500, Florin Iucha wrote: On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote: I'm far from the machine right now, so I will do some more tests tonight, but right now, the new patchset is not good. What is the difference between reverting the patch you sent yesterday and your current fifth patch? I assume the other four are identical, right? The only difference is the way in which we handle retries of an NFSv4 request: the new patch disconnects if and only if a timeout has occurred, or the server sends us garbage. I have to mention that I rebased to the head of the tree (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your patches, in order to test what I expect the official tree to be. Tonight I'll test this kernel once more, then go back to 21-rc7 and apply your 5 patches and re-test. It passed big-copy, and the copy run from the gnome-session while I did my morning light browsing, email reading, etc. kernel: 895e1fc7226e6732bc77138955b6c7dfa279f57a patches: linux-2.6.21-001-cleanup_unstable_write.dif linux-2.6.21-002-defer_clearing_pg_writeback.dif linux-2.6.21-003-fix_desynchronised_ncommit.dif linux-2.6.21-004-fix_nfs_set_page_dirty.dif linux-2.6.21-005-fix_nfsv4_resend.dif Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Fri, Apr 20, 2007 at 09:37:30AM -0400, Trond Myklebust wrote: Thanks! Did you ever find out what had happened to the test that hung last night? Nope. I could not ssh into it and the machine was needed for some windows duty before I got home ;) I'll try again this coming week-end and let you know if I see any problems. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote: > > I'm far from the machine right now, so I will do some more tests > > tonight, but right now, the new patchset is not good. What is the > > difference between reverting the patch you sent yesterday and your > > current fifth patch? I assume the other four are identical, right? > > The only difference is the way in which we handle retries of an NFSv4 > request: the new patch disconnects if and only if a timeout has > occurred, or the server sends us garbage. I have to mention that I rebased to the head of the tree (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your patches, in order to test what I expect the official tree to be. Tonight I'll test this kernel once more, then go back to 21-rc7 and apply your 5 patches and re-test. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 12:09:42PM -0400, Trond Myklebust wrote: > See >http://client.linux-nfs.org/Linux-2.6.x/2.6.21-rc7/ > > I'm giving the first 5 patches of that series (i.e. > linux-2.6.21-001-cleanup_unstable_write.dif to > linux-2.6.21-005-fix_nfsv4_resend.dif) an extra beating since those are > the ones that I feel should go into 2.6.21 final in order to fix the > read/write regressions that have been reported. They should be identical > to the patches that I posted on lkml in the past 3 days. > > Please feel free to grab them and give them a test. The copy completed some time ago, but now I cannot ssh into the box! This is a new development, as before I was always able to ssh into, even when the copy slowed down to a trickle. I'm far from the machine right now, so I will do some more tests tonight, but right now, the new patchset is not good. What is the difference between reverting the patch you sent yesterday and your current fifth patch? I assume the other four are identical, right? Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 11:17:28AM -0400, Trond Myklebust wrote: > On Thu, 2007-04-19 at 11:12 -0400, Chuck Lever wrote: > > Perhaps instead of looking at the number of bytes sent, the logic in the > > last hunk of this patch should check which queue the request is sitting on. > > ??? It would be a bug for the request to be sitting on _any_ queue when > it enters xprt_transmit(). > > Here is the patch that I'm currently testing. Trond, What is the set of patches that are you testing? I'd like to give that a spin tonight as well. It is possible that what makes my configuration more susceptible to the problem is the fact that the client significantly overpowers the server: Athlon x2 4200+ with 2Gb of RAM for the client vs. PIII 1Ghz 512 MB RAM for the server. They both have gigabit ethernet and both NICs and the switch support jumbo frames. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 11:17:28AM -0400, Trond Myklebust wrote: On Thu, 2007-04-19 at 11:12 -0400, Chuck Lever wrote: Perhaps instead of looking at the number of bytes sent, the logic in the last hunk of this patch should check which queue the request is sitting on. ??? It would be a bug for the request to be sitting on _any_ queue when it enters xprt_transmit(). Here is the patch that I'm currently testing. Trond, What is the set of patches that are you testing? I'd like to give that a spin tonight as well. It is possible that what makes my configuration more susceptible to the problem is the fact that the client significantly overpowers the server: Athlon x2 4200+ with 2Gb of RAM for the client vs. PIII 1Ghz 512 MB RAM for the server. They both have gigabit ethernet and both NICs and the switch support jumbo frames. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 12:09:42PM -0400, Trond Myklebust wrote: See http://client.linux-nfs.org/Linux-2.6.x/2.6.21-rc7/ I'm giving the first 5 patches of that series (i.e. linux-2.6.21-001-cleanup_unstable_write.dif to linux-2.6.21-005-fix_nfsv4_resend.dif) an extra beating since those are the ones that I feel should go into 2.6.21 final in order to fix the read/write regressions that have been reported. They should be identical to the patches that I posted on lkml in the past 3 days. Please feel free to grab them and give them a test. The copy completed some time ago, but now I cannot ssh into the box! This is a new development, as before I was always able to ssh into, even when the copy slowed down to a trickle. I'm far from the machine right now, so I will do some more tests tonight, but right now, the new patchset is not good. What is the difference between reverting the patch you sent yesterday and your current fifth patch? I assume the other four are identical, right? Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote: I'm far from the machine right now, so I will do some more tests tonight, but right now, the new patchset is not good. What is the difference between reverting the patch you sent yesterday and your current fifth patch? I assume the other four are identical, right? The only difference is the way in which we handle retries of an NFSv4 request: the new patch disconnects if and only if a timeout has occurred, or the server sends us garbage. I have to mention that I rebased to the head of the tree (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your patches, in order to test what I expect the official tree to be. Tonight I'll test this kernel once more, then go back to 21-rc7 and apply your 5 patches and re-test. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 10:45:13PM -0400, Trond Myklebust wrote: > On Wed, 2007-04-18 at 20:52 -0500, Florin Iucha wrote: > > It seems that my original problem report had a big mistake! There is > > no hang, but at some point the write slows down to a trickle (from > > 40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log. > > Yeah. You only captured the outgoing traffic to the server, but already > it looks as if there were 'interesting' things going on. In frames 29346 > to 29350, the traffic stops altogether for 5 seconds (I only see > keepalives) then it starts up again. Ditto for frames 40477-40482 > (another 5 seconds). ... > Then at around frame 92072, the client starts to send a bunch of RSTs. > Aha I'll bet that reverting the appended patch fixes the problem. You win! Reverting this patch (on top of your previous 5) allowed the big copy to complete (70GB) as well as successful log-in to gnome! Acked-By: Florin Iucha <[EMAIL PROTECTED]> Thanks so much for the patience with this elusive bug and stubborn bugreporter! Regards, florin > --- > commit 43d78ef2ba5bec26d0315859e8324bfc0be23766 > Author: Chuck Lever <[EMAIL PROTECTED]> > Date: Tue Feb 6 18:26:11 2007 -0500 > > NFS: disconnect before retrying NFSv4 requests over TCP > > RFC3530 section 3.1.1 states an NFSv4 client MUST NOT send a request > twice on the same connection unless it is the NULL procedure. Section > 3.1.1 suggests that the client should disconnect and reconnect if it > wants to retry a request. > > Implement this by adding an rpc_clnt flag that an ULP can use to > specify that the underlying transport should be disconnected on a > major timeout. The NFSv4 client asserts this new flag, and requests > no retries after a minor retransmit timeout. > > Note that disconnecting on a retransmit is in general not safe to do > if the RPC client does not reuse the TCP port number when reconnecting. > > See http://bugzilla.linux-nfs.org/show_bug.cgi?id=6 > > Signed-off-by: Chuck Lever <[EMAIL PROTECTED]> > Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]> > > diff --git a/fs/nfs/client.c b/fs/nfs/client.c > index a3191f0..c46e94f 100644 > --- a/fs/nfs/client.c > +++ b/fs/nfs/client.c > @@ -394,7 +394,8 @@ static void nfs_init_timeout_values(struct rpc_timeout > *to, int proto, > static int nfs_create_rpc_client(struct nfs_client *clp, int proto, > unsigned int timeo, > unsigned int retrans, > - rpc_authflavor_t flavor) > + rpc_authflavor_t flavor, > + int flags) > { > struct rpc_timeout timeparms; > struct rpc_clnt *clnt = NULL; > @@ -407,6 +408,7 @@ static int nfs_create_rpc_client(struct nfs_client *clp, > int proto, > .program= _program, > .version= clp->rpc_ops->version, > .authflavor = flavor, > + .flags = flags, > }; > > if (!IS_ERR(clp->cl_rpcclient)) > @@ -548,7 +550,7 @@ static int nfs_init_client(struct nfs_client *clp, const > struct nfs_mount_data * >* - RFC 2623, sec 2.3.2 >*/ > error = nfs_create_rpc_client(clp, proto, data->timeo, data->retrans, > - RPC_AUTH_UNIX); > + RPC_AUTH_UNIX, 0); > if (error < 0) > goto error; > nfs_mark_client_ready(clp, NFS_CS_READY); > @@ -868,7 +870,8 @@ static int nfs4_init_client(struct nfs_client *clp, > /* Check NFS protocol revision and initialize RPC op vector */ > clp->rpc_ops = _v4_clientops; > > - error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour); > + error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour, > + RPC_CLNT_CREATE_DISCRTRY); > if (error < 0) > goto error; > memcpy(clp->cl_ipaddr, ip_addr, sizeof(clp->cl_ipaddr)); > diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h > index a1be89d..c7a78ee 100644 > --- a/include/linux/sunrpc/clnt.h > +++ b/include/linux/sunrpc/clnt.h > @@ -40,6 +40,7 @@ struct rpc_clnt { > > unsigned intcl_softrtry : 1,/* soft timeouts */ > cl_intr : 1,/* interruptible */ > + cl_discrtry : 1,/* disconnect bef
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote: > Do you have a copy of wireshark or ethereal on hand? If so, could you > take a look at whether or not any NFS traffic is going between the > client and server once the hang happens? I used the following command tcpdump -w nfs-traffic -i eth0 -vv -tt dst port nfs to capture http://iucha.net/nfs/21-rc7-nfs4/nfs-traffic.bz2 I started the capture before starting the copy and left it to run for a few minutes after the traffic slowed to a crawl. The iostat and vmstat are at: http://iucha.net/nfs/21-rc7-nfs4/iostat http://iucha.net/nfs/21-rc7-nfs4/vmstat It seems that my original problem report had a big mistake! There is no hang, but at some point the write slows down to a trickle (from 40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log. Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote: > On Wed, 2007-04-18 at 08:42 -0500, Florin Iucha wrote: > > Could the port in CLOSE_WAIT state be the culprit? (FWIW > > the server has been up for 38 days and subjected to > > this nfs test quite a bit without showing any stress). > > The port in CLOSE_WAIT shows that a socket was closed down recently, but > once the connection is re-established, the client should start sending > data. > Do you have a copy of wireshark or ethereal on hand? If so, could you > take a look at whether or not any NFS traffic is going between the > client and server once the hang happens? > Note that the timeout value is 60 seconds, so if you see no immediate > traffic, then let the ethereal/wireshark session keep running for a > couple more minutes. Should I run wireshark/ethereal on the client or on the server? I'll get a trace tonight (10 PM CST) and get back to you. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 08:42:25AM -0500, Florin Iucha wrote: > On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote: > The netstat outputs are stable (not changed in 5 minutes): > >http://iucha.net/nfs/21-rc7-nfs3/netstat-server : > > tcp1 0 hermes.iucha.org:nfszeus.iucha.org:799 > CLOSE_WAIT > tcp0 0 hermes.iucha.org:nfszeus.iucha.org:976 > ESTABLISHED > >http://iucha.net/nfs/21-rc7-nfs3/netstat-client > > Active Internet connections (w/o servers) > Proto Recv-Q Send-Q Local Address Foreign Address State > > tcp0 0 zeus.iucha.org:976 hermes.iucha.org:nfs > ESTABLISHED > tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:56880 > ESTABLISHED > tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:45176 > ESTABLISHED > > Could the port in CLOSE_WAIT state be the culprit? (FWIW > the server has been up for 38 days and subjected to > this nfs test quite a bit without showing any stress). The CLOSE_WAIT went away as soon as I rebooted the client. Something was holding it up... Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote: > There is only one request on the 'pending' queue. That would usually > indicate that the connection to the server is down. Can you check using > "netstat -t" whether or not there is a connection in the 'ESTABLISHED' > state to the server? Please also repeat the command a couple of times in > order to see if the socket/port number on the connection changes. This is with your fifth patch on top of the previous four patches: http://iucha.net/nfs/21-rc7-nfs3/big-copy Again, it has memory, stack traces and rpc_debug. The iostat 5 output: http://iucha.net/nfs/21-rc7-nfs3/iostat The netstat outputs are stable (not changed in 5 minutes): http://iucha.net/nfs/21-rc7-nfs3/netstat-server : tcp1 0 hermes.iucha.org:nfszeus.iucha.org:799 CLOSE_WAIT tcp0 0 hermes.iucha.org:nfszeus.iucha.org:976 ESTABLISHED http://iucha.net/nfs/21-rc7-nfs3/netstat-client Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp0 0 zeus.iucha.org:976 hermes.iucha.org:nfsESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:56880 ESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:45176 ESTABLISHED Could the port in CLOSE_WAIT state be the culprit? (FWIW the server has been up for 38 days and subjected to this nfs test quite a bit without showing any stress). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Tue, Apr 17, 2007 at 10:37:38PM -0700, Andrew Morton wrote: > Florin, can we please see /proc/meminfo as well? http://iucha.net/nfs/21-rc7-nfs2/meminfo > Also the result of `echo m > /proc/sysrq-trigger' http://iucha.net/nfs/21-rc7-nfs2/big-copy This has 'echo m > /proc/sysrq-trigger', 'echo t > /proc/sysrq-trigger' and 'echo 0 > /proc/sys/sunrpc/rpc_debug'. The output from the server's 'iostat 5' is at http://iucha.net/nfs/21-rc7-nfs2/iostat This run it copied 5.6G (vs yesterday's 2.5G). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Tue, Apr 17, 2007 at 10:37:38PM -0700, Andrew Morton wrote: Florin, can we please see /proc/meminfo as well? http://iucha.net/nfs/21-rc7-nfs2/meminfo Also the result of `echo m /proc/sysrq-trigger' http://iucha.net/nfs/21-rc7-nfs2/big-copy This has 'echo m /proc/sysrq-trigger', 'echo t /proc/sysrq-trigger' and 'echo 0 /proc/sys/sunrpc/rpc_debug'. The output from the server's 'iostat 5' is at http://iucha.net/nfs/21-rc7-nfs2/iostat This run it copied 5.6G (vs yesterday's 2.5G). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote: There is only one request on the 'pending' queue. That would usually indicate that the connection to the server is down. Can you check using netstat -t whether or not there is a connection in the 'ESTABLISHED' state to the server? Please also repeat the command a couple of times in order to see if the socket/port number on the connection changes. This is with your fifth patch on top of the previous four patches: http://iucha.net/nfs/21-rc7-nfs3/big-copy Again, it has memory, stack traces and rpc_debug. The iostat 5 output: http://iucha.net/nfs/21-rc7-nfs3/iostat The netstat outputs are stable (not changed in 5 minutes): http://iucha.net/nfs/21-rc7-nfs3/netstat-server : tcp1 0 hermes.iucha.org:nfszeus.iucha.org:799 CLOSE_WAIT tcp0 0 hermes.iucha.org:nfszeus.iucha.org:976 ESTABLISHED http://iucha.net/nfs/21-rc7-nfs3/netstat-client Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp0 0 zeus.iucha.org:976 hermes.iucha.org:nfsESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:56880 ESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:45176 ESTABLISHED Could the port in CLOSE_WAIT state be the culprit? (FWIW the server has been up for 38 days and subjected to this nfs test quite a bit without showing any stress). Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 08:42:25AM -0500, Florin Iucha wrote: On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote: The netstat outputs are stable (not changed in 5 minutes): http://iucha.net/nfs/21-rc7-nfs3/netstat-server : tcp1 0 hermes.iucha.org:nfszeus.iucha.org:799 CLOSE_WAIT tcp0 0 hermes.iucha.org:nfszeus.iucha.org:976 ESTABLISHED http://iucha.net/nfs/21-rc7-nfs3/netstat-client Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp0 0 zeus.iucha.org:976 hermes.iucha.org:nfs ESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:56880 ESTABLISHED tcp0 0 zeus.iucha.org:ssh hermes.iucha.org:45176 ESTABLISHED Could the port in CLOSE_WAIT state be the culprit? (FWIW the server has been up for 38 days and subjected to this nfs test quite a bit without showing any stress). The CLOSE_WAIT went away as soon as I rebooted the client. Something was holding it up... Regards, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote: On Wed, 2007-04-18 at 08:42 -0500, Florin Iucha wrote: Could the port in CLOSE_WAIT state be the culprit? (FWIW the server has been up for 38 days and subjected to this nfs test quite a bit without showing any stress). The port in CLOSE_WAIT shows that a socket was closed down recently, but once the connection is re-established, the client should start sending data. Do you have a copy of wireshark or ethereal on hand? If so, could you take a look at whether or not any NFS traffic is going between the client and server once the hang happens? Note that the timeout value is 60 seconds, so if you see no immediate traffic, then let the ethereal/wireshark session keep running for a couple more minutes. Should I run wireshark/ethereal on the client or on the server? I'll get a trace tonight (10 PM CST) and get back to you. Thanks, florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163 signature.asc Description: Digital signature