Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-02 Thread Florin Iucha
On Fri, Nov 02, 2007 at 08:56:55PM +0800, Fengguang Wu wrote:
> > > ---
> > >  fs/reiserfs/stree.c |3 ---
> > >  1 file changed, 3 deletions(-)
> > > 
> > > --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> > > +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> > > @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
> > >   }
> > >   bh = next;
> > >   } while (bh != head);
> > > - if (PAGE_SIZE == bh->b_size) {
> > > - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> > > - }
> > >   }
> > >   }
> > >  }
> > 
> > ... and with the above patch applied.
> > 
> > Copying 300 MB from root (ext3) to the new file system did not trigger
> > the pdflush condition.  But then I did a
> >cd $MOUNTPOINT && find . -exec md5sum {} \;
> > and that brought one cpu to 75% iowait.
> 
> Immediately? Do you have the debug printk messages this time(with the
> above patch)?

No, but I will add them this afternoon.

> > I have attached my .config, if it helps.
> 
> It's really curious - I tried your .config and commands, and still
> could not trigger the high iowait. I'm running 64bit Intel Core 2,
> and kernel 2.6.24-rc1-git6 with the above patch.

Curious but 100% reproducible, at least on my box.  What I'm going to
try is booting into the kernel with your patch and just doing the find
/ md5sum.  It would be really interesting if the read-only access
triggers it.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-02 Thread Florin Iucha
On Fri, Nov 02, 2007 at 08:56:55PM +0800, Fengguang Wu wrote:
   ---
fs/reiserfs/stree.c |3 ---
1 file changed, 3 deletions(-)
   
   --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
   +++ linux-2.6.24-git17/fs/reiserfs/stree.c
   @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
 }
 bh = next;
 } while (bh != head);
   - if (PAGE_SIZE == bh-b_size) {
   - cancel_dirty_page(page, PAGE_CACHE_SIZE);
   - }
 }
 }
}
  
  ... and with the above patch applied.
  
  Copying 300 MB from root (ext3) to the new file system did not trigger
  the pdflush condition.  But then I did a
 cd $MOUNTPOINT  find . -exec md5sum {} \;
  and that brought one cpu to 75% iowait.
 
 Immediately? Do you have the debug printk messages this time(with the
 above patch)?

No, but I will add them this afternoon.

  I have attached my .config, if it helps.
 
 It's really curious - I tried your .config and commands, and still
 could not trigger the high iowait. I'm running 64bit Intel Core 2,
 and kernel 2.6.24-rc1-git6 with the above patch.

Curious but 100% reproducible, at least on my box.  What I'm going to
try is booting into the kernel with your patch and just doing the find
/ md5sum.  It would be really interesting if the read-only access
triggers it.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Fri, Nov 02, 2007 at 09:33:21AM +0800, Fengguang Wu wrote:
> > I will try that with a USB disk - I hope that won't make a difference.
> 
> Thank you. I guess a reiserfs on loop file would also be OK.
> 
> > > btw, what's the exact kernel version you are running?
> > 
> > I noticed it with the kernel in the $SUBJECT, as reported by 'git
> > describe'.  I have pulled in new changesets since then.
> 
> And with the following patch applied?
> 
> ---
>  fs/reiserfs/stree.c |3 ---
>  1 file changed, 3 deletions(-)
> 
> --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
>   }
>   bh = next;
>   } while (bh != head);
> - if (PAGE_SIZE == bh->b_size) {
> - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> - }
>   }
>   }
>  }

... and with the above patch applied.

Copying 300 MB from root (ext3) to the new file system did not trigger
the pdflush condition.  But then I did a
   cd $MOUNTPOINT && find . -exec md5sum {} \;
and that brought one cpu to 75% iowait.

I have attached my .config, if it helps.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc1-7-fw2
# Wed Oct 31 07:27:14 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=19
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y

Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Thu, Nov 01, 2007 at 09:03:33PM +0800, Fengguang Wu wrote:
> Or will the system or fs size/age make any difference? If you happen
> to have a spare/swap partition, could you make a new reiserfs and
> mount it and copy several less-than-4KB files into it and wait for 30s
> and see what happen to pdflush?

I will try that with a USB disk - I hope that won't make a difference.

> btw, what's the exact kernel version you are running?

I noticed it with the kernel in the $SUBJECT, as reported by 'git
describe'.  I have pulled in new changesets since then.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Thu, Nov 01, 2007 at 03:15:32PM +0800, Fengguang Wu wrote:
> On Wed, Oct 31, 2007 at 12:53:18PM -0500, Florin Iucha wrote:
> > This patch does not fix anything for me.  Even such light use of the
> > reiserfs filesystem as pulling the linux-2.6 git tree updates caused
> > one CPU to go to 75% iowait.
>  
> Thank you, Florin. Could you provide more details about sda7, such as
> the mount option and output of `reiserfstune /dev/sda7`? I'll try to
> reproduce it before asking for your help.

Fengguang,

[EMAIL PROTECTED]:~# mount | grep sda7
/dev/sda7 on /scratch type reiserfs (rw,noatime)
[EMAIL PROTECTED]:~# df -h /scratch/
FilesystemSize  Used Avail Use% Mounted on
/dev/sda7  38G   32G  5.7G  85% /scratch
[EMAIL PROTECTED]:~# umount /dev/sda7
[EMAIL PROTECTED]:~# reiserfstune /dev/sda7
reiserfstune: Journal device has not been specified. Assuming journal is on the 
main device (/dev/sda7).

Current parameters:

Filesystem state: consistent

/scratch: Reiserfs super block in block 16 on 0x807 of format 3.6 with standard 
journal
Count of blocks on the device: 9765504
Number of bitmaps: 299
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 
1471399
Root block: 2359332
Filesystem is clean
Tree height: 5
Hash function used to sort names: "r5"
Objectid map size 916, max 972
Journal parameters:
Device [0x0]
Magic [0x31037e64]
Size 8193 blocks (including 1 for journal header) (first block 18)
Max transaction length 1024 blocks
Max batch size 900 blocks
Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x0:
sb_version: 2
inode generation number: 5856766
UUID: a0191e80-be6e-47f6-8fd0-047e2d763a4a
LABEL: /scratch
Set flags in SB:
ATTRIBUTES CLEAN

And for bonus points:
###
reiserfsck --check started at Thu Nov  1 07:09:56 2007
###
Replaying journal..
Reiserfs journal '/dev/sda7' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..finished   
Comparing bitmaps..finished
Checking Semantic tree:
finished   
No corruptions found
There are on the filesystem:
Leaves 247231
Internal nodes 1570
Directories 11330
Other files 722878
Data block pointers 8040675 (3880 of them are zero)
Safe links 0
###
reiserfsck finished at Thu Nov  1 07:18:43 2007
###

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Thu, Nov 01, 2007 at 03:15:32PM +0800, Fengguang Wu wrote:
 On Wed, Oct 31, 2007 at 12:53:18PM -0500, Florin Iucha wrote:
  This patch does not fix anything for me.  Even such light use of the
  reiserfs filesystem as pulling the linux-2.6 git tree updates caused
  one CPU to go to 75% iowait.
  
 Thank you, Florin. Could you provide more details about sda7, such as
 the mount option and output of `reiserfstune /dev/sda7`? I'll try to
 reproduce it before asking for your help.

Fengguang,

[EMAIL PROTECTED]:~# mount | grep sda7
/dev/sda7 on /scratch type reiserfs (rw,noatime)
[EMAIL PROTECTED]:~# df -h /scratch/
FilesystemSize  Used Avail Use% Mounted on
/dev/sda7  38G   32G  5.7G  85% /scratch
[EMAIL PROTECTED]:~# umount /dev/sda7
[EMAIL PROTECTED]:~# reiserfstune /dev/sda7
reiserfstune: Journal device has not been specified. Assuming journal is on the 
main device (/dev/sda7).

Current parameters:

Filesystem state: consistent

/scratch: Reiserfs super block in block 16 on 0x807 of format 3.6 with standard 
journal
Count of blocks on the device: 9765504
Number of bitmaps: 299
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 
1471399
Root block: 2359332
Filesystem is clean
Tree height: 5
Hash function used to sort names: r5
Objectid map size 916, max 972
Journal parameters:
Device [0x0]
Magic [0x31037e64]
Size 8193 blocks (including 1 for journal header) (first block 18)
Max transaction length 1024 blocks
Max batch size 900 blocks
Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x0:
sb_version: 2
inode generation number: 5856766
UUID: a0191e80-be6e-47f6-8fd0-047e2d763a4a
LABEL: /scratch
Set flags in SB:
ATTRIBUTES CLEAN

And for bonus points:
###
reiserfsck --check started at Thu Nov  1 07:09:56 2007
###
Replaying journal..
Reiserfs journal '/dev/sda7' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..finished   
Comparing bitmaps..finished
Checking Semantic tree:
finished   
No corruptions found
There are on the filesystem:
Leaves 247231
Internal nodes 1570
Directories 11330
Other files 722878
Data block pointers 8040675 (3880 of them are zero)
Safe links 0
###
reiserfsck finished at Thu Nov  1 07:18:43 2007
###

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Thu, Nov 01, 2007 at 09:03:33PM +0800, Fengguang Wu wrote:
 Or will the system or fs size/age make any difference? If you happen
 to have a spare/swap partition, could you make a new reiserfs and
 mount it and copy several less-than-4KB files into it and wait for 30s
 and see what happen to pdflush?

I will try that with a USB disk - I hope that won't make a difference.

 btw, what's the exact kernel version you are running?

I noticed it with the kernel in the $SUBJECT, as reported by 'git
describe'.  I have pulled in new changesets since then.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Fri, Nov 02, 2007 at 09:33:21AM +0800, Fengguang Wu wrote:
  I will try that with a USB disk - I hope that won't make a difference.
 
 Thank you. I guess a reiserfs on loop file would also be OK.
 
   btw, what's the exact kernel version you are running?
  
  I noticed it with the kernel in the $SUBJECT, as reported by 'git
  describe'.  I have pulled in new changesets since then.
 
 And with the following patch applied?
 
 ---
  fs/reiserfs/stree.c |3 ---
  1 file changed, 3 deletions(-)
 
 --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
 +++ linux-2.6.24-git17/fs/reiserfs/stree.c
 @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
   }
   bh = next;
   } while (bh != head);
 - if (PAGE_SIZE == bh-b_size) {
 - cancel_dirty_page(page, PAGE_CACHE_SIZE);
 - }
   }
   }
  }

... and with the above patch applied.

Copying 300 MB from root (ext3) to the new file system did not trigger
the pdflush condition.  But then I did a
   cd $MOUNTPOINT  find . -exec md5sum {} \;
and that brought one cpu to 75% iowait.

I have attached my .config, if it helps.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc1-7-fw2
# Wed Oct 31 07:27:14 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=19
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED=cfq

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# 

Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-31 Thread Florin Iucha
On Wed, Oct 31, 2007 at 07:16:06AM -0500, Florin Iucha wrote:
> On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote:
> > On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote:
> > > On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
> > > > I have added the patches and started a linux kernel compilation, and
> > > > something really interesting happens.  I run the build with the
> > > > equivalent of "make -j3" and in a separate console I am watching the
> > > > build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
> > > > output in the build console with "Ctrl-S", one core goes to idle,
> > > > while the other is in 50% waiting, then goes to 75% waiting.  When I
> > > > resume the build with "Ctrl-Q", the build starts to use both CPUs at
> > > > 98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
> > > > just logged on the console.  Also, the CPU that is in 75% waiting
> > > > state changes occasionally.  'Top' shows pdflush in D state, using
> > > > 5-6% of CPU.
> > > 
> > > I forgot the traces:
> > > 
> > >http://iucha.net/2.6.24-rc1/fw.1.gz
> > >http://iucha.net/2.6.24-rc1/fw.2.gz
> > >http://iucha.net/2.6.24-rc1/fw.3.gz
> > 
> > Sorry for the delay - I've been fixing our server today.
> > 
> > [  263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
> > global 4593 0 0 wc _M tw 1024 sk 0
> > [  263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
> > [  263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
> > [  263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
> > [  263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
> > [  263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
> > [  263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
> > [  263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
> > [  263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
> > [  263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
> > [  263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
> > global 4593 0 0 wc _M tw 1024 sk 0
> > [  263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
> > [  263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
> > [  263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
> > [  263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
> > [  263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
> > [  263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
> > [  263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
> > [  263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
> > [  263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
> > [  263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
> > global 4593 0 0 wc _M tw 1024 sk 0
> > 
> > It's about sda7, not NFSv4.
> > 
> > Is it a Reiserfs? We have a fresh fix for it: 
> > http://lkml.org/lkml/2007/10/23/93
> 
> Yes, it is a Reiserfs.  Incidentally it is the partition that holds
> the kernel sources and build directory.  The message states that the
> same bug exists in 2.6.23 but I do not see the same behavior in
> 2.6.23.  Anyway, I will apply the patch and see what I get.

Fengguang,

This patch does not fix anything for me.  Even such light use of the
reiserfs filesystem as pulling the linux-2.6 git tree updates caused
one CPU to go to 75% iowait.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-31 Thread Florin Iucha
On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote:
> On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote:
> > On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
> > > I have added the patches and started a linux kernel compilation, and
> > > something really interesting happens.  I run the build with the
> > > equivalent of "make -j3" and in a separate console I am watching the
> > > build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
> > > output in the build console with "Ctrl-S", one core goes to idle,
> > > while the other is in 50% waiting, then goes to 75% waiting.  When I
> > > resume the build with "Ctrl-Q", the build starts to use both CPUs at
> > > 98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
> > > just logged on the console.  Also, the CPU that is in 75% waiting
> > > state changes occasionally.  'Top' shows pdflush in D state, using
> > > 5-6% of CPU.
> > 
> > I forgot the traces:
> > 
> >http://iucha.net/2.6.24-rc1/fw.1.gz
> >http://iucha.net/2.6.24-rc1/fw.2.gz
> >http://iucha.net/2.6.24-rc1/fw.3.gz
> 
> Sorry for the delay - I've been fixing our server today.
> 
> [  263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
> 4593 0 0 wc _M tw 1024 sk 0
> [  263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
> [  263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
> [  263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
> [  263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
> [  263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
> [  263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
> [  263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
> [  263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
> [  263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
> [  263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
> 4593 0 0 wc _M tw 1024 sk 0
> [  263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
> [  263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
> [  263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
> [  263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
> [  263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
> [  263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
> [  263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
> [  263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
> [  263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
> [  263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
> 4593 0 0 wc _M tw 1024 sk 0
> 
> It's about sda7, not NFSv4.
> 
> Is it a Reiserfs? We have a fresh fix for it: 
> http://lkml.org/lkml/2007/10/23/93

Yes, it is a Reiserfs.  Incidentally it is the partition that holds
the kernel sources and build directory.  The message states that the
same bug exists in 2.6.23 but I do not see the same behavior in
2.6.23.  Anyway, I will apply the patch and see what I get.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-31 Thread Florin Iucha
On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote:
 On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote:
  On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
   I have added the patches and started a linux kernel compilation, and
   something really interesting happens.  I run the build with the
   equivalent of make -j3 and in a separate console I am watching the
   build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
   output in the build console with Ctrl-S, one core goes to idle,
   while the other is in 50% waiting, then goes to 75% waiting.  When I
   resume the build with Ctrl-Q, the build starts to use both CPUs at
   98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
   just logged on the console.  Also, the CPU that is in 75% waiting
   state changes occasionally.  'Top' shows pdflush in D state, using
   5-6% of CPU.
  
  I forgot the traces:
  
 http://iucha.net/2.6.24-rc1/fw.1.gz
 http://iucha.net/2.6.24-rc1/fw.2.gz
 http://iucha.net/2.6.24-rc1/fw.3.gz
 
 Sorry for the delay - I've been fixing our server today.
 
 [  263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
 4593 0 0 wc _M tw 1024 sk 0
 [  263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
 [  263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
 [  263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
 [  263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
 [  263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
 [  263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
 [  263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
 [  263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
 [  263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
 [  263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
 4593 0 0 wc _M tw 1024 sk 0
 [  263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
 [  263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
 [  263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
 [  263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
 [  263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
 [  263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
 [  263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
 [  263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
 [  263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
 [  263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 global 
 4593 0 0 wc _M tw 1024 sk 0
 
 It's about sda7, not NFSv4.
 
 Is it a Reiserfs? We have a fresh fix for it: 
 http://lkml.org/lkml/2007/10/23/93

Yes, it is a Reiserfs.  Incidentally it is the partition that holds
the kernel sources and build directory.  The message states that the
same bug exists in 2.6.23 but I do not see the same behavior in
2.6.23.  Anyway, I will apply the patch and see what I get.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-31 Thread Florin Iucha
On Wed, Oct 31, 2007 at 07:16:06AM -0500, Florin Iucha wrote:
 On Wed, Oct 31, 2007 at 02:53:25PM +0800, Fengguang Wu wrote:
  On Tue, Oct 30, 2007 at 10:52:45PM -0500, Florin Iucha wrote:
   On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
I have added the patches and started a linux kernel compilation, and
something really interesting happens.  I run the build with the
equivalent of make -j3 and in a separate console I am watching the
build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
output in the build console with Ctrl-S, one core goes to idle,
while the other is in 50% waiting, then goes to 75% waiting.  When I
resume the build with Ctrl-Q, the build starts to use both CPUs at
98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
just logged on the console.  Also, the CPU that is in 75% waiting
state changes occasionally.  'Top' shows pdflush in D state, using
5-6% of CPU.
   
   I forgot the traces:
   
  http://iucha.net/2.6.24-rc1/fw.1.gz
  http://iucha.net/2.6.24-rc1/fw.2.gz
  http://iucha.net/2.6.24-rc1/fw.3.gz
  
  Sorry for the delay - I've been fixing our server today.
  
  [  263.685691] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
  global 4593 0 0 wc _M tw 1024 sk 0
  [  263.789648] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
  [  263.789656] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
  [  263.789660] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
  [  263.789664] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
  [  263.789667] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
  [  263.789671] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
  [  263.789674] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
  [  263.789678] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
  [  263.789681] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
  [  263.789690] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
  global 4593 0 0 wc _M tw 1024 sk 0
  [  263.890184] requeue_io 301: inode 4031199 size 562 at 08:07(sda7)
  [  263.890191] requeue_io 301: inode 4031231 size 329 at 08:07(sda7)
  [  263.890195] requeue_io 301: inode 4031255 size 177 at 08:07(sda7)
  [  263.890198] requeue_io 301: inode 4031268 size 94 at 08:07(sda7)
  [  263.890202] requeue_io 301: inode 4031329 size 88 at 08:07(sda7)
  [  263.890205] requeue_io 301: inode 4031351 size 74 at 08:07(sda7)
  [  263.890208] requeue_io 301: inode 4031408 size 175 at 08:07(sda7)
  [  263.890212] requeue_io 301: inode 4031413 size 129 at 08:07(sda7)
  [  263.890215] requeue_io 301: inode 4031415 size 391 at 08:07(sda7)
  [  263.890223] mm/page-writeback.c 655 wb_kupdate: pdflush(248) 24235 
  global 4593 0 0 wc _M tw 1024 sk 0
  
  It's about sda7, not NFSv4.
  
  Is it a Reiserfs? We have a fresh fix for it: 
  http://lkml.org/lkml/2007/10/23/93
 
 Yes, it is a Reiserfs.  Incidentally it is the partition that holds
 the kernel sources and build directory.  The message states that the
 same bug exists in 2.6.23 but I do not see the same behavior in
 2.6.23.  Anyway, I will apply the patch and see what I get.

Fengguang,

This patch does not fix anything for me.  Even such light use of the
reiserfs filesystem as pulling the linux-2.6 git tree updates caused
one CPU to go to 75% iowait.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
> I have added the patches and started a linux kernel compilation, and
> something really interesting happens.  I run the build with the
> equivalent of "make -j3" and in a separate console I am watching the
> build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
> output in the build console with "Ctrl-S", one core goes to idle,
> while the other is in 50% waiting, then goes to 75% waiting.  When I
> resume the build with "Ctrl-Q", the build starts to use both CPUs at
> 98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
> just logged on the console.  Also, the CPU that is in 75% waiting
> state changes occasionally.  'Top' shows pdflush in D state, using
> 5-6% of CPU.

I forgot the traces:

   http://iucha.net/2.6.24-rc1/fw.1.gz
   http://iucha.net/2.6.24-rc1/fw.2.gz
   http://iucha.net/2.6.24-rc1/fw.3.gz

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote:
> > > It could be triggered by the more aggressive writeback behavior - the
> > > new code will keep on retrying as long as there are dirty inodes pending.
> > > 
> > > Florin, would you try the attached patches against 2.6.24-git?
> > > They may generate big traffic of printk messages, but will help
> > > debug the problem.
> > 
> > I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
> > for two hours, I left the computer idle overnight.  This morning,
> > pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
> > patches today.
> 
> Thank you.
> 
> Be sure to stop klogd ;-)
> A `dmesg` will be sufficient for debugging.

I have added the patches and started a linux kernel compilation, and
something really interesting happens.  I run the build with the
equivalent of "make -j3" and in a separate console I am watching the
build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
output in the build console with "Ctrl-S", one core goes to idle,
while the other is in 50% waiting, then goes to 75% waiting.  When I
resume the build with "Ctrl-Q", the build starts to use both CPUs at
98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
just logged on the console.  Also, the CPU that is in 75% waiting
state changes occasionally.  'Top' shows pdflush in D state, using
5-6% of CPU.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote:
> On Tue, Oct 30, 2007 at 06:42:50AM -0500, Florin Iucha wrote:
> > On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote:
> > > It could be triggered by the more aggressive writeback behavior - the
> > > new code will keep on retrying as long as there are dirty inodes pending.
> > > 
> > > Florin, would you try the attached patches against 2.6.24-git?
> > > They may generate big traffic of printk messages, but will help
> > > debug the problem.
> > 
> > I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
> > for two hours, I left the computer idle overnight.  This morning,
> > pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
> > patches today.
> 
> Thank you.
> 
> Be sure to stop klogd ;-)
> A `dmesg` will be sufficient for debugging.
> 
> btw, I cannot reproduce it with simple NFSv3 tests.

This is using nfsv4.  And being idle in a GNOME session with home
mounted from the server is hardly a stressful experience.  But what do
I know ;)

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote:
> On Sun, Oct 28, 2007 at 10:24:29AM -0500, Florin Iucha wrote:
> [...]
> >[ 3687.824468] 
> >[ 3687.824470] pdflush   D 805787c0 0   248  2
> >[ 3687.824473]  810006001d90 0046  
> > 0286
> >[ 3687.824476]  8100057fc770 810003062000 8100057fc978 
> > 000106001da0
> >[ 3687.824480]  0003 8023b1b2  
> > 
> >[ 3687.824483] Call Trace:
> >[ 3687.824488]  [] __mod_timer+0xb8/0xca
> >[ 3687.824492]  [] schedule_timeout+0x8d/0xb4
> >[ 3687.824496]  [] process_timeout+0x0/0xb
> >[ 3687.824499]  [] io_schedule_timeout+0x28/0x33
> >[ 3687.824503]  [] congestion_wait+0x6b/0x87
> >[ 3687.824506]  [] autoremove_wake_function+0x0/0x38
> >[ 3687.824510]  [] writeback_inodes+0xcd/0xd5
> >[ 3687.824514]  [] wb_kupdate+0xbb/0x10d
> >[ 3687.824518]  [] pdflush+0x0/0x1c3
> >[ 3687.824520]  [] pdflush+0x118/0x1c3
> >[ 3687.824523]  [] wb_kupdate+0x0/0x10d
> >[ 3687.824527]  [] kthread+0x49/0x77
> >[ 3687.824530]  [] child_rip+0xa/0x12
> >[ 3687.824535]  [] kthread+0x0/0x77
> >[ 3687.824538]  [] child_rip+0x0/0x12
> >[ 3687.824540] 
> > 
> > What could cause this?  I use NFS4 to automount the home directories
> > from a Solaris10 server, and this box found a few bugs in the NFS4
> > code (fixed in the 2.6.22 kernel).
> > 
> > I'll try running with 2.6.23 again for a few days, to see if I get the
> > pdflush stuck.  Any other ideas?
> 
> It could be triggered by the more aggressive writeback behavior - the
> new code will keep on retrying as long as there are dirty inodes pending.
> 
> Florin, would you try the attached patches against 2.6.24-git?
> They may generate big traffic of printk messages, but will help
> debug the problem.

I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
for two hours, I left the computer idle overnight.  This morning,
pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
patches today.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote:
 On Tue, Oct 30, 2007 at 06:42:50AM -0500, Florin Iucha wrote:
  On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote:
   It could be triggered by the more aggressive writeback behavior - the
   new code will keep on retrying as long as there are dirty inodes pending.
   
   Florin, would you try the attached patches against 2.6.24-git?
   They may generate big traffic of printk messages, but will help
   debug the problem.
  
  I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
  for two hours, I left the computer idle overnight.  This morning,
  pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
  patches today.
 
 Thank you.
 
 Be sure to stop klogd ;-)
 A `dmesg` will be sufficient for debugging.
 
 btw, I cannot reproduce it with simple NFSv3 tests.

This is using nfsv4.  And being idle in a GNOME session with home
mounted from the server is hardly a stressful experience.  But what do
I know ;)

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 03:54:03PM +0800, Fengguang Wu wrote:
 On Sun, Oct 28, 2007 at 10:24:29AM -0500, Florin Iucha wrote:
 [...]
 [ 3687.824468] 
 [ 3687.824470] pdflush   D 805787c0 0   248  2
 [ 3687.824473]  810006001d90 0046  
  0286
 [ 3687.824476]  8100057fc770 810003062000 8100057fc978 
  000106001da0
 [ 3687.824480]  0003 8023b1b2  
  
 [ 3687.824483] Call Trace:
 [ 3687.824488]  [8023b1b2] __mod_timer+0xb8/0xca
 [ 3687.824492]  [8055c87a] schedule_timeout+0x8d/0xb4
 [ 3687.824496]  [8023ad6c] process_timeout+0x0/0xb
 [ 3687.824499]  [8055c79a] io_schedule_timeout+0x28/0x33
 [ 3687.824503]  [8026bb24] congestion_wait+0x6b/0x87
 [ 3687.824506]  [80245983] autoremove_wake_function+0x0/0x38
 [ 3687.824510]  [8029e684] writeback_inodes+0xcd/0xd5
 [ 3687.824514]  [80266dc4] wb_kupdate+0xbb/0x10d
 [ 3687.824518]  [80267165] pdflush+0x0/0x1c3
 [ 3687.824520]  [8026727d] pdflush+0x118/0x1c3
 [ 3687.824523]  [80266d09] wb_kupdate+0x0/0x10d
 [ 3687.824527]  [80245876] kthread+0x49/0x77
 [ 3687.824530]  [8020c598] child_rip+0xa/0x12
 [ 3687.824535]  [8024582d] kthread+0x0/0x77
 [ 3687.824538]  [8020c58e] child_rip+0x0/0x12
 [ 3687.824540] 
  
  What could cause this?  I use NFS4 to automount the home directories
  from a Solaris10 server, and this box found a few bugs in the NFS4
  code (fixed in the 2.6.22 kernel).
  
  I'll try running with 2.6.23 again for a few days, to see if I get the
  pdflush stuck.  Any other ideas?
 
 It could be triggered by the more aggressive writeback behavior - the
 new code will keep on retrying as long as there are dirty inodes pending.
 
 Florin, would you try the attached patches against 2.6.24-git?
 They may generate big traffic of printk messages, but will help
 debug the problem.

I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
for two hours, I left the computer idle overnight.  This morning,
pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
patches today.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:49:41PM +0800, Fengguang Wu wrote:
   It could be triggered by the more aggressive writeback behavior - the
   new code will keep on retrying as long as there are dirty inodes pending.
   
   Florin, would you try the attached patches against 2.6.24-git?
   They may generate big traffic of printk messages, but will help
   debug the problem.
  
  I have updated to v2.6.24-rc1-334-g82798a1.  After using my computer
  for two hours, I left the computer idle overnight.  This morning,
  pdflushd is again consuming 25% of a CPU.  I will try Fengguang's
  patches today.
 
 Thank you.
 
 Be sure to stop klogd ;-)
 A `dmesg` will be sufficient for debugging.

I have added the patches and started a linux kernel compilation, and
something really interesting happens.  I run the build with the
equivalent of make -j3 and in a separate console I am watching the
build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
output in the build console with Ctrl-S, one core goes to idle,
while the other is in 50% waiting, then goes to 75% waiting.  When I
resume the build with Ctrl-Q, the build starts to use both CPUs at
98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
just logged on the console.  Also, the CPU that is in 75% waiting
state changes occasionally.  'Top' shows pdflush in D state, using
5-6% of CPU.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-30 Thread Florin Iucha
On Tue, Oct 30, 2007 at 07:02:42PM -0500, Florin Iucha wrote:
 I have added the patches and started a linux kernel compilation, and
 something really interesting happens.  I run the build with the
 equivalent of make -j3 and in a separate console I am watching the
 build with 'top'.  The build consumes 98% of both CPUs.  If I stop the
 output in the build console with Ctrl-S, one core goes to idle,
 while the other is in 50% waiting, then goes to 75% waiting.  When I
 resume the build with Ctrl-Q, the build starts to use both CPUs at
 98-99%.  The NFS4 use was minimal, as I did not login with Gnome, but
 just logged on the console.  Also, the CPU that is in 75% waiting
 state changes occasionally.  'Top' shows pdflush in D state, using
 5-6% of CPU.

I forgot the traces:

   http://iucha.net/2.6.24-rc1/fw.1.gz
   http://iucha.net/2.6.24-rc1/fw.2.gz
   http://iucha.net/2.6.24-rc1/fw.3.gz

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-29 Thread Florin Iucha
On Mon, Oct 29, 2007 at 02:43:32PM -0400, Trond Myklebust wrote:
> 
> On Mon, 2007-10-29 at 10:01 -0500, Florin Iucha wrote:
> > On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote:
> > > > What could cause this?  I use NFS4 to automount the home directories
> > > > from a Solaris10 server, and this box found a few bugs in the NFS4
> > > > code (fixed in the 2.6.22 kernel).
> > > > 
> > > > I'll try running with 2.6.23 again for a few days, to see if I get the
> > > > pdflush stuck.  Any other ideas?
> > > 
> > > One of them appears to be waiting for i/o congestion to clear up. If the
> > > filesystem is NFS, then that means that some other thread is busy
> > > writing data out to the server. You'll need to look at the rest of the
> > > thread dump to figure out which thread is writing the data out, and
> > > where it is getting stuck.
> > 
> > Trond,
> > 
> > The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz
> 
> I can't see any evidence of NFS traffic at all in those traces, but
> there is a fuse process that is sleeping in :fuse:fuse_dev_read(). Could
> that perhaps be relevant.

That might be the overzealous Ubuntu trying to make the NTFS partition
available.  I will try to disable it and see if I can reproduce the
hang.  BTW: With 2.6.24-rc1+ it happens after a couple of hours.  With
2.6.23 it did not happen after 6 hours or so.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-29 Thread Florin Iucha
On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote:
> > What could cause this?  I use NFS4 to automount the home directories
> > from a Solaris10 server, and this box found a few bugs in the NFS4
> > code (fixed in the 2.6.22 kernel).
> > 
> > I'll try running with 2.6.23 again for a few days, to see if I get the
> > pdflush stuck.  Any other ideas?
> 
> One of them appears to be waiting for i/o congestion to clear up. If the
> filesystem is NFS, then that means that some other thread is busy
> writing data out to the server. You'll need to look at the rest of the
> thread dump to figure out which thread is writing the data out, and
> where it is getting stuck.

Trond,

The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-29 Thread Florin Iucha
On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote:
  What could cause this?  I use NFS4 to automount the home directories
  from a Solaris10 server, and this box found a few bugs in the NFS4
  code (fixed in the 2.6.22 kernel).
  
  I'll try running with 2.6.23 again for a few days, to see if I get the
  pdflush stuck.  Any other ideas?
 
 One of them appears to be waiting for i/o congestion to clear up. If the
 filesystem is NFS, then that means that some other thread is busy
 writing data out to the server. You'll need to look at the rest of the
 thread dump to figure out which thread is writing the data out, and
 where it is getting stuck.

Trond,

The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-29 Thread Florin Iucha
On Mon, Oct 29, 2007 at 02:43:32PM -0400, Trond Myklebust wrote:
 
 On Mon, 2007-10-29 at 10:01 -0500, Florin Iucha wrote:
  On Mon, Oct 29, 2007 at 09:46:59AM -0400, Trond Myklebust wrote:
What could cause this?  I use NFS4 to automount the home directories
from a Solaris10 server, and this box found a few bugs in the NFS4
code (fixed in the 2.6.22 kernel).

I'll try running with 2.6.23 again for a few days, to see if I get the
pdflush stuck.  Any other ideas?
   
   One of them appears to be waiting for i/o congestion to clear up. If the
   filesystem is NFS, then that means that some other thread is busy
   writing data out to the server. You'll need to look at the rest of the
   thread dump to figure out which thread is writing the data out, and
   where it is getting stuck.
  
  Trond,
  
  The full dmesg is at http://iucha.net/2.6.24-rc1/dmesg.stuck_pdflush.gz
 
 I can't see any evidence of NFS traffic at all in those traces, but
 there is a fuse process that is sleeping in :fuse:fuse_dev_read(). Could
 that perhaps be relevant.

That might be the overzealous Ubuntu trying to make the NTFS partition
available.  I will try to disable it and see if I can reproduce the
hang.  BTW: With 2.6.24-rc1+ it happens after a couple of hours.  With
2.6.23 it did not happen after 6 hours or so.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-28 Thread Florin Iucha
Hello,

For a week or two I started noticing that some time after I'm logged
in, my keyboard input becomes a bit staggering, there is a small delay
between the keypress and the actual character appearing in the
terminal.  This is on a AMD Athlon x2 4200+ with 2 GB RAM and just a
gnome-terminal open.  The machine is as idle as possible - monitored
via the system monitor applet.  I could not get any hard data on it,
until now.

After I logged off from GNOME, I switched to the text console and ran
top, with the option of showing one CPU stats line for each CPU.  Lo
and behold, one core is 100% idle, and the other one is 25% idle and
75% waiting.  Periodically, a pdflush process in 'D' state raises to
the top.  I did a 'echo t > /proc/sysrequest-trigger' and this is what
is says for the two pdflush processes:

   [ 3687.824424] pdflush   S 8100057ffef8 0   247  2
   [ 3687.824427]  8100057ffed0 0046 8100057ffe70 
8022a96c
   [ 3687.824431]  8100057fc000 810003040770 8100057fc208 
0297
   [ 3687.824434]  8100057ffe90 810002c1ba10 8100057ffed0 
8022b9d2
   [ 3687.824438] Call Trace:
   [ 3687.824440]  [] enqueue_task_fair+0x21/0x34
   [ 3687.82]  [] set_user_nice+0x110/0x12c
   [ 3687.824448]  [] pdflush+0x0/0x1c3
   [ 3687.824451]  [] pdflush+0xcf/0x1c3
   [ 3687.824455]  [] kthread+0x49/0x77
   [ 3687.824458]  [] child_rip+0xa/0x12
   [ 3687.824463]  [] kthread+0x0/0x77
   [ 3687.824466]  [] child_rip+0x0/0x12
   [ 3687.824468] 
   [ 3687.824470] pdflush   D 805787c0 0   248  2
   [ 3687.824473]  810006001d90 0046  
0286
   [ 3687.824476]  8100057fc770 810003062000 8100057fc978 
000106001da0
   [ 3687.824480]  0003 8023b1b2  

   [ 3687.824483] Call Trace:
   [ 3687.824488]  [] __mod_timer+0xb8/0xca
   [ 3687.824492]  [] schedule_timeout+0x8d/0xb4
   [ 3687.824496]  [] process_timeout+0x0/0xb
   [ 3687.824499]  [] io_schedule_timeout+0x28/0x33
   [ 3687.824503]  [] congestion_wait+0x6b/0x87
   [ 3687.824506]  [] autoremove_wake_function+0x0/0x38
   [ 3687.824510]  [] writeback_inodes+0xcd/0xd5
   [ 3687.824514]  [] wb_kupdate+0xbb/0x10d
   [ 3687.824518]  [] pdflush+0x0/0x1c3
   [ 3687.824520]  [] pdflush+0x118/0x1c3
   [ 3687.824523]  [] wb_kupdate+0x0/0x10d
   [ 3687.824527]  [] kthread+0x49/0x77
   [ 3687.824530]  [] child_rip+0xa/0x12
   [ 3687.824535]  [] kthread+0x0/0x77
   [ 3687.824538]  [] child_rip+0x0/0x12
   [ 3687.824540] 

What could cause this?  I use NFS4 to automount the home directories
from a Solaris10 server, and this box found a few bugs in the NFS4
code (fixed in the 2.6.22 kernel).

I'll try running with 2.6.23 again for a few days, to see if I get the
pdflush stuck.  Any other ideas?

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-10-28 Thread Florin Iucha
Hello,

For a week or two I started noticing that some time after I'm logged
in, my keyboard input becomes a bit staggering, there is a small delay
between the keypress and the actual character appearing in the
terminal.  This is on a AMD Athlon x2 4200+ with 2 GB RAM and just a
gnome-terminal open.  The machine is as idle as possible - monitored
via the system monitor applet.  I could not get any hard data on it,
until now.

After I logged off from GNOME, I switched to the text console and ran
top, with the option of showing one CPU stats line for each CPU.  Lo
and behold, one core is 100% idle, and the other one is 25% idle and
75% waiting.  Periodically, a pdflush process in 'D' state raises to
the top.  I did a 'echo t  /proc/sysrequest-trigger' and this is what
is says for the two pdflush processes:

   [ 3687.824424] pdflush   S 8100057ffef8 0   247  2
   [ 3687.824427]  8100057ffed0 0046 8100057ffe70 
8022a96c
   [ 3687.824431]  8100057fc000 810003040770 8100057fc208 
0297
   [ 3687.824434]  8100057ffe90 810002c1ba10 8100057ffed0 
8022b9d2
   [ 3687.824438] Call Trace:
   [ 3687.824440]  [8022a96c] enqueue_task_fair+0x21/0x34
   [ 3687.82]  [8022b9d2] set_user_nice+0x110/0x12c
   [ 3687.824448]  [80267165] pdflush+0x0/0x1c3
   [ 3687.824451]  [80267234] pdflush+0xcf/0x1c3
   [ 3687.824455]  [80245876] kthread+0x49/0x77
   [ 3687.824458]  [8020c598] child_rip+0xa/0x12
   [ 3687.824463]  [8024582d] kthread+0x0/0x77
   [ 3687.824466]  [8020c58e] child_rip+0x0/0x12
   [ 3687.824468] 
   [ 3687.824470] pdflush   D 805787c0 0   248  2
   [ 3687.824473]  810006001d90 0046  
0286
   [ 3687.824476]  8100057fc770 810003062000 8100057fc978 
000106001da0
   [ 3687.824480]  0003 8023b1b2  

   [ 3687.824483] Call Trace:
   [ 3687.824488]  [8023b1b2] __mod_timer+0xb8/0xca
   [ 3687.824492]  [8055c87a] schedule_timeout+0x8d/0xb4
   [ 3687.824496]  [8023ad6c] process_timeout+0x0/0xb
   [ 3687.824499]  [8055c79a] io_schedule_timeout+0x28/0x33
   [ 3687.824503]  [8026bb24] congestion_wait+0x6b/0x87
   [ 3687.824506]  [80245983] autoremove_wake_function+0x0/0x38
   [ 3687.824510]  [8029e684] writeback_inodes+0xcd/0xd5
   [ 3687.824514]  [80266dc4] wb_kupdate+0xbb/0x10d
   [ 3687.824518]  [80267165] pdflush+0x0/0x1c3
   [ 3687.824520]  [8026727d] pdflush+0x118/0x1c3
   [ 3687.824523]  [80266d09] wb_kupdate+0x0/0x10d
   [ 3687.824527]  [80245876] kthread+0x49/0x77
   [ 3687.824530]  [8020c598] child_rip+0xa/0x12
   [ 3687.824535]  [8024582d] kthread+0x0/0x77
   [ 3687.824538]  [8020c58e] child_rip+0x0/0x12
   [ 3687.824540] 

What could cause this?  I use NFS4 to automount the home directories
from a Solaris10 server, and this box found a few bugs in the NFS4
code (fixed in the 2.6.22 kernel).

I'll try running with 2.6.23 again for a few days, to see if I get the
pdflush stuck.  Any other ideas?

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91

2007-10-23 Thread Florin Iucha
On Tue, Oct 23, 2007 at 07:46:37AM -0500, Florin Iucha wrote:
> [   60.656136] Unable to handle kernel NULL pointer dereference at 
>  RIP: 
> [   60.656143]  [] blk_rq_map_sg+0x10d/0x17c
> [   60.656151] PGD 4640067 PUD 46d4067 PMD 0 
> [   60.656154] Oops:  [1] SMP 
> [   60.656157] CPU 1 
> ...

There was a DVD in the drive.  After the OOPS, I cannot eject it
via the button, and the "eject" command is stuck in "D" state:

[  436.308282] eject D 80571760 0  5336   5324
[  436.308285]  810007c35d08 0082  
810007c35ca8
[  436.308288]  810006fb15f0 810003062000 810006fb17f8 
0001
[  436.308292]  0003 8100057e1070  

[  436.308295] Call Trace:
[  436.308301]  [] __mutex_lock_slowpath+0x133/0x23c
[  436.308306]  [] mutex_lock+0x19/0x1d
[  436.308309]  [] do_open+0x74/0x2d1
[  436.308313]  [] blkdev_open+0x0/0x69
[  436.308315]  [] blkdev_open+0x37/0x69
[  436.308319]  [] __dentry_open+0xe6/0x1bd
[  436.308323]  [] nameidata_to_filp+0x2d/0x3f
[  436.308326]  [] do_filp_open+0x39/0x4b
[  436.308330]  [] _spin_unlock+0x9/0xb
[  436.308333]  [] get_unused_fd_flags+0x113/0x121
[  436.308337]  [] do_sys_open+0x51/0xd9
[  436.308341]  [] sys_open+0x1b/0x1d
[  436.308343]  [] system_call+0x7e/0x83

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91

2007-10-23 Thread Florin Iucha
Jens,

This is freshly after booting into this morning's kernel:

[   60.656136] Unable to handle kernel NULL pointer dereference at 
 RIP: 
[   60.656143]  [] blk_rq_map_sg+0x10d/0x17c
[   60.656151] PGD 4640067 PUD 46d4067 PMD 0 
[   60.656154] Oops:  [1] SMP 
[   60.656157] CPU 1 
[   60.656159] Modules linked in: sbp2 lp dvb_pll lgdt330x cx88_dvb 
cx88_vp3054_i2c videobuf_dvb tuner tea5767 td
a8290 tuner_simple mt20xx cx88_alsa cx8802 cx8800 cx88xx ir_common tveeprom 
videobuf_dma_sg videobuf_core btcx_ri
sc i2c_nforce2 evdev rtc forcedeth ehci_hcd fuse
[   60.656176] Pid: 4250, comm: hald-probe-stor Not tainted 2.6.24-rc0-5 #1
[   60.656178] RIP: 0010:[]  [] 
blk_rq_map_sg+0x10d/0x17c
[   60.656182] RSP: 0018:810004791930  EFLAGS: 00010246
[   60.656184] RAX: 0403b000 RBX: 1000 RCX: 6db6db6db6db6db7
[   60.656187] RDX:  RSI: 81000100 RDI: 05701000
[   60.656189] RBP: 810004791968 R08: 0570 R09: 8100044aa060
[   60.656191] R10:  R11: 8100050dea00 R12: 2000
[   60.656193] R13: 8100060d2700 R14:  R15: 807f
[   60.656196] FS:  2b5da088e6e0() GS:810003011500() 
knlGS:
[   60.656198] CS:  0010 DS:  ES:  CR0: 8005003b
[   60.656200] CR2:  CR3: 04568000 CR4: 06e0
[   60.656202] DR0:  DR1:  DR2: 
[   60.656204] DR3:  DR6: 0ff0 DR7: 0400
[   60.656207] Process hald-probe-stor (pid: 4250, threadinfo 81000479, 
task 810006312000)
[   60.656208] Stack:  81000607a000 00010001 8100040fa120 
807fe2c0
[   60.656213]  81000607a000 81000607a000 807fe2c0 
8100047919a8
[   60.656217]  8041bb58 8100047919a8 8100040fa120 
807fe2c0
[   60.656220] Call Trace:
[   60.656226]  [] ide_map_sg+0x38/0xb0
[   60.656231]  [] cdrom_start_read_continuation+0x0/0xb5
[   60.656234]  [] ide_build_sglist+0x38/0x88
[   60.656238]  [] ide_build_dmatable+0x2f/0x172
[   60.656241]  [] ide_dma_setup+0x34/0xaa
[   60.656245]  [] cdrom_start_packet_command+0x5a/0x177
[   60.656249]  [] cfq_dispatch_insert+0x38/0x50
[   60.656253]  [] ide_do_rw_cdrom+0x423/0x57c
[   60.656257]  [] ide_do_request+0x7a7/0xa74
[   60.656263]  [] del_timer+0x52/0x5d
[   60.656267]  [] sync_page+0x0/0x45
[   60.656269]  [] do_ide_request+0x1b/0x1d
[   60.656273]  [] __generic_unplug_device+0x28/0x2c
[   60.656276]  [] generic_unplug_device+0x20/0x31
[   60.656279]  [] blk_backing_dev_unplug+0x16/0x18
[   60.656283]  [] block_sync_page+0x42/0x44
[   60.656285]  [] sync_page+0x3c/0x45
[   60.656290]  [] __wait_on_bit_lock+0x42/0x79
[   60.656294]  [] __lock_page+0x64/0x6b
[   60.656298]  [] wake_bit_function+0x0/0x2a
[   60.656301]  [] do_generic_mapping_read+0x1da/0x383
[   60.656304]  [] file_read_actor+0x0/0x137
[   60.656309]  [] generic_file_aio_read+0x11e/0x15d
[   60.656315]  [] do_sync_read+0xe2/0x126
[   60.656318]  [] handle_mm_fault+0x62e/0x65e
[   60.656324]  [] __up_read+0x8f/0x97
[   60.656327]  [] autoremove_wake_function+0x0/0x38
[   60.656331]  [] __mutex_lock_slowpath+0x22f/0x23c
[   60.656337]  [] vfs_read+0xab/0x134
[   60.656341]  [] sys_read+0x47/0x6f
[   60.656345]  [] system_call+0x7e/0x83
[   60.656349] 
[   60.656350] 
[   60.656350] Code: 49 8b 02 41 c7 42 18 00 00 00 00 49 c7 42 10 00 00 00 00 
83 
[   60.656359] RIP  [] blk_rq_map_sg+0x10d/0x17c
[   60.656362]  RSP 
[   60.656363] CR2: 

Platform is AMD64 and the userspace is Ubuntu 7/10 Gutsy Gibbon.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91

2007-10-23 Thread Florin Iucha
Jens,

This is freshly after booting into this morning's kernel:

[   60.656136] Unable to handle kernel NULL pointer dereference at 
 RIP: 
[   60.656143]  [80375553] blk_rq_map_sg+0x10d/0x17c
[   60.656151] PGD 4640067 PUD 46d4067 PMD 0 
[   60.656154] Oops:  [1] SMP 
[   60.656157] CPU 1 
[   60.656159] Modules linked in: sbp2 lp dvb_pll lgdt330x cx88_dvb 
cx88_vp3054_i2c videobuf_dvb tuner tea5767 td
a8290 tuner_simple mt20xx cx88_alsa cx8802 cx8800 cx88xx ir_common tveeprom 
videobuf_dma_sg videobuf_core btcx_ri
sc i2c_nforce2 evdev rtc forcedeth ehci_hcd fuse
[   60.656176] Pid: 4250, comm: hald-probe-stor Not tainted 2.6.24-rc0-5 #1
[   60.656178] RIP: 0010:[80375553]  [80375553] 
blk_rq_map_sg+0x10d/0x17c
[   60.656182] RSP: 0018:810004791930  EFLAGS: 00010246
[   60.656184] RAX: 0403b000 RBX: 1000 RCX: 6db6db6db6db6db7
[   60.656187] RDX:  RSI: 81000100 RDI: 05701000
[   60.656189] RBP: 810004791968 R08: 0570 R09: 8100044aa060
[   60.656191] R10:  R11: 8100050dea00 R12: 2000
[   60.656193] R13: 8100060d2700 R14:  R15: 807f
[   60.656196] FS:  2b5da088e6e0() GS:810003011500() 
knlGS:
[   60.656198] CS:  0010 DS:  ES:  CR0: 8005003b
[   60.656200] CR2:  CR3: 04568000 CR4: 06e0
[   60.656202] DR0:  DR1:  DR2: 
[   60.656204] DR3:  DR6: 0ff0 DR7: 0400
[   60.656207] Process hald-probe-stor (pid: 4250, threadinfo 81000479, 
task 810006312000)
[   60.656208] Stack:  81000607a000 00010001 8100040fa120 
807fe2c0
[   60.656213]  81000607a000 81000607a000 807fe2c0 
8100047919a8
[   60.656217]  8041bb58 8100047919a8 8100040fa120 
807fe2c0
[   60.656220] Call Trace:
[   60.656226]  [8041bb58] ide_map_sg+0x38/0xb0
[   60.656231]  [8042952b] cdrom_start_read_continuation+0x0/0xb5
[   60.656234]  [80423806] ide_build_sglist+0x38/0x88
[   60.656238]  [80423885] ide_build_dmatable+0x2f/0x172
[   60.656241]  [804239fc] ide_dma_setup+0x34/0xaa
[   60.656245]  [804277e5] cdrom_start_packet_command+0x5a/0x177
[   60.656249]  [8037fac4] cfq_dispatch_insert+0x38/0x50
[   60.656253]  [80428339] ide_do_rw_cdrom+0x423/0x57c
[   60.656257]  [8041c56c] ide_do_request+0x7a7/0xa74
[   60.656263]  [8023c097] del_timer+0x52/0x5d
[   60.656267]  [8025d343] sync_page+0x0/0x45
[   60.656269]  [8041cba0] do_ide_request+0x1b/0x1d
[   60.656273]  [803778a7] __generic_unplug_device+0x28/0x2c
[   60.656276]  [80377c6e] generic_unplug_device+0x20/0x31
[   60.656279]  [803751b1] blk_backing_dev_unplug+0x16/0x18
[   60.656283]  [8029decc] block_sync_page+0x42/0x44
[   60.656285]  [8025d37f] sync_page+0x3c/0x45
[   60.656290]  [805589b8] __wait_on_bit_lock+0x42/0x79
[   60.656294]  [8025d32f] __lock_page+0x64/0x6b
[   60.656298]  [8024664b] wake_bit_function+0x0/0x2a
[   60.656301]  [8025da95] do_generic_mapping_read+0x1da/0x383
[   60.656304]  [8025d08d] file_read_actor+0x0/0x137
[   60.656309]  [8025f1af] generic_file_aio_read+0x11e/0x15d
[   60.656315]  [8027ee59] do_sync_read+0xe2/0x126
[   60.656318]  [8026b15a] handle_mm_fault+0x62e/0x65e
[   60.656324]  [80386fcc] __up_read+0x8f/0x97
[   60.656327]  [80246613] autoremove_wake_function+0x0/0x38
[   60.656331]  [80559233] __mutex_lock_slowpath+0x22f/0x23c
[   60.656337]  [8027f5f0] vfs_read+0xab/0x134
[   60.656341]  [8027f9b5] sys_read+0x47/0x6f
[   60.656345]  [8020b77e] system_call+0x7e/0x83
[   60.656349] 
[   60.656350] 
[   60.656350] Code: 49 8b 02 41 c7 42 18 00 00 00 00 49 c7 42 10 00 00 00 00 
83 
[   60.656359] RIP  [80375553] blk_rq_map_sg+0x10d/0x17c
[   60.656362]  RSP 810004791930
[   60.656363] CR2: 

Platform is AMD64 and the userspace is Ubuntu 7/10 Gutsy Gibbon.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: kernel NULL pointer dereference in blk_rq_map_sg with v2.6.23-6815-g0895e91

2007-10-23 Thread Florin Iucha
On Tue, Oct 23, 2007 at 07:46:37AM -0500, Florin Iucha wrote:
 [   60.656136] Unable to handle kernel NULL pointer dereference at 
  RIP: 
 [   60.656143]  [80375553] blk_rq_map_sg+0x10d/0x17c
 [   60.656151] PGD 4640067 PUD 46d4067 PMD 0 
 [   60.656154] Oops:  [1] SMP 
 [   60.656157] CPU 1 
 ...

There was a DVD in the drive.  After the OOPS, I cannot eject it
via the button, and the eject command is stuck in D state:

[  436.308282] eject D 80571760 0  5336   5324
[  436.308285]  810007c35d08 0082  
810007c35ca8
[  436.308288]  810006fb15f0 810003062000 810006fb17f8 
0001
[  436.308292]  0003 8100057e1070  

[  436.308295] Call Trace:
[  436.308301]  [80559137] __mutex_lock_slowpath+0x133/0x23c
[  436.308306]  [80559259] mutex_lock+0x19/0x1d
[  436.308309]  [802a35a0] do_open+0x74/0x2d1
[  436.308313]  [802a3a02] blkdev_open+0x0/0x69
[  436.308315]  [802a3a39] blkdev_open+0x37/0x69
[  436.308319]  [8027d68e] __dentry_open+0xe6/0x1bd
[  436.308323]  [8027d7fd] nameidata_to_filp+0x2d/0x3f
[  436.308326]  [8027d848] do_filp_open+0x39/0x4b
[  436.308330]  [8055a16d] _spin_unlock+0x9/0xb
[  436.308333]  [8027d58d] get_unused_fd_flags+0x113/0x121
[  436.308337]  [8027d8ab] do_sys_open+0x51/0xd9
[  436.308341]  [8027d95c] sys_open+0x1b/0x1d
[  436.308343]  [8020b77e] system_call+0x7e/0x83

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-30 Thread Florin Iucha
On Thu, Aug 30, 2007 at 03:18:37PM -0700, Bret Towe wrote:
> > uptime of 3 hours and keyboard is still working fine
> > I'll hopefully get to test this on the mini tomorrow for at least 3 hours 
> > also
> 
> got 45min on mini before I had to go elsewhere
> the amd64 shutdown fine and has been up for more than 3 hours
> I'd say the patch does it

Yup.  Same here.  Many startups, shuthdowns and minutes of uptime,
with no observations.  Check it in!

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-30 Thread Florin Iucha
On Thu, Aug 30, 2007 at 03:18:37PM -0700, Bret Towe wrote:
  uptime of 3 hours and keyboard is still working fine
  I'll hopefully get to test this on the mini tomorrow for at least 3 hours 
  also
 
 got 45min on mini before I had to go elsewhere
 the amd64 shutdown fine and has been up for more than 3 hours
 I'd say the patch does it

Yup.  Same here.  Many startups, shuthdowns and minutes of uptime,
with no observations.  Check it in!

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-28 Thread Florin Iucha
On Tue, Aug 28, 2007 at 09:28:43AM -0400, Trond Myklebust wrote:
> Doh! I see the problem: cancel_delayed_work_sync() shouldn't ever be
> called recursively.
> 
> The following patch should be correct. Please just discard the previous
> one...

So far so good.  This patch got one hour uptime...  I'll stay with
this kernel for a few days, to keep an eye on it.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-28 Thread Florin Iucha
On Tue, Aug 28, 2007 at 09:28:43AM -0400, Trond Myklebust wrote:
 Doh! I see the problem: cancel_delayed_work_sync() shouldn't ever be
 called recursively.
 
 The following patch should be correct. Please just discard the previous
 one...

So far so good.  This patch got one hour uptime...  I'll stay with
this kernel for a few days, to keep an eye on it.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-27 Thread Florin Iucha
On Mon, Aug 27, 2007 at 06:19:29PM -0700, Bret Towe wrote:
> On 8/27/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > > > this sounds alot like the post i did yesterday titled 'nfs4 hang 
> > > > regression'
> > > > i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6
> > >
> > > Yes, it certainly does -- all the symptoms match!
> >
> > Could you and Bret please check if the attached patch fixes the hang?
>
> no good for me still hangs after ~30minutes

I just booted into the new kernel
(3d39c691ff486142dd9aaeac12f553f4476b7a6 + Trond's patch) and it hangs
in 10-15 minutes.

Process traces available at http://iucha.net/nfs/23-rc2-nfs-fix-1/kernel.log.gz

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again

2007-08-27 Thread Florin Iucha
On Mon, Aug 27, 2007 at 06:19:29PM -0700, Bret Towe wrote:
 On 8/27/07, Trond Myklebust [EMAIL PROTECTED] wrote:
this sounds alot like the post i did yesterday titled 'nfs4 hang 
regression'
i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6
  
   Yes, it certainly does -- all the symptoms match!
 
  Could you and Bret please check if the attached patch fixes the hang?

 no good for me still hangs after ~30minutes

I just booted into the new kernel
(3d39c691ff486142dd9aaeac12f553f4476b7a6 + Trond's patch) and it hangs
in 10-15 minutes.

Process traces available at http://iucha.net/nfs/23-rc2-nfs-fix-1/kernel.log.gz

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-23 Thread Florin Iucha
On Thu, Aug 23, 2007 at 10:14:38AM -0700, Bret Towe wrote:
> this sounds alot like the post i did yesterday titled 'nfs4 hang regression'
> i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6

Yes, it certainly does -- all the symptoms match!

I'm not [alone in] seeing dead keyboards!

Now, if only somebody could clarify to me the connection between
the bad NFS4 shooting the keyboard but not the mouse, that would
be wonderful.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-23 Thread Florin Iucha
Trond,

Fess up... I'm closing in:

   http://iucha.net/2.6.23-rc3/2.6.23-rc-bisect.png

[Dropping Jiri and linux-usb-devel from future postings.  You are
included now just for communicating the conclusion of this thread.]

On Wed, Aug 22, 2007 at 08:22:00AM -0500, Florin Iucha wrote:
> On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote:
> > On Tue, 21 Aug 2007, Florin Iucha wrote:
> > 
> > > There is another interesting angle to this: in the past, every time I 
> > > had keyboard problems, it used to be caused by the VFS and/or NFS... 
> > > after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, 
> > > Alan!). Now, after the keyboard "locked up", I used the mouse to close 
> > > the gnome session, then I logged-in remotely to reboot.  The reboot 
> > > process locked up and I need to use the reset button!  The second time 
> > > the keyboard "locked up" I listed my processes, and I noticed that I had 
> > > a couple of bash processes and a ssh process in "D" state. Something is 
> > > fishy again in the VFS ;)
> > 
> > Yes, there were some NFS updates in between -rc2 and 
> > 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious 
> > what are you going to find by bisect, please let us know.
> > 
> > I added Trond to CC, full thread to be found at 
> > http://lkml.org/lkml/2007/8/21/151 for reference.
> > 
> > Florin, it also might be useful to capture the states of stuck processess 
> > via alt-sysrq-T (or better by echo t > /proc/sysrq-trigger), so that we 
> > know better where are they stuck.
> 
> This morning it took a bit longer to hang, but it happened.  The
> backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz .
> 
> I'll try a bisect session this weekend.

[EMAIL PROTECTED] $ git bisect bad
Bisecting: 5 revisions left to test after this
[EMAIL PROTECTED] $ git bisect log
git-bisect start
# good: [d4ac2477fad0f2680e84ec12e387ce67682c5c13] Linux 2.6.23-rc2
git-bisect good d4ac2477fad0f2680e84ec12e387ce67682c5c13
# bad: [28e8351ac22de25034e048c680014ad824323c65] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
git-bisect bad 28e8351ac22de25034e048c680014ad824323c65
# bad: [8f2ea1fd3f97ab7a809e939b5b9005a16f862439] [POWERPC] Fix initialization 
and usage of dma_mask
git-bisect bad 8f2ea1fd3f97ab7a809e939b5b9005a16f862439
# good: [ff95f3df54609d9d4b9572f8a67d09922a645043] sched: remove the 'u64 now' 
parameter from pick_next_task()
git-bisect good ff95f3df54609d9d4b9572f8a67d09922a645043
# good: [be12014dd7750648fde33e1e45cac24dc9a8be6d] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
git-bisect good be12014dd7750648fde33e1e45cac24dc9a8be6d
# good: [6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7] hexdump: use const notation
git-bisect good 6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7
# bad: [6adb31c90c47262c8a25bf5097de9b3426caf3ae] remove dubious legal statment 
from uio-howto
git-bisect bad 6adb31c90c47262c8a25bf5097de9b3426caf3ae

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-23 Thread Florin Iucha
Trond,

Fess up... I'm closing in:

   http://iucha.net/2.6.23-rc3/2.6.23-rc-bisect.png

[Dropping Jiri and linux-usb-devel from future postings.  You are
included now just for communicating the conclusion of this thread.]

On Wed, Aug 22, 2007 at 08:22:00AM -0500, Florin Iucha wrote:
 On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote:
  On Tue, 21 Aug 2007, Florin Iucha wrote:
  
   There is another interesting angle to this: in the past, every time I 
   had keyboard problems, it used to be caused by the VFS and/or NFS... 
   after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, 
   Alan!). Now, after the keyboard locked up, I used the mouse to close 
   the gnome session, then I logged-in remotely to reboot.  The reboot 
   process locked up and I need to use the reset button!  The second time 
   the keyboard locked up I listed my processes, and I noticed that I had 
   a couple of bash processes and a ssh process in D state. Something is 
   fishy again in the VFS ;)
  
  Yes, there were some NFS updates in between -rc2 and 
  28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious 
  what are you going to find by bisect, please let us know.
  
  I added Trond to CC, full thread to be found at 
  http://lkml.org/lkml/2007/8/21/151 for reference.
  
  Florin, it also might be useful to capture the states of stuck processess 
  via alt-sysrq-T (or better by echo t  /proc/sysrq-trigger), so that we 
  know better where are they stuck.
 
 This morning it took a bit longer to hang, but it happened.  The
 backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz .
 
 I'll try a bisect session this weekend.

[EMAIL PROTECTED] $ git bisect bad
Bisecting: 5 revisions left to test after this
[EMAIL PROTECTED] $ git bisect log
git-bisect start
# good: [d4ac2477fad0f2680e84ec12e387ce67682c5c13] Linux 2.6.23-rc2
git-bisect good d4ac2477fad0f2680e84ec12e387ce67682c5c13
# bad: [28e8351ac22de25034e048c680014ad824323c65] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
git-bisect bad 28e8351ac22de25034e048c680014ad824323c65
# bad: [8f2ea1fd3f97ab7a809e939b5b9005a16f862439] [POWERPC] Fix initialization 
and usage of dma_mask
git-bisect bad 8f2ea1fd3f97ab7a809e939b5b9005a16f862439
# good: [ff95f3df54609d9d4b9572f8a67d09922a645043] sched: remove the 'u64 now' 
parameter from pick_next_task()
git-bisect good ff95f3df54609d9d4b9572f8a67d09922a645043
# good: [be12014dd7750648fde33e1e45cac24dc9a8be6d] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
git-bisect good be12014dd7750648fde33e1e45cac24dc9a8be6d
# good: [6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7] hexdump: use const notation
git-bisect good 6a0ed91e361a93ee1efb4c20c4967024ed2a8dd7
# bad: [6adb31c90c47262c8a25bf5097de9b3426caf3ae] remove dubious legal statment 
from uio-howto
git-bisect bad 6adb31c90c47262c8a25bf5097de9b3426caf3ae

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: NFS woes again Was: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-23 Thread Florin Iucha
On Thu, Aug 23, 2007 at 10:14:38AM -0700, Bret Towe wrote:
 this sounds alot like the post i did yesterday titled 'nfs4 hang regression'
 i tracked it down to commit 3d39c691ff486142dd9aaeac12f553f4476b7a6

Yes, it certainly does -- all the symptoms match!

I'm not [alone in] seeing dead keyboards!

Now, if only somebody could clarify to me the connection between
the bad NFS4 shooting the keyboard but not the mouse, that would
be wonderful.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-22 Thread Florin Iucha
On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote:
> On Tue, 21 Aug 2007, Florin Iucha wrote:
> 
> > There is another interesting angle to this: in the past, every time I 
> > had keyboard problems, it used to be caused by the VFS and/or NFS... 
> > after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, 
> > Alan!). Now, after the keyboard "locked up", I used the mouse to close 
> > the gnome session, then I logged-in remotely to reboot.  The reboot 
> > process locked up and I need to use the reset button!  The second time 
> > the keyboard "locked up" I listed my processes, and I noticed that I had 
> > a couple of bash processes and a ssh process in "D" state. Something is 
> > fishy again in the VFS ;)
> 
> Yes, there were some NFS updates in between -rc2 and 
> 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious 
> what are you going to find by bisect, please let us know.
> 
> I added Trond to CC, full thread to be found at 
> http://lkml.org/lkml/2007/8/21/151 for reference.
> 
> Florin, it also might be useful to capture the states of stuck processess 
> via alt-sysrq-T (or better by echo t > /proc/sysrq-trigger), so that we 
> know better where are they stuck.

This morning it took a bit longer to hang, but it happened.  The
backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz .

I'll try a bisect session this weekend.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-22 Thread Florin Iucha
On Tue, Aug 21, 2007 at 03:42:26PM +0200, Jiri Kosina wrote:
 On Tue, 21 Aug 2007, Florin Iucha wrote:
 
  There is another interesting angle to this: in the past, every time I 
  had keyboard problems, it used to be caused by the VFS and/or NFS... 
  after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter, 
  Alan!). Now, after the keyboard locked up, I used the mouse to close 
  the gnome session, then I logged-in remotely to reboot.  The reboot 
  process locked up and I need to use the reset button!  The second time 
  the keyboard locked up I listed my processes, and I noticed that I had 
  a couple of bash processes and a ssh process in D state. Something is 
  fishy again in the VFS ;)
 
 Yes, there were some NFS updates in between -rc2 and 
 28e8351ac22de25034e048c680014ad824323c65. I'd be now even more curious 
 what are you going to find by bisect, please let us know.
 
 I added Trond to CC, full thread to be found at 
 http://lkml.org/lkml/2007/8/21/151 for reference.
 
 Florin, it also might be useful to capture the states of stuck processess 
 via alt-sysrq-T (or better by echo t  /proc/sysrq-trigger), so that we 
 know better where are they stuck.

This morning it took a bit longer to hang, but it happened.  The
backtraces are at http://iucha.net/2.6.23-rc3/backtraces.gz .

I'll try a bisect session this weekend.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 08:17:59AM -0500, Florin Iucha wrote:
> On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote:
> > > I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 
> > > 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
> > > the keyboard/usb behaves or not.
> > 
> > Thanks. If this doesn't give us any hint, it would be useful if you could 
> > do git-bisect between rc2 and rc3, I really can't immediately see anything 
> > in the list of commits that might directly cause the behavior you are 
> > seeing (most importantly because there were no USB and no HID updates in 
> > this window).
> 
> The keyboard still locked up.  There is absolutely nothing in the
> kernel log.
> 
> > There are approximately 290 commits, so it shouldn't require more than 9 
> > reboots plus the time needed to check whether the bug triggers or not.
> 
> The top commit is not v2.6.23-rc3 but
> 
>commit 28e8351ac22de25034e048c680014ad824323c65
>Merge: 3b993e8... d18c4d6...
>Author: Linus Torvalds <[EMAIL PROTECTED]>
>Date:   Tue Aug 14 10:00:29 2007 -0700
> 
>Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
> 
> I'll try to make time to bisect it...

There is another interesting angle to this: in the past, every time I
had keyboard problems, it used to be caused by the VFS and/or NFS...
after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter,
Alan!).

Now, after the keyboard "locked up", I used the mouse to close the
gnome session, then I logged-in remotely to reboot.  The reboot
process locked up and I need to use the reset button!  The second
time the keyboard "locked up" I listed my processes, and I noticed
that I had a couple of bash processes and a ssh process in "D" state.

Something is fishy again in the VFS ;)

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote:
> > I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 
> > 'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
> > the keyboard/usb behaves or not.
> 
> Thanks. If this doesn't give us any hint, it would be useful if you could 
> do git-bisect between rc2 and rc3, I really can't immediately see anything 
> in the list of commits that might directly cause the behavior you are 
> seeing (most importantly because there were no USB and no HID updates in 
> this window).

The keyboard still locked up.  There is absolutely nothing in the
kernel log.

> There are approximately 290 commits, so it shouldn't require more than 9 
> reboots plus the time needed to check whether the bug triggers or not.

The top commit is not v2.6.23-rc3 but

   commit 28e8351ac22de25034e048c680014ad824323c65
   Merge: 3b993e8... d18c4d6...
   Author: Linus Torvalds <[EMAIL PROTECTED]>
   Date:   Tue Aug 14 10:00:29 2007 -0700

   Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes

I'll try to make time to bisect it...

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 06:51:15AM -0500, Florin Iucha wrote:
> On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote:
> > On Wed, 15 Aug 2007, Florin Iucha wrote:
> > 
> > > [See my message to Alan]: It happened twice, within 15 minutes of 
> > > boot+login, with 2.6.23-rc3-$whatever .  I does not happen with 
> > > 2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2 and 
> > > it is working for an hour now.
> > 
> > It is not immediately clear what might be causing this, 2.6.23-rc3 didn't 
> > get any USB nor HID updates at all compared to 2.6.23-rc2.
> > 
> > Could you please enable USB and HID debugging to see whether we can see 
> > anything spurious in the logs at the time the keyboard gets stuck?
> 
> Jiri,
> 
> I have enabled USB debugging and I see a bunch (=46) of these messages:
> 
>[  $timestamp] usb 1-9: usb auto-suspend
>[  $timestamp] usb 1-9: usb auto-resume
>[  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER 
> sig=se0 PE CONNECT
>[  $timestamp] usb 1-9: finish resume
> 
> The messages continued to be logged, even after the keyboard has
> become unresponsive.

[snip]

> Do you need me to build a -rc2 with USB debug enabled to compare and
> contrast?

With 2.6.23-rc2 and USB_DEBUG enabled, I see the same messages but no
keyboard "dissapearance".

I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and
'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
the keyboard/usb behaves or not.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 02:04:26PM +0200, Jiri Kosina wrote:
> > I have enabled USB debugging and I see a bunch (=46) of these messages:
> 
> >[  $timestamp] usb 1-9: usb auto-suspend
> >[  $timestamp] usb 1-9: usb auto-resume
> >[  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 
> > POWER sig=se0 PE CONNECT
> >[  $timestamp] usb 1-9: finish resume
> > The messages continued to be logged, even after the keyboard has
> > become unresponsive.
> 
> I guess that this is the card reader being suspended and resumed 
> afterwards. Do you by any chance see any improvement when you
> 
> - rmmod ehci_hcd

It's built-in.  Should I build it as a module?  This machine has only
usb 2.0 ports.  If I rmmod it, will my USB keyboard still work?

[The card reader is one of those that fit into a 3.5" bay, connected
straight to the motherboard controller, so it's a bit of a pain to
disconnect.]

> - disable USB_AUTOSUSPEND

You mean CONFIG_USB_SUSPEND?

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote:
> On Wed, 15 Aug 2007, Florin Iucha wrote:
> 
> > [See my message to Alan]: It happened twice, within 15 minutes of 
> > boot+login, with 2.6.23-rc3-$whatever .  I does not happen with 
> > 2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2 and 
> > it is working for an hour now.
> 
> It is not immediately clear what might be causing this, 2.6.23-rc3 didn't 
> get any USB nor HID updates at all compared to 2.6.23-rc2.
> 
> Could you please enable USB and HID debugging to see whether we can see 
> anything spurious in the logs at the time the keyboard gets stuck?

Jiri,

I have enabled USB debugging and I see a bunch (=46) of these messages:

   [  $timestamp] usb 1-9: usb auto-suspend
   [  $timestamp] usb 1-9: usb auto-resume
   [  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER 
sig=se0 PE CONNECT
   [  $timestamp] usb 1-9: finish resume

The messages continued to be logged, even after the keyboard has
become unresponsive.

The entire kernel log is at http://iucha.net/usb/log-2.6.23-rc3-2 .
The dump of /proc/bus/usb/devices is at http://iucha.net/usb/devices .
The output of 'lsusb -t' is at http://iucha.net/usb/lsusb-t .  Plain
lsusb is not working.  The version of usbutils is '0.72-7ubuntu2' .

Do you need me to build a -rc2 with USB debug enabled to compare and
contrast?

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote:
 On Wed, 15 Aug 2007, Florin Iucha wrote:
 
  [See my message to Alan]: It happened twice, within 15 minutes of 
  boot+login, with 2.6.23-rc3-$whatever .  I does not happen with 
  2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2 and 
  it is working for an hour now.
 
 It is not immediately clear what might be causing this, 2.6.23-rc3 didn't 
 get any USB nor HID updates at all compared to 2.6.23-rc2.
 
 Could you please enable USB and HID debugging to see whether we can see 
 anything spurious in the logs at the time the keyboard gets stuck?

Jiri,

I have enabled USB debugging and I see a bunch (=46) of these messages:

   [  $timestamp] usb 1-9: usb auto-suspend
   [  $timestamp] usb 1-9: usb auto-resume
   [  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER 
sig=se0 PE CONNECT
   [  $timestamp] usb 1-9: finish resume

The messages continued to be logged, even after the keyboard has
become unresponsive.

The entire kernel log is at http://iucha.net/usb/log-2.6.23-rc3-2 .
The dump of /proc/bus/usb/devices is at http://iucha.net/usb/devices .
The output of 'lsusb -t' is at http://iucha.net/usb/lsusb-t .  Plain
lsusb is not working.  The version of usbutils is '0.72-7ubuntu2' .

Do you need me to build a -rc2 with USB debug enabled to compare and
contrast?

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 06:51:15AM -0500, Florin Iucha wrote:
 On Wed, Aug 15, 2007 at 04:58:33PM +0200, Jiri Kosina wrote:
  On Wed, 15 Aug 2007, Florin Iucha wrote:
  
   [See my message to Alan]: It happened twice, within 15 minutes of 
   boot+login, with 2.6.23-rc3-$whatever .  I does not happen with 
   2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2 and 
   it is working for an hour now.
  
  It is not immediately clear what might be causing this, 2.6.23-rc3 didn't 
  get any USB nor HID updates at all compared to 2.6.23-rc2.
  
  Could you please enable USB and HID debugging to see whether we can see 
  anything spurious in the logs at the time the keyboard gets stuck?
 
 Jiri,
 
 I have enabled USB debugging and I see a bunch (=46) of these messages:
 
[  $timestamp] usb 1-9: usb auto-suspend
[  $timestamp] usb 1-9: usb auto-resume
[  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 POWER 
 sig=se0 PE CONNECT
[  $timestamp] usb 1-9: finish resume
 
 The messages continued to be logged, even after the keyboard has
 become unresponsive.

[snip]

 Do you need me to build a -rc2 with USB debug enabled to compare and
 contrast?

With 2.6.23-rc2 and USB_DEBUG enabled, I see the same messages but no
keyboard dissapearance.

I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and
'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
the keyboard/usb behaves or not.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 02:04:26PM +0200, Jiri Kosina wrote:
  I have enabled USB debugging and I see a bunch (=46) of these messages:
 
 [  $timestamp] usb 1-9: usb auto-suspend
 [  $timestamp] usb 1-9: usb auto-resume
 [  $timestamp] ehci_hcd :00:02.1: GetStatus port 9 status 001005 
  POWER sig=se0 PE CONNECT
 [  $timestamp] usb 1-9: finish resume
  The messages continued to be logged, even after the keyboard has
  become unresponsive.
 
 I guess that this is the card reader being suspended and resumed 
 afterwards. Do you by any chance see any improvement when you
 
 - rmmod ehci_hcd

It's built-in.  Should I build it as a module?  This machine has only
usb 2.0 ports.  If I rmmod it, will my USB keyboard still work?

[The card reader is one of those that fit into a 3.5 bay, connected
straight to the motherboard controller, so it's a bit of a pain to
disconnect.]

 - disable USB_AUTOSUSPEND

You mean CONFIG_USB_SUSPEND?

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote:
  I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 
  'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
  the keyboard/usb behaves or not.
 
 Thanks. If this doesn't give us any hint, it would be useful if you could 
 do git-bisect between rc2 and rc3, I really can't immediately see anything 
 in the list of commits that might directly cause the behavior you are 
 seeing (most importantly because there were no USB and no HID updates in 
 this window).

The keyboard still locked up.  There is absolutely nothing in the
kernel log.

 There are approximately 290 commits, so it shouldn't require more than 9 
 reboots plus the time needed to check whether the bug triggers or not.

The top commit is not v2.6.23-rc3 but

   commit 28e8351ac22de25034e048c680014ad824323c65
   Merge: 3b993e8... d18c4d6...
   Author: Linus Torvalds [EMAIL PROTECTED]
   Date:   Tue Aug 14 10:00:29 2007 -0700

   Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes

I'll try to make time to bisect it...

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-21 Thread Florin Iucha
On Tue, Aug 21, 2007 at 08:17:59AM -0500, Florin Iucha wrote:
 On Tue, Aug 21, 2007 at 03:05:25PM +0200, Jiri Kosina wrote:
   I have rebuilt 2.6.23-rc3 with 'CONFIG_USB_EHCI_HCD=m' and 
   'CONFIG_USB_SUSPEND is not set' and will use it for a while, to see if 
   the keyboard/usb behaves or not.
  
  Thanks. If this doesn't give us any hint, it would be useful if you could 
  do git-bisect between rc2 and rc3, I really can't immediately see anything 
  in the list of commits that might directly cause the behavior you are 
  seeing (most importantly because there were no USB and no HID updates in 
  this window).
 
 The keyboard still locked up.  There is absolutely nothing in the
 kernel log.
 
  There are approximately 290 commits, so it shouldn't require more than 9 
  reboots plus the time needed to check whether the bug triggers or not.
 
 The top commit is not v2.6.23-rc3 but
 
commit 28e8351ac22de25034e048c680014ad824323c65
Merge: 3b993e8... d18c4d6...
Author: Linus Torvalds [EMAIL PROTECTED]
Date:   Tue Aug 14 10:00:29 2007 -0700
 
Merge 
 git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
 
 I'll try to make time to bisect it...

There is another interesting angle to this: in the past, every time I
had keyboard problems, it used to be caused by the VFS and/or NFS...
after much wrangling, a bunch of bugs were fixed (Hi Trond, Peter,
Alan!).

Now, after the keyboard locked up, I used the mouse to close the
gnome session, then I logged-in remotely to reboot.  The reboot
process locked up and I need to use the reset button!  The second
time the keyboard locked up I listed my processes, and I noticed
that I had a couple of bash processes and a ssh process in D state.

Something is fishy again in the VFS ;)

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
On Wed, Aug 15, 2007 at 04:49:02PM +0200, Jiri Kosina wrote:
> On Wed, 15 Aug 2007, Florin Iucha wrote:
> 
> > Today my USB keyboard stopped working in the middle of composing and
> > e-mail.  I unplugged it and plugged it back, with no success.  I
> > logged in remotely and found this lovely message:
> 
> The error message seems unrelated to your keyboard becoming dead.

Yes, it was related to me unplugging it in the hopes that a re-plug
will make it work again ;)

> > I am testing each rcX kernel, and I did not see this problem so far. 
> > Smells like a new regression.
> 
> Is that reproducible, or did it happen just once? Any error message 
> present in log prior to that sysfs dump please?

[See my message to Alan]: It happened twice, within 15 minutes of
boot+login, with 2.6.23-rc3-$whatever .  I does not happen with
2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2
and it is working for an hour now.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
On Wed, Aug 15, 2007 at 10:38:54AM -0400, Alan Stern wrote:
> This patch will get rid of the annoying error messages.  It won't do 
> anything about your keyboard's tendency to spontaneously stop working, 
> alas.

My keyboard works fine for days, with kernels up to and including
2.6.23-rc2 . I have booted into 2.6.23-rc3-$whatever this morning, and
after 10-15 minutes the keyboard stopped working.  The mice which were
plugged in the keyboard's built-in hub were fine though.

The first time it happened, I removed the keyboard and got the oops
that started this thread.  The second time, I just logged-in remotely
and rebooted, and the reboot process stopped at "KILLING all
processes" step.  I simply reset the box and rebooted into 2.6.23-rc2
and it is fine since (over an hour ago).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
Today my USB keyboard stopped working in the middle of composing and
e-mail.  I unplugged it and plugged it back, with no success.  I
logged in remotely and found this lovely message:

[ 1301.567351] usb 1-4: USB disconnect, address 3
[ 1301.567356] usb 1-4.2: USB disconnect, address 5
[ 1301.567599] sysfs_remove_bin_file: bad dentry or inode or no such file: 
"descriptors"
[ 1301.567604] 
[ 1301.567605] Call Trace:
[ 1301.567614]  [] sysfs_remove_bin_file+0x39/0x3d
[ 1301.567619]  [] device_remove_bin_file+0x15/0x17
[ 1301.567623]  [] usb_remove_sysfs_dev_files+0x89/0x9d
[ 1301.567627]  [] generic_disconnect+0x2e/0x32
[ 1301.567630]  [] usb_unbind_device+0x15/0x19
[ 1301.567634]  [] __device_release_driver+0x93/0xb3
[ 1301.567637]  [] device_release_driver+0x31/0x49
[ 1301.567640]  [] bus_remove_device+0x76/0x87
[ 1301.567644]  [] device_del+0x216/0x297
[ 1301.567648]  [] usb_disconnect+0xc8/0x151
[ 1301.567651]  [] usb_disconnect+0x8f/0x151
[ 1301.567655]  [] hub_thread+0x442/0xc47
[ 1301.567659]  [] _spin_unlock_irq+0x9/0xc
[ 1301.567664]  [] autoremove_wake_function+0x0/0x38
[ 1301.567668]  [] hub_thread+0x0/0xc47
[ 1301.567671]  [] kthread+0x49/0x76
[ 1301.567674]  [] child_rip+0xa/0x12
[ 1301.567679]  [] kthread+0x0/0x76
[ 1301.567682]  [] child_rip+0x0/0x12
[ 1301.567684] 

I have rebooted, and while composing this message, I thought useful to
include the output from 'lsusb'.  Funny enough, lsusb does not list any
devices.  A 'cat /proc/bus/usb/devices' yields the following:

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh=10
B:  Alloc= 25/900 us ( 3%), #Int=  5, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor= ProdID= Rev= 2.06
S:  Manufacturer=Linux 2.6.23-rc3-1 ohci_hcd
S:  Product=OHCI Host Controller
S:  SerialNumber=:00:02.0
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=02 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#=  2 Spd=12  MxCh= 4
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05f3 ProdID=0081 Rev= 3.10
S:  Manufacturer=PI Engineering
S:  Product=Kinesis Keyboard Hub
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 50mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=255ms

T:  Bus=02 Lev=02 Prnt=02 Port=01 Cnt=01 Dev#=  3 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05f3 ProdID=0007 Rev= 3.10
C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr= 64mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=8ms
I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=82(I) Atr=03(Int.) MxPS=   4 Ivl=8ms

T:  Bus=02 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#=  4 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05bc ProdID=0102 Rev= 2.00
S:  Manufacturer=Forward
S:  Product=USB Optical Mouse
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=02 Lev=02 Prnt=02 Port=03 Cnt=03 Dev#=  5 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=045e ProdID=0039 Rev= 1.21
S:  Manufacturer=Microsoft
S:  Product=Microsoft IntelliMouse® Optical
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480 MxCh=10
B:  Alloc=  0/800 us ( 0%), #Int=  1, #Iso=  0
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor= ProdID= Rev= 2.06
S:  Manufacturer=Linux 2.6.23-rc3-1 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=:00:02.1
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

T:  Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  3 Spd=480 MxCh= 4
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=0409 ProdID=0058 Rev= 1.00
S:  Manufacturer=NEC Corporation
S:  Product=USB2.0 Hub Controller
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=256ms

T:  Bus=01 Lev=02 Prnt=03 Port=01 Cnt=01 Dev#=  5 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0fe9 ProdID=9010 Rev= 1.00
S:  Manufacturer=DVICO
S:  Product=DVICO USB HID Remocon V1.00
C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   3 Ivl=32ms
E:  Ad=02(O) Atr=03(Int.) MxPS=   1 Ivl=32ms

T:  Bus=01 Lev=01 Prnt=01 Port=08 Cnt=02 Dev#=  4 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=07cc ProdID=0501 

USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
Today my USB keyboard stopped working in the middle of composing and
e-mail.  I unplugged it and plugged it back, with no success.  I
logged in remotely and found this lovely message:

[ 1301.567351] usb 1-4: USB disconnect, address 3
[ 1301.567356] usb 1-4.2: USB disconnect, address 5
[ 1301.567599] sysfs_remove_bin_file: bad dentry or inode or no such file: 
descriptors
[ 1301.567604] 
[ 1301.567605] Call Trace:
[ 1301.567614]  [802b89a6] sysfs_remove_bin_file+0x39/0x3d
[ 1301.567619]  [803f1d24] device_remove_bin_file+0x15/0x17
[ 1301.567623]  [8045ea0d] usb_remove_sysfs_dev_files+0x89/0x9d
[ 1301.567627]  [804625cc] generic_disconnect+0x2e/0x32
[ 1301.567630]  [8045b58d] usb_unbind_device+0x15/0x19
[ 1301.567634]  [803f4161] __device_release_driver+0x93/0xb3
[ 1301.567637]  [803f45af] device_release_driver+0x31/0x49
[ 1301.567640]  [803f39f1] bus_remove_device+0x76/0x87
[ 1301.567644]  [803f2014] device_del+0x216/0x297
[ 1301.567648]  [80455c8f] usb_disconnect+0xc8/0x151
[ 1301.567651]  [80455c56] usb_disconnect+0x8f/0x151
[ 1301.567655]  [804564c8] hub_thread+0x442/0xc47
[ 1301.567659]  [80553cdb] _spin_unlock_irq+0x9/0xc
[ 1301.567664]  [80244ad7] autoremove_wake_function+0x0/0x38
[ 1301.567668]  [80456086] hub_thread+0x0/0xc47
[ 1301.567671]  [802449cb] kthread+0x49/0x76
[ 1301.567674]  [8020c618] child_rip+0xa/0x12
[ 1301.567679]  [80244982] kthread+0x0/0x76
[ 1301.567682]  [8020c60e] child_rip+0x0/0x12
[ 1301.567684] 

I have rebooted, and while composing this message, I thought useful to
include the output from 'lsusb'.  Funny enough, lsusb does not list any
devices.  A 'cat /proc/bus/usb/devices' yields the following:

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh=10
B:  Alloc= 25/900 us ( 3%), #Int=  5, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor= ProdID= Rev= 2.06
S:  Manufacturer=Linux 2.6.23-rc3-1 ohci_hcd
S:  Product=OHCI Host Controller
S:  SerialNumber=:00:02.0
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=02 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#=  2 Spd=12  MxCh= 4
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05f3 ProdID=0081 Rev= 3.10
S:  Manufacturer=PI Engineering
S:  Product=Kinesis Keyboard Hub
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 50mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=255ms

T:  Bus=02 Lev=02 Prnt=02 Port=01 Cnt=01 Dev#=  3 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05f3 ProdID=0007 Rev= 3.10
C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr= 64mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=8ms
I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=82(I) Atr=03(Int.) MxPS=   4 Ivl=8ms

T:  Bus=02 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#=  4 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=05bc ProdID=0102 Rev= 2.00
S:  Manufacturer=Forward
S:  Product=USB Optical Mouse
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=02 Lev=02 Prnt=02 Port=03 Cnt=03 Dev#=  5 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=045e ProdID=0039 Rev= 1.21
S:  Manufacturer=Microsoft
S:  Product=Microsoft IntelliMouse® Optical
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480 MxCh=10
B:  Alloc=  0/800 us ( 0%), #Int=  1, #Iso=  0
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor= ProdID= Rev= 2.06
S:  Manufacturer=Linux 2.6.23-rc3-1 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=:00:02.1
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

T:  Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  3 Spd=480 MxCh= 4
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=0409 ProdID=0058 Rev= 1.00
S:  Manufacturer=NEC Corporation
S:  Product=USB2.0 Hub Controller
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=256ms

T:  Bus=01 Lev=02 Prnt=03 Port=01 Cnt=01 Dev#=  5 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0fe9 ProdID=9010 Rev= 1.00
S:  Manufacturer=DVICO
S:  Product=DVICO USB HID Remocon V1.00
C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA
I:* If#= 0 Alt= 0 

Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
On Wed, Aug 15, 2007 at 10:38:54AM -0400, Alan Stern wrote:
 This patch will get rid of the annoying error messages.  It won't do 
 anything about your keyboard's tendency to spontaneously stop working, 
 alas.

My keyboard works fine for days, with kernels up to and including
2.6.23-rc2 . I have booted into 2.6.23-rc3-$whatever this morning, and
after 10-15 minutes the keyboard stopped working.  The mice which were
plugged in the keyboard's built-in hub were fine though.

The first time it happened, I removed the keyboard and got the oops
that started this thread.  The second time, I just logged-in remotely
and rebooted, and the reboot process stopped at KILLING all
processes step.  I simply reset the box and rebooted into 2.6.23-rc2
and it is fine since (over an hour ago).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [linux-usb-devel] USB-related oops in sysfs with linux v2.6.23-rc3-50-g28e8351

2007-08-15 Thread Florin Iucha
On Wed, Aug 15, 2007 at 04:49:02PM +0200, Jiri Kosina wrote:
 On Wed, 15 Aug 2007, Florin Iucha wrote:
 
  Today my USB keyboard stopped working in the middle of composing and
  e-mail.  I unplugged it and plugged it back, with no success.  I
  logged in remotely and found this lovely message:
 
 The error message seems unrelated to your keyboard becoming dead.

Yes, it was related to me unplugging it in the hopes that a re-plug
will make it work again ;)

  I am testing each rcX kernel, and I did not see this problem so far. 
  Smells like a new regression.
 
 Is that reproducible, or did it happen just once? Any error message 
 present in log prior to that sysfs dump please?

[See my message to Alan]: It happened twice, within 15 minutes of
boot+login, with 2.6.23-rc3-$whatever .  I does not happen with
2.6.2[123](-rc*)?  After the two incidents, I rebooted in 2.6.23-rc2
and it is working for an hour now.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


"upping" a semaphore from interrupt context?

2007-06-22 Thread Florin Iucha
Hello,

I am writing a USB driver for some custom hardware, and I need to
synchronize between the user-space and the USB subsystem.  Can I
create a semaphore and "down" it in the reader then "up" it in the
completion handler?

I know the completion handler runs in interrupt context so you are not
allowed to acquire any semaphores: but can you release them?  Will the
waiting tasks wake up after the handler and its caller returned - IOW
will the waking up task run in interrupt context as well?

This is with Linux 2.4 (if it makes a difference).

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


upping a semaphore from interrupt context?

2007-06-22 Thread Florin Iucha
Hello,

I am writing a USB driver for some custom hardware, and I need to
synchronize between the user-space and the USB subsystem.  Can I
create a semaphore and down it in the reader then up it in the
completion handler?

I know the completion handler runs in interrupt context so you are not
allowed to acquire any semaphores: but can you release them?  Will the
waiting tasks wake up after the handler and its caller returned - IOW
will the waking up task run in interrupt context as well?

This is with Linux 2.4 (if it makes a difference).

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: spurious completions during NCQ?

2007-06-08 Thread Florin Iucha
On Fri, Jun 08, 2007 at 09:55:58PM +0900, Tejun Heo wrote:
> Florin Iucha wrote:
> >> It means the drive reported command tags were completed that were not 
> >> outstanding. What kind of drive is this?
> > 
> > [   29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors 
> > = 156
> > 301488
> > [   29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max 
> > UDMA/100
> > [   29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32)
> > [   29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors 
> > = 156
> > 301488
> 
> Please report the result of 'hdparm -I /dev/sda'.  We'll probably have
> to blacklist it for NCQ.

Here it is: I only jumbled the serial number a bit.

/dev/sda:

ATA device, with non-removable media
Model Number:   HITACHI HTS541680J9SA00 
Serial Number:  SB**I57L4A
Firmware Revision:  SB2IC7EP
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 1 
Supported: 7 6 5 4 
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBAuser addressable sectors:  156301488
LBA48  user addressable sectors:  156301488
device size with M = 1024*1024:   76319 MBytes
device size with M = 1000*1000:   80026 MBytes (80 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Vendor, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: 128 (0x80)
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
   *Advanced Power Management feature set
SET_MAX security extension
Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
   *General Purpose Logging feature set
   *WRITE_{DMA|MULTIPLE}_FUA_EXT
   *64-bit World wide name
   *IDLE_IMMEDIATE with UNLOAD
   *SATA-I signaling speed (1.5Gb/s)
   *Native Command Queueing (NCQ)
   *Host-initiated interface power management
   *Phy event counters
Non-Zero buffer offsets in DMA Setup FIS
DMA Setup Auto-Activate optimization
Device-initiated interface power management
In-order data delivery
   *Software settings preservation
Security: 
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
42min for SECURITY ERASE UNIT. 
Checksum: correct

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: spurious completions during NCQ?

2007-06-08 Thread Florin Iucha
On Fri, Jun 08, 2007 at 09:55:58PM +0900, Tejun Heo wrote:
 Florin Iucha wrote:
  It means the drive reported command tags were completed that were not 
  outstanding. What kind of drive is this?
  
  [   29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors 
  = 156
  301488
  [   29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max 
  UDMA/100
  [   29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32)
  [   29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors 
  = 156
  301488
 
 Please report the result of 'hdparm -I /dev/sda'.  We'll probably have
 to blacklist it for NCQ.

Here it is: I only jumbled the serial number a bit.

/dev/sda:

ATA device, with non-removable media
Model Number:   HITACHI HTS541680J9SA00 
Serial Number:  SB**I57L4A
Firmware Revision:  SB2IC7EP
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 1 
Supported: 7 6 5 4 
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBAuser addressable sectors:  156301488
LBA48  user addressable sectors:  156301488
device size with M = 1024*1024:   76319 MBytes
device size with M = 1000*1000:   80026 MBytes (80 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Vendor, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: 128 (0x80)
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
   *Advanced Power Management feature set
SET_MAX security extension
Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
   *General Purpose Logging feature set
   *WRITE_{DMA|MULTIPLE}_FUA_EXT
   *64-bit World wide name
   *IDLE_IMMEDIATE with UNLOAD
   *SATA-I signaling speed (1.5Gb/s)
   *Native Command Queueing (NCQ)
   *Host-initiated interface power management
   *Phy event counters
Non-Zero buffer offsets in DMA Setup FIS
DMA Setup Auto-Activate optimization
Device-initiated interface power management
In-order data delivery
   *Software settings preservation
Security: 
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
42min for SECURITY ERASE UNIT. 
Checksum: correct

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: spurious completions during NCQ?

2007-06-07 Thread Florin Iucha
On Wed, Jun 06, 2007 at 08:28:07AM -0600, Robert Hancock wrote:
> >This is on a Thinkpad T60 with 2 GB RAM, running Ubuntu 7.04 (kernel
> >2.6.20-16-generic).  No proprietary drivers (ok, maybe the Intel
> >Wi-Fi - but that should not count).
> >
> >The laptop came with Windows but I blew that away - did I mess some
> >thing up regarding HPA and its ilk?
> 
> It means the drive reported command tags were completed that were not 
> outstanding. What kind of drive is this?

[   29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156
301488
[   29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max UDMA/100
[   29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32)
[   29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156
301488

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: spurious completions during NCQ?

2007-06-07 Thread Florin Iucha
On Wed, Jun 06, 2007 at 08:28:07AM -0600, Robert Hancock wrote:
 This is on a Thinkpad T60 with 2 GB RAM, running Ubuntu 7.04 (kernel
 2.6.20-16-generic).  No proprietary drivers (ok, maybe the Intel
 Wi-Fi - but that should not count).
 
 The laptop came with Windows but I blew that away - did I mess some
 thing up regarding HPA and its ilk?
 
 It means the drive reported command tags were completed that were not 
 outstanding. What kind of drive is this?

[   29.033142] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156
301488
[   29.033146] ata1.00: ATA-7: HITACHI HTS541680J9SA00, SB2IC7EP, max UDMA/100
[   29.033149] ata1.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 31/32)
[   29.034230] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156
301488

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


spurious completions during NCQ?

2007-06-05 Thread Florin Iucha
Hello,

I was working on a I/O heavy workload (parsing 100K spam messages to
extract certain structures) when I got this in the kernel log:

[ 2320.132893] ata1.00: exception Emask 0x2 SAct 0x701f SErr 0x0 action 0x2 
frozen
[ 2320.132899] ata1.00: (spurious completions during NCQ issue=0x0 
SAct=0x701f FIS=005040a1:0800)
[ 2320.132905] ata1.00: cmd 61/10:00:59:fc:d0/00:00:07:00:00/40 tag 0 cdb 0x0 
data 8192 out
[ 2320.132906]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132911] ata1.00: cmd 61/10:08:69:fc:d0/00:00:07:00:00/40 tag 1 cdb 0x0 
data 8192 out
[ 2320.132913]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132918] ata1.00: cmd 61/08:10:19:4c:d1/00:00:07:00:00/40 tag 2 cdb 0x0 
data 4096 out
[ 2320.132919]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132924] ata1.00: cmd 61/01:18:fb:27:0f/00:00:08:00:00/40 tag 3 cdb 0x0 
data 512 out
[ 2320.132925]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132930] ata1.00: cmd 61/08:20:11:28:0f/00:00:08:00:00/40 tag 4 cdb 0x0 
data 4096 out
[ 2320.132932]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132937] ata1.00: cmd 61/08:28:91:92:17/00:00:08:00:00/40 tag 5 cdb 0x0 
data 4096 out
[ 2320.132938]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132943] ata1.00: cmd 61/08:30:99:b3:17/00:00:08:00:00/40 tag 6 cdb 0x0 
data 4096 out
[ 2320.132944]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132949] ata1.00: cmd 61/01:38:5b:56:4b/00:00:05:00:00/40 tag 7 cdb 0x0 
data 512 out
[ 2320.132950]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132956] ata1.00: cmd 61/08:40:71:56:4b/00:00:05:00:00/40 tag 8 cdb 0x0 
data 4096 out
[ 2320.132957]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132962] ata1.00: cmd 61/08:48:09:cf:5a/00:00:05:00:00/40 tag 9 cdb 0x0 
data 4096 out
[ 2320.132963]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132968] ata1.00: cmd 61/01:50:3b:af:8b/00:00:05:00:00/40 tag 10 cdb 0x0 
data 512 out
[ 2320.132969]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132974] ata1.00: cmd 61/08:58:51:af:8b/00:00:05:00:00/40 tag 11 cdb 0x0 
data 4096 out
[ 2320.132976]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132981] ata1.00: cmd 61/08:60:d9:f1:8d/00:00:05:00:00/40 tag 12 cdb 0x0 
data 4096 out
[ 2320.132982]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132987] ata1.00: cmd 61/08:68:49:bd:8e/00:00:05:00:00/40 tag 13 cdb 0x0 
data 4096 out
[ 2320.132988]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132993] ata1.00: cmd 61/08:70:d9:63:d1/00:00:05:00:00/40 tag 14 cdb 0x0 
data 4096 out
[ 2320.132995]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133000] ata1.00: cmd 61/08:78:91:39:0f/00:00:06:00:00/40 tag 15 cdb 0x0 
data 4096 out
[ 2320.133001]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133006] ata1.00: cmd 61/08:80:49:30:97/00:00:06:00:00/40 tag 16 cdb 0x0 
data 4096 out
[ 2320.133007]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133013] ata1.00: cmd 61/08:88:41:c0:d0/00:00:06:00:00/40 tag 17 cdb 0x0 
data 4096 out
[ 2320.133014]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133019] ata1.00: cmd 61/08:90:b9:8c:d1/00:00:06:00:00/40 tag 18 cdb 0x0 
data 4096 out
[ 2320.133020]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133025] ata1.00: cmd 61/01:98:1b:cf:ce/00:00:07:00:00/40 tag 19 cdb 0x0 
data 512 out
[ 2320.133027]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133032] ata1.00: cmd 61/08:a0:31:cf:ce/00:00:07:00:00/40 tag 20 cdb 0x0 
data 4096 out
[ 2320.133033]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133038] ata1.00: cmd 61/10:e0:41:af:8b/00:00:05:00:00/40 tag 28 cdb 0x0 
data 8192 out
[ 2320.133039]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133044] ata1.00: cmd 61/01:e8:ba:12:8d/00:00:06:00:00/40 tag 29 cdb 0x0 
data 512 out
[ 2320.133046]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133051] ata1.00: cmd 61/10:f0:c1:12:8d/00:00:06:00:00/40 tag 30 cdb 0x0 
data 8192 out
[ 2320.133052]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.378387] ata1: soft resetting port
[ 2320.442169] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 2320.460012] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 
156301488
[ 2320.461395] ata1.00: ata_hpa_resize 

spurious completions during NCQ?

2007-06-05 Thread Florin Iucha
Hello,

I was working on a I/O heavy workload (parsing 100K spam messages to
extract certain structures) when I got this in the kernel log:

[ 2320.132893] ata1.00: exception Emask 0x2 SAct 0x701f SErr 0x0 action 0x2 
frozen
[ 2320.132899] ata1.00: (spurious completions during NCQ issue=0x0 
SAct=0x701f FIS=005040a1:0800)
[ 2320.132905] ata1.00: cmd 61/10:00:59:fc:d0/00:00:07:00:00/40 tag 0 cdb 0x0 
data 8192 out
[ 2320.132906]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132911] ata1.00: cmd 61/10:08:69:fc:d0/00:00:07:00:00/40 tag 1 cdb 0x0 
data 8192 out
[ 2320.132913]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132918] ata1.00: cmd 61/08:10:19:4c:d1/00:00:07:00:00/40 tag 2 cdb 0x0 
data 4096 out
[ 2320.132919]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132924] ata1.00: cmd 61/01:18:fb:27:0f/00:00:08:00:00/40 tag 3 cdb 0x0 
data 512 out
[ 2320.132925]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132930] ata1.00: cmd 61/08:20:11:28:0f/00:00:08:00:00/40 tag 4 cdb 0x0 
data 4096 out
[ 2320.132932]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132937] ata1.00: cmd 61/08:28:91:92:17/00:00:08:00:00/40 tag 5 cdb 0x0 
data 4096 out
[ 2320.132938]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132943] ata1.00: cmd 61/08:30:99:b3:17/00:00:08:00:00/40 tag 6 cdb 0x0 
data 4096 out
[ 2320.132944]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132949] ata1.00: cmd 61/01:38:5b:56:4b/00:00:05:00:00/40 tag 7 cdb 0x0 
data 512 out
[ 2320.132950]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132956] ata1.00: cmd 61/08:40:71:56:4b/00:00:05:00:00/40 tag 8 cdb 0x0 
data 4096 out
[ 2320.132957]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132962] ata1.00: cmd 61/08:48:09:cf:5a/00:00:05:00:00/40 tag 9 cdb 0x0 
data 4096 out
[ 2320.132963]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132968] ata1.00: cmd 61/01:50:3b:af:8b/00:00:05:00:00/40 tag 10 cdb 0x0 
data 512 out
[ 2320.132969]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132974] ata1.00: cmd 61/08:58:51:af:8b/00:00:05:00:00/40 tag 11 cdb 0x0 
data 4096 out
[ 2320.132976]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132981] ata1.00: cmd 61/08:60:d9:f1:8d/00:00:05:00:00/40 tag 12 cdb 0x0 
data 4096 out
[ 2320.132982]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132987] ata1.00: cmd 61/08:68:49:bd:8e/00:00:05:00:00/40 tag 13 cdb 0x0 
data 4096 out
[ 2320.132988]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.132993] ata1.00: cmd 61/08:70:d9:63:d1/00:00:05:00:00/40 tag 14 cdb 0x0 
data 4096 out
[ 2320.132995]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133000] ata1.00: cmd 61/08:78:91:39:0f/00:00:06:00:00/40 tag 15 cdb 0x0 
data 4096 out
[ 2320.133001]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133006] ata1.00: cmd 61/08:80:49:30:97/00:00:06:00:00/40 tag 16 cdb 0x0 
data 4096 out
[ 2320.133007]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133013] ata1.00: cmd 61/08:88:41:c0:d0/00:00:06:00:00/40 tag 17 cdb 0x0 
data 4096 out
[ 2320.133014]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133019] ata1.00: cmd 61/08:90:b9:8c:d1/00:00:06:00:00/40 tag 18 cdb 0x0 
data 4096 out
[ 2320.133020]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133025] ata1.00: cmd 61/01:98:1b:cf:ce/00:00:07:00:00/40 tag 19 cdb 0x0 
data 512 out
[ 2320.133027]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133032] ata1.00: cmd 61/08:a0:31:cf:ce/00:00:07:00:00/40 tag 20 cdb 0x0 
data 4096 out
[ 2320.133033]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133038] ata1.00: cmd 61/10:e0:41:af:8b/00:00:05:00:00/40 tag 28 cdb 0x0 
data 8192 out
[ 2320.133039]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133044] ata1.00: cmd 61/01:e8:ba:12:8d/00:00:06:00:00/40 tag 29 cdb 0x0 
data 512 out
[ 2320.133046]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.133051] ata1.00: cmd 61/10:f0:c1:12:8d/00:00:06:00:00/40 tag 30 cdb 0x0 
data 8192 out
[ 2320.133052]  res 50/00:08:31:cf:ce/00:00:07:00:00/40 Emask 0x2 (HSM 
violation)
[ 2320.378387] ata1: soft resetting port
[ 2320.442169] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 2320.460012] ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 
156301488
[ 2320.461395] ata1.00: ata_hpa_resize 

Re: [1/4] 2.6.22-rc3: known regressions

2007-05-29 Thread Florin Iucha
On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote:
> On Tue 29-05-07 14:52:53, Michal Piotrowski wrote:
> > Here is a list of some known regressions in 2.6.22-rc3.
> > 
> > Subject: Oops in dentry_iput with 2.6.22-rc2 on AMD64
> > References : http://lkml.org/lkml/2007/5/22/4
> > Submitter  : Florin Iucha <[EMAIL PROTECTED]>
> > Status : Unknown
>   Actually, the bug seems to be unreproducible and it has probably been a
> 1-bit flip. So I'd be reluctant to call it a regression...

I agree with this statement.  I'll ping Michal and Jan if the oops
resurfaces.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [1/4] 2.6.22-rc3: known regressions

2007-05-29 Thread Florin Iucha
On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote:
 On Tue 29-05-07 14:52:53, Michal Piotrowski wrote:
  Here is a list of some known regressions in 2.6.22-rc3.
  
  Subject: Oops in dentry_iput with 2.6.22-rc2 on AMD64
  References : http://lkml.org/lkml/2007/5/22/4
  Submitter  : Florin Iucha [EMAIL PROTECTED]
  Status : Unknown
   Actually, the bug seems to be unreproducible and it has probably been a
 1-bit flip. So I'd be reluctant to call it a regression...

I agree with this statement.  I'll ping Michal and Jan if the oops
resurfaces.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Oops in dentry_iput with 2.6.22-rc2 on AMD64

2007-05-24 Thread Florin Iucha
On Tue, May 22, 2007 at 11:57:11AM +0200, Jan Kara wrote:
> > I was running a multithreaded perl application that leaks some memory
> > so it gets to eat up a significant chunk of my 2 GB and even push a
> > bit into swap.  I left it running before going out for a walk.
>   Hmm, what seems suspitious is, that in R12 (which probably contains
> the address dereferenced later) is address 9100... while all other
> addresses start with 8100. So it seems to me it could be a 1-bit
> flip. Care to check your memory with memtest?

I let it run overnight: 8 passes and no surprises.  This is a system
that has been quite stable for a year and a half, except finding the
occasional kernel bug ;)

>   Also this is a code all other people use all the time so I guess we
> would see more reports if this was some general bug...

Haven't got any more oopses like that, either.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Oops in dentry_iput with 2.6.22-rc2 on AMD64

2007-05-24 Thread Florin Iucha
On Tue, May 22, 2007 at 11:57:11AM +0200, Jan Kara wrote:
  I was running a multithreaded perl application that leaks some memory
  so it gets to eat up a significant chunk of my 2 GB and even push a
  bit into swap.  I left it running before going out for a walk.
   Hmm, what seems suspitious is, that in R12 (which probably contains
 the address dereferenced later) is address 9100... while all other
 addresses start with 8100. So it seems to me it could be a 1-bit
 flip. Care to check your memory with memtest?

I let it run overnight: 8 passes and no surprises.  This is a system
that has been quite stable for a year and a half, except finding the
occasional kernel bug ;)

   Also this is a code all other people use all the time so I guess we
 would see more reports if this was some general bug...

Haven't got any more oopses like that, either.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Oops in dentry_iput with 2.6.22-rc2 on AMD64

2007-05-21 Thread Florin Iucha
I was running a multithreaded perl application that leaks some memory
so it gets to eat up a significant chunk of my 2 GB and even push a
bit into swap.  I left it running before going out for a walk.

When I got back, I found this in the log:

[28818.103829] Unable to handle kernel paging request at 910025c9a8b4 RIP: 
[28818.103836]  [] dentry_iput+0x58/0xbb
[28818.103845] PGD 0 
[28818.103848] Oops:  [1] SMP 
[28818.103851] CPU 0 
[28818.103853] Modules linked in: radeon ntfs sbp2 lp lgdt330x cx88_dvb 
cx88_vp3054_i2c dvb_pll video_buf_dvb tuner cx8802 cx88_alsa cx8800 cx88xx 
ir_common rtc tveeprom video_buf btcx_risc i2c_nforce2 evdev forcedeth
[28818.103868] Pid: 253, comm: kswapd0 Not tainted 2.6.22-rc2 #1
[28818.103871] RIP: 0010:[]  [] 
dentry_iput+0x58/0xbb
[28818.103877] RSP: :81007de75d40  EFLAGS: 00010246
[28818.103880] RAX:  RBX: 810075662e00 RCX: 810025c9a898
[28818.103883] RDX: 810025c9a898 RSI: 0001 RDI: 0282
[28818.103886] RBP: 81007de75d50 R08: 534d R09: 0064
[28818.103890] R10: 81007de75d80 R11: 000a R12: 910025c9a868
[28818.103893] R13: 810075662e08 R14:  R15: 005e
[28818.103897] FS:  42804940() GS:806c8000() 
knlGS:
[28818.103900] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[28818.103903] CR2: 910025c9a8b4 CR3: 254bf000 CR4: 06e0
[28818.103907] Process kswapd0 (pid: 253, threadinfo 81007de74000, task 
81007de706d0)
[28818.103909] Stack:  810075662e00 0001 81007de75d70 
80288aaf
[28818.103916]  81007e606070 0001 81007de75d90 
80289a35
[28818.103921]  81007e606070 810075662e00 81007de75dd0 
80289d3d
[28818.103926] Call Trace:
[28818.103931]  [] d_kill+0x38/0x58
[28818.103935]  [] prune_one_dentry+0x3c/0x10f
[28818.103939]  [] prune_dcache+0x137/0x1a0
[28818.103944]  [] shrink_dcache_memory+0x1c/0x35
[28818.103948]  [] shrink_slab+0xe6/0x162
[28818.103953]  [] kswapd+0x329/0x4ca
[28818.103958]  [] autoremove_wake_function+0x0/0x38
[28818.103963]  [] kswapd+0x0/0x4ca
[28818.103967]  [] kthread+0x49/0x76
[28818.103971]  [] child_rip+0xa/0x12
[28818.103977]  [] kthread+0x0/0x76
[28818.103980]  [] child_rip+0x0/0x12
[28818.103982] 
[28818.103984] 
[28818.103985] Code: 41 83 7c 24 4c 00 75 1c 4c 89 e7 45 31 c0 31 c9 31 d2 be 
00 
[28818.103994] RIP  [] dentry_iput+0x58/0xbb
[28818.103999]  RSP 
[28818.104001] CR2: 910025c9a8b4

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Oops in dentry_iput with 2.6.22-rc2 on AMD64

2007-05-21 Thread Florin Iucha
I was running a multithreaded perl application that leaks some memory
so it gets to eat up a significant chunk of my 2 GB and even push a
bit into swap.  I left it running before going out for a walk.

When I got back, I found this in the log:

[28818.103829] Unable to handle kernel paging request at 910025c9a8b4 RIP: 
[28818.103836]  [80288998] dentry_iput+0x58/0xbb
[28818.103845] PGD 0 
[28818.103848] Oops:  [1] SMP 
[28818.103851] CPU 0 
[28818.103853] Modules linked in: radeon ntfs sbp2 lp lgdt330x cx88_dvb 
cx88_vp3054_i2c dvb_pll video_buf_dvb tuner cx8802 cx88_alsa cx8800 cx88xx 
ir_common rtc tveeprom video_buf btcx_risc i2c_nforce2 evdev forcedeth
[28818.103868] Pid: 253, comm: kswapd0 Not tainted 2.6.22-rc2 #1
[28818.103871] RIP: 0010:[80288998]  [80288998] 
dentry_iput+0x58/0xbb
[28818.103877] RSP: :81007de75d40  EFLAGS: 00010246
[28818.103880] RAX:  RBX: 810075662e00 RCX: 810025c9a898
[28818.103883] RDX: 810025c9a898 RSI: 0001 RDI: 0282
[28818.103886] RBP: 81007de75d50 R08: 534d R09: 0064
[28818.103890] R10: 81007de75d80 R11: 000a R12: 910025c9a868
[28818.103893] R13: 810075662e08 R14:  R15: 005e
[28818.103897] FS:  42804940() GS:806c8000() 
knlGS:
[28818.103900] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[28818.103903] CR2: 910025c9a8b4 CR3: 254bf000 CR4: 06e0
[28818.103907] Process kswapd0 (pid: 253, threadinfo 81007de74000, task 
81007de706d0)
[28818.103909] Stack:  810075662e00 0001 81007de75d70 
80288aaf
[28818.103916]  81007e606070 0001 81007de75d90 
80289a35
[28818.103921]  81007e606070 810075662e00 81007de75dd0 
80289d3d
[28818.103926] Call Trace:
[28818.103931]  [80288aaf] d_kill+0x38/0x58
[28818.103935]  [80289a35] prune_one_dentry+0x3c/0x10f
[28818.103939]  [80289d3d] prune_dcache+0x137/0x1a0
[28818.103944]  [80289dc2] shrink_dcache_memory+0x1c/0x35
[28818.103948]  [80261178] shrink_slab+0xe6/0x162
[28818.103953]  [802615ef] kswapd+0x329/0x4ca
[28818.103958]  [8024458b] autoremove_wake_function+0x0/0x38
[28818.103963]  [802612c6] kswapd+0x0/0x4ca
[28818.103967]  [8024447f] kthread+0x49/0x76
[28818.103971]  [8020a578] child_rip+0xa/0x12
[28818.103977]  [80244436] kthread+0x0/0x76
[28818.103980]  [8020a56e] child_rip+0x0/0x12
[28818.103982] 
[28818.103984] 
[28818.103985] Code: 41 83 7c 24 4c 00 75 1c 4c 89 e7 45 31 c0 31 c9 31 d2 be 
00 
[28818.103994] RIP  [80288998] dentry_iput+0x58/0xbb
[28818.103999]  RSP 81007de75d40
[28818.104001] CR2: 910025c9a8b4

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


BUG with today's git

2007-05-17 Thread Florin Iucha
Kernel: v2.6.22-rc1-g0479ea0

Got this in the logs:

[ 8314.672340] BUG: at fs/inotify.c:172 set_dentry_child_flags()
[ 8314.672345]
[ 8314.672346] Call Trace:
[ 8314.672353]  [] _spin_lock+0x9/0xb
[ 8314.672361]  [] set_dentry_child_flags+0x6d/0x14f
[ 8314.672366]  [] remove_watch_no_event+0x68/0x78
[ 8314.672370]  [] inotify_remove_watch_locked+0x18/0x3a
[ 8314.672374]  [] inotify_rm_wd+0x87/0xad
[ 8314.672378]  [] sys_inotify_rm_watch+0x4e/0x6d
[ 8314.672383]  [] system_call+0x7e/0x83
[ 8314.672387]

Machine was mostly idle.  A movie dvd was in the DVD-ROM, but not being played.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


BUG with today's git

2007-05-17 Thread Florin Iucha
Kernel: v2.6.22-rc1-g0479ea0

Got this in the logs:

[ 8314.672340] BUG: at fs/inotify.c:172 set_dentry_child_flags()
[ 8314.672345]
[ 8314.672346] Call Trace:
[ 8314.672353]  [8054a44c] _spin_lock+0x9/0xb
[ 8314.672361]  [8029fc0f] set_dentry_child_flags+0x6d/0x14f
[ 8314.672366]  [8029fd59] remove_watch_no_event+0x68/0x78
[ 8314.672370]  [8029fe5d] inotify_remove_watch_locked+0x18/0x3a
[ 8314.672374]  [802a0164] inotify_rm_wd+0x87/0xad
[ 8314.672378]  [802a07b3] sys_inotify_rm_watch+0x4e/0x6d
[ 8314.672383]  [8020975e] system_call+0x7e/0x83
[ 8314.672387]

Machine was mostly idle.  A movie dvd was in the DVD-ROM, but not being played.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Current git fails to compile on AMD64 for three days in a row

2007-05-10 Thread Florin Iucha
I just pulled the current git (de5603748af8bf7deac403e6ba92887f8d18e812)
and tried to compile it on my AMD64 box, to test Chuck's RPC fix.  It
fails:

   arch/x86_64/kernel/head64.c: In function ‘x86_64_start_kernel’:
   arch/x86_64/kernel/head64.c:70: error: size of array ‘type name’ is negative

It is failing in the same way since Monday.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Current git fails to compile on AMD64 for three days in a row

2007-05-10 Thread Florin Iucha
I just pulled the current git (de5603748af8bf7deac403e6ba92887f8d18e812)
and tried to compile it on my AMD64 box, to test Chuck's RPC fix.  It
fails:

   arch/x86_64/kernel/head64.c: In function ‘x86_64_start_kernel’:
   arch/x86_64/kernel/head64.c:70: error: size of array ‘type name’ is negative

It is failing in the same way since Monday.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


REGRESSION: cannot change file access permissions on files NFS-mounted

2007-05-05 Thread Florin Iucha
Hello,

This morning I updated the kernel on my workstation to the current git 
tree (62ea6d80211ecc88ef516927ecebf64cb505be3f).   Upon reboot, I
cannot change file access permissions of files in a directory that is
nfs mounted (using NFS4):

   $ chmod 0600 $path
   chmod: Changing permissions of `$path': Input/output error

In the kernel log I am getting these gems:

   [  275.990233] decode_attr_group: reply buffer overflowed in line
   2678.<4>decode
   _attr_group: reply buffer overflowed in line
   2678.<4>decode_attr_group: reply bu
   ffer overflowed in line 2678.<4>decode_attr_group: reply buffer
   overflowed in li
   ne 2678.<4>decode_attr_group: reply buffer overflowed in line
   2678.<4>decode_att
   r_group: reply buffer overflowed in line 2678.

The server is Linux running Debian testing with the kernel updated to 
2.6.20-1-686 from
(http://ftp.us.debian.org/debian/pool/main/l/linux-2.6/linux-image-2.6.20-1-686_2.6.20-3_i386.deb)

Kernel 2.6.21 works fine on the client.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


REGRESSION: cannot change file access permissions on files NFS-mounted

2007-05-05 Thread Florin Iucha
Hello,

This morning I updated the kernel on my workstation to the current git 
tree (62ea6d80211ecc88ef516927ecebf64cb505be3f).   Upon reboot, I
cannot change file access permissions of files in a directory that is
nfs mounted (using NFS4):

   $ chmod 0600 $path
   chmod: Changing permissions of `$path': Input/output error

In the kernel log I am getting these gems:

   [  275.990233] decode_attr_group: reply buffer overflowed in line
   2678.4decode
   _attr_group: reply buffer overflowed in line
   2678.4decode_attr_group: reply bu
   ffer overflowed in line 2678.4decode_attr_group: reply buffer
   overflowed in li
   ne 2678.4decode_attr_group: reply buffer overflowed in line
   2678.4decode_att
   r_group: reply buffer overflowed in line 2678.

The server is Linux running Debian testing with the kernel updated to 
2.6.20-1-686 from
(http://ftp.us.debian.org/debian/pool/main/l/linux-2.6/linux-image-2.6.20-1-686_2.6.20-3_i386.deb)

Kernel 2.6.21 works fine on the client.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/5] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-21 Thread Florin Iucha
On Fri, Apr 20, 2007 at 04:03:58PM -0400, Trond Myklebust wrote:
> I've split the issues introduced by the 2.6.21-rcX write code up into 4
> subproblems.
[snip]
> My thanks to the various patient victim^Wpeople who helped with extensive
> testing.

I've pulled the tree this morning
(0f851021c0f91e5073fa89f26b5ac68e23df8e11) and ran big-copy from a
gnome session.  All is well.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/5] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-21 Thread Florin Iucha
On Fri, Apr 20, 2007 at 04:03:58PM -0400, Trond Myklebust wrote:
 I've split the issues introduced by the 2.6.21-rcX write code up into 4
 subproblems.
[snip]
 My thanks to the various patient victim^Wpeople who helped with extensive
 testing.

I've pulled the tree this morning
(0f851021c0f91e5073fa89f26b5ac68e23df8e11) and ran big-copy from a
gnome session.  All is well.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-20 Thread Florin Iucha
On Fri, Apr 20, 2007 at 09:37:30AM -0400, Trond Myklebust wrote:
> Thanks! Did you ever find out what had happened to the test that hung
> last night?

Nope.  I could not ssh into it and the machine was needed for some
windows duty before I got home ;)  I'll try again this coming week-end
and let you know if I see any problems.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-20 Thread Florin Iucha
On Thu, Apr 19, 2007 at 04:49:31PM -0500, Florin Iucha wrote:
> On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote:
> > > I'm far from the machine right now, so I will do some more tests
> > > tonight, but right now, the new patchset is not good.  What is the
> > > difference between reverting the patch you sent yesterday and your
> > > current fifth patch?  I assume the other four are identical, right?
> > 
> > The only difference is the way in which we handle retries of an NFSv4
> > request: the new patch disconnects if and only if a timeout has
> > occurred, or the server sends us garbage.
> 
> I have to mention that I rebased to the head of the tree
> (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your
> patches, in order to test what I expect the official tree to be.
> 
> Tonight I'll test this kernel once more, then go back to 21-rc7 and
> apply your 5 patches and re-test.

It passed big-copy, and the copy run from the gnome-session while I
did my morning light browsing, email reading, etc.

kernel:
   895e1fc7226e6732bc77138955b6c7dfa279f57a

patches:
   linux-2.6.21-001-cleanup_unstable_write.dif
   linux-2.6.21-002-defer_clearing_pg_writeback.dif
   linux-2.6.21-003-fix_desynchronised_ncommit.dif
   linux-2.6.21-004-fix_nfs_set_page_dirty.dif
   linux-2.6.21-005-fix_nfsv4_resend.dif

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-20 Thread Florin Iucha
On Thu, Apr 19, 2007 at 04:49:31PM -0500, Florin Iucha wrote:
 On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote:
   I'm far from the machine right now, so I will do some more tests
   tonight, but right now, the new patchset is not good.  What is the
   difference between reverting the patch you sent yesterday and your
   current fifth patch?  I assume the other four are identical, right?
  
  The only difference is the way in which we handle retries of an NFSv4
  request: the new patch disconnects if and only if a timeout has
  occurred, or the server sends us garbage.
 
 I have to mention that I rebased to the head of the tree
 (895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your
 patches, in order to test what I expect the official tree to be.
 
 Tonight I'll test this kernel once more, then go back to 21-rc7 and
 apply your 5 patches and re-test.

It passed big-copy, and the copy run from the gnome-session while I
did my morning light browsing, email reading, etc.

kernel:
   895e1fc7226e6732bc77138955b6c7dfa279f57a

patches:
   linux-2.6.21-001-cleanup_unstable_write.dif
   linux-2.6.21-002-defer_clearing_pg_writeback.dif
   linux-2.6.21-003-fix_desynchronised_ncommit.dif
   linux-2.6.21-004-fix_nfs_set_page_dirty.dif
   linux-2.6.21-005-fix_nfsv4_resend.dif

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-20 Thread Florin Iucha
On Fri, Apr 20, 2007 at 09:37:30AM -0400, Trond Myklebust wrote:
 Thanks! Did you ever find out what had happened to the test that hung
 last night?

Nope.  I could not ssh into it and the machine was needed for some
windows duty before I got home ;)  I'll try again this coming week-end
and let you know if I see any problems.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote:
> > I'm far from the machine right now, so I will do some more tests
> > tonight, but right now, the new patchset is not good.  What is the
> > difference between reverting the patch you sent yesterday and your
> > current fifth patch?  I assume the other four are identical, right?
> 
> The only difference is the way in which we handle retries of an NFSv4
> request: the new patch disconnects if and only if a timeout has
> occurred, or the server sends us garbage.

I have to mention that I rebased to the head of the tree
(895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your
patches, in order to test what I expect the official tree to be.

Tonight I'll test this kernel once more, then go back to 21-rc7 and
apply your 5 patches and re-test.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 12:09:42PM -0400, Trond Myklebust wrote:
> See
>http://client.linux-nfs.org/Linux-2.6.x/2.6.21-rc7/
> 
> I'm giving the first 5 patches of that series (i.e.
> linux-2.6.21-001-cleanup_unstable_write.dif to
> linux-2.6.21-005-fix_nfsv4_resend.dif) an extra beating since those are
> the ones that I feel should go into 2.6.21 final in order to fix the
> read/write regressions that have been reported. They should be identical
> to the patches that I posted on lkml in the past 3 days.
> 
> Please feel free to grab them and give them a test.

The copy completed some time ago, but now I cannot ssh into the box!
This is a new development, as before I was always able to ssh into,
even when the copy slowed down to a trickle.

I'm far from the machine right now, so I will do some more tests
tonight, but right now, the new patchset is not good.  What is the
difference between reverting the patch you sent yesterday and your
current fifth patch?  I assume the other four are identical, right?

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 11:17:28AM -0400, Trond Myklebust wrote:
> On Thu, 2007-04-19 at 11:12 -0400, Chuck Lever wrote:
> > Perhaps instead of looking at the number of bytes sent, the logic in the 
> > last hunk of this patch should check which queue the request is sitting on.
> 
> ??? It would be a bug for the request to be sitting on _any_ queue when
> it enters xprt_transmit().
> 
> Here is the patch that I'm currently testing.

Trond,

What is the set of patches that are you testing?  I'd like to give
that a spin tonight as well.

It is possible that what makes my configuration more susceptible
to the problem is the fact that the client significantly overpowers
the server: Athlon x2 4200+ with 2Gb of RAM for the client vs. PIII
1Ghz 512 MB RAM for the server.  They both have gigabit ethernet
and both NICs and the switch support jumbo frames.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 11:17:28AM -0400, Trond Myklebust wrote:
 On Thu, 2007-04-19 at 11:12 -0400, Chuck Lever wrote:
  Perhaps instead of looking at the number of bytes sent, the logic in the 
  last hunk of this patch should check which queue the request is sitting on.
 
 ??? It would be a bug for the request to be sitting on _any_ queue when
 it enters xprt_transmit().
 
 Here is the patch that I'm currently testing.

Trond,

What is the set of patches that are you testing?  I'd like to give
that a spin tonight as well.

It is possible that what makes my configuration more susceptible
to the problem is the fact that the client significantly overpowers
the server: Athlon x2 4200+ with 2Gb of RAM for the client vs. PIII
1Ghz 512 MB RAM for the server.  They both have gigabit ethernet
and both NICs and the switch support jumbo frames.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 12:09:42PM -0400, Trond Myklebust wrote:
 See
http://client.linux-nfs.org/Linux-2.6.x/2.6.21-rc7/
 
 I'm giving the first 5 patches of that series (i.e.
 linux-2.6.21-001-cleanup_unstable_write.dif to
 linux-2.6.21-005-fix_nfsv4_resend.dif) an extra beating since those are
 the ones that I feel should go into 2.6.21 final in order to fix the
 read/write regressions that have been reported. They should be identical
 to the patches that I posted on lkml in the past 3 days.
 
 Please feel free to grab them and give them a test.

The copy completed some time ago, but now I cannot ssh into the box!
This is a new development, as before I was always able to ssh into,
even when the copy slowed down to a trickle.

I'm far from the machine right now, so I will do some more tests
tonight, but right now, the new patchset is not good.  What is the
difference between reverting the patch you sent yesterday and your
current fifth patch?  I assume the other four are identical, right?

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: Failure! Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-19 Thread Florin Iucha
On Thu, Apr 19, 2007 at 05:30:42PM -0400, Trond Myklebust wrote:
  I'm far from the machine right now, so I will do some more tests
  tonight, but right now, the new patchset is not good.  What is the
  difference between reverting the patch you sent yesterday and your
  current fifth patch?  I assume the other four are identical, right?
 
 The only difference is the way in which we handle retries of an NFSv4
 request: the new patch disconnects if and only if a timeout has
 occurred, or the server sends us garbage.

I have to mention that I rebased to the head of the tree
(895e1fc7226e6732bc77138955b6c7dfa279f57a) before applying your
patches, in order to test what I expect the official tree to be.

Tonight I'll test this kernel once more, then go back to 21-rc7 and
apply your 5 patches and re-test.

florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Success! Was: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 10:45:13PM -0400, Trond Myklebust wrote:
> On Wed, 2007-04-18 at 20:52 -0500, Florin Iucha wrote:
> > It seems that my original problem report had a big mistake!  There is
> > no hang, but at some point the write slows down to a trickle (from
> > 40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log.
> 
> Yeah. You only captured the outgoing traffic to the server, but already
> it looks as if there were 'interesting' things going on. In frames 29346
> to 29350, the traffic stops altogether for 5 seconds (I only see
> keepalives) then it starts up again. Ditto for frames 40477-40482
> (another 5 seconds). ...
> Then at around frame 92072, the client starts to send a bunch of RSTs.
> Aha I'll bet that reverting the appended patch fixes the problem.

You win!

Reverting this patch (on top of your previous 5) allowed the big copy
to complete (70GB) as well as successful log-in to gnome!

Acked-By: Florin Iucha <[EMAIL PROTECTED]>

Thanks so much for the patience with this elusive bug and stubborn
bugreporter!

Regards,
florin

> ---
> commit 43d78ef2ba5bec26d0315859e8324bfc0be23766
> Author: Chuck Lever <[EMAIL PROTECTED]>
> Date:   Tue Feb 6 18:26:11 2007 -0500
> 
> NFS: disconnect before retrying NFSv4 requests over TCP
> 
> RFC3530 section 3.1.1 states an NFSv4 client MUST NOT send a request
> twice on the same connection unless it is the NULL procedure.  Section
> 3.1.1 suggests that the client should disconnect and reconnect if it
> wants to retry a request.
> 
> Implement this by adding an rpc_clnt flag that an ULP can use to
> specify that the underlying transport should be disconnected on a
> major timeout.  The NFSv4 client asserts this new flag, and requests
> no retries after a minor retransmit timeout.
> 
> Note that disconnecting on a retransmit is in general not safe to do
> if the RPC client does not reuse the TCP port number when reconnecting.
> 
> See http://bugzilla.linux-nfs.org/show_bug.cgi?id=6
> 
> Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
> Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>
> 
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index a3191f0..c46e94f 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -394,7 +394,8 @@ static void nfs_init_timeout_values(struct rpc_timeout 
> *to, int proto,
>  static int nfs_create_rpc_client(struct nfs_client *clp, int proto,
>   unsigned int timeo,
>   unsigned int retrans,
> - rpc_authflavor_t flavor)
> + rpc_authflavor_t flavor,
> + int flags)
>  {
>   struct rpc_timeout  timeparms;
>   struct rpc_clnt *clnt = NULL;
> @@ -407,6 +408,7 @@ static int nfs_create_rpc_client(struct nfs_client *clp, 
> int proto,
>   .program= _program,
>   .version= clp->rpc_ops->version,
>   .authflavor = flavor,
> + .flags  = flags,
>   };
>  
>   if (!IS_ERR(clp->cl_rpcclient))
> @@ -548,7 +550,7 @@ static int nfs_init_client(struct nfs_client *clp, const 
> struct nfs_mount_data *
>* - RFC 2623, sec 2.3.2
>*/
>   error = nfs_create_rpc_client(clp, proto, data->timeo, data->retrans,
> - RPC_AUTH_UNIX);
> + RPC_AUTH_UNIX, 0);
>   if (error < 0)
>   goto error;
>   nfs_mark_client_ready(clp, NFS_CS_READY);
> @@ -868,7 +870,8 @@ static int nfs4_init_client(struct nfs_client *clp,
>   /* Check NFS protocol revision and initialize RPC op vector */
>   clp->rpc_ops = _v4_clientops;
>  
> - error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour);
> + error = nfs_create_rpc_client(clp, proto, timeo, retrans, authflavour,
> + RPC_CLNT_CREATE_DISCRTRY);
>   if (error < 0)
>   goto error;
>   memcpy(clp->cl_ipaddr, ip_addr, sizeof(clp->cl_ipaddr));
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index a1be89d..c7a78ee 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -40,6 +40,7 @@ struct rpc_clnt {
>  
>   unsigned intcl_softrtry : 1,/* soft timeouts */
>   cl_intr : 1,/* interruptible */
> + cl_discrtry : 1,/* disconnect bef

Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote:
> Do you have a copy of wireshark or ethereal on hand? If so, could you
> take a look at whether or not any NFS traffic is going between the
> client and server once the hang happens?

I used the following command 

   tcpdump -w nfs-traffic -i eth0 -vv -tt dst port nfs

to capture

   http://iucha.net/nfs/21-rc7-nfs4/nfs-traffic.bz2

I started the capture before starting the copy and left it to run for
a few minutes after the traffic slowed to a crawl.

The iostat and vmstat are at:

   http://iucha.net/nfs/21-rc7-nfs4/iostat
   http://iucha.net/nfs/21-rc7-nfs4/vmstat
   
It seems that my original problem report had a big mistake!  There is
no hang, but at some point the write slows down to a trickle (from
40,000 blocks/s to 22 blocks/s) as can be seen from the iostat log.

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote:
> On Wed, 2007-04-18 at 08:42 -0500, Florin Iucha wrote:
> > Could the port in CLOSE_WAIT state be the culprit?  (FWIW
> > the server has been up for 38 days and subjected to
> > this nfs test quite a bit without showing any stress).
> 
> The port in CLOSE_WAIT shows that a socket was closed down recently, but
> once the connection is re-established, the client should start sending
> data.
> Do you have a copy of wireshark or ethereal on hand? If so, could you
> take a look at whether or not any NFS traffic is going between the
> client and server once the hang happens?
> Note that the timeout value is 60 seconds, so if you see no immediate
> traffic, then let the ethereal/wireshark session keep running for a
> couple more minutes.

Should I run wireshark/ethereal on the client or on the server?

I'll get a trace tonight (10 PM CST) and get back to you.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 08:42:25AM -0500, Florin Iucha wrote:
> On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote:
> The netstat outputs are stable (not changed in 5 minutes):
> 
>http://iucha.net/nfs/21-rc7-nfs3/netstat-server :
> 
> tcp1  0 hermes.iucha.org:nfszeus.iucha.org:799  
> CLOSE_WAIT 
> tcp0  0 hermes.iucha.org:nfszeus.iucha.org:976  
> ESTABLISHED
> 
>http://iucha.net/nfs/21-rc7-nfs3/netstat-client
> 
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address   Foreign Address State 
>  
> tcp0  0 zeus.iucha.org:976  hermes.iucha.org:nfs
> ESTABLISHED
> tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:56880  
> ESTABLISHED
> tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:45176  
> ESTABLISHED
> 
> Could the port in CLOSE_WAIT state be the culprit?  (FWIW
> the server has been up for 38 days and subjected to
> this nfs test quite a bit without showing any stress).

The CLOSE_WAIT went away as soon as I rebooted the client.  Something
was holding it up...

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote:
> There is only one request on the 'pending' queue. That would usually
> indicate that the connection to the server is down. Can you check using
> "netstat -t" whether or not there is a connection in the 'ESTABLISHED'
> state to the server? Please also repeat the command a couple of times in
> order to see if the socket/port number on the connection changes.

This is with your fifth patch on top of the previous four patches:

   http://iucha.net/nfs/21-rc7-nfs3/big-copy

Again, it has memory, stack traces and rpc_debug.

The iostat 5 output:

   http://iucha.net/nfs/21-rc7-nfs3/iostat

The netstat outputs are stable (not changed in 5 minutes):

   http://iucha.net/nfs/21-rc7-nfs3/netstat-server :

tcp1  0 hermes.iucha.org:nfszeus.iucha.org:799  CLOSE_WAIT 
tcp0  0 hermes.iucha.org:nfszeus.iucha.org:976  ESTABLISHED

   http://iucha.net/nfs/21-rc7-nfs3/netstat-client

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address   Foreign Address State  
tcp0  0 zeus.iucha.org:976  hermes.iucha.org:nfsESTABLISHED
tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:56880  ESTABLISHED
tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:45176  ESTABLISHED

Could the port in CLOSE_WAIT state be the culprit?  (FWIW
the server has been up for 38 days and subjected to
this nfs test quite a bit without showing any stress).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Tue, Apr 17, 2007 at 10:37:38PM -0700, Andrew Morton wrote:
> Florin, can we please see /proc/meminfo as well?

   http://iucha.net/nfs/21-rc7-nfs2/meminfo

> Also the result of `echo m > /proc/sysrq-trigger'

   http://iucha.net/nfs/21-rc7-nfs2/big-copy

This has 'echo m > /proc/sysrq-trigger', 'echo t >
/proc/sysrq-trigger' and 'echo 0 > /proc/sys/sunrpc/rpc_debug'.

The output from the server's 'iostat 5' is at

   http://iucha.net/nfs/21-rc7-nfs2/iostat

This run it copied 5.6G (vs yesterday's 2.5G).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Tue, Apr 17, 2007 at 10:37:38PM -0700, Andrew Morton wrote:
 Florin, can we please see /proc/meminfo as well?

   http://iucha.net/nfs/21-rc7-nfs2/meminfo

 Also the result of `echo m  /proc/sysrq-trigger'

   http://iucha.net/nfs/21-rc7-nfs2/big-copy

This has 'echo m  /proc/sysrq-trigger', 'echo t 
/proc/sysrq-trigger' and 'echo 0  /proc/sys/sunrpc/rpc_debug'.

The output from the server's 'iostat 5' is at

   http://iucha.net/nfs/21-rc7-nfs2/iostat

This run it copied 5.6G (vs yesterday's 2.5G).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote:
 There is only one request on the 'pending' queue. That would usually
 indicate that the connection to the server is down. Can you check using
 netstat -t whether or not there is a connection in the 'ESTABLISHED'
 state to the server? Please also repeat the command a couple of times in
 order to see if the socket/port number on the connection changes.

This is with your fifth patch on top of the previous four patches:

   http://iucha.net/nfs/21-rc7-nfs3/big-copy

Again, it has memory, stack traces and rpc_debug.

The iostat 5 output:

   http://iucha.net/nfs/21-rc7-nfs3/iostat

The netstat outputs are stable (not changed in 5 minutes):

   http://iucha.net/nfs/21-rc7-nfs3/netstat-server :

tcp1  0 hermes.iucha.org:nfszeus.iucha.org:799  CLOSE_WAIT 
tcp0  0 hermes.iucha.org:nfszeus.iucha.org:976  ESTABLISHED

   http://iucha.net/nfs/21-rc7-nfs3/netstat-client

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address   Foreign Address State  
tcp0  0 zeus.iucha.org:976  hermes.iucha.org:nfsESTABLISHED
tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:56880  ESTABLISHED
tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:45176  ESTABLISHED

Could the port in CLOSE_WAIT state be the culprit?  (FWIW
the server has been up for 38 days and subjected to
this nfs test quite a bit without showing any stress).

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 08:42:25AM -0500, Florin Iucha wrote:
 On Wed, Apr 18, 2007 at 09:15:31AM -0400, Trond Myklebust wrote:
 The netstat outputs are stable (not changed in 5 minutes):
 
http://iucha.net/nfs/21-rc7-nfs3/netstat-server :
 
 tcp1  0 hermes.iucha.org:nfszeus.iucha.org:799  
 CLOSE_WAIT 
 tcp0  0 hermes.iucha.org:nfszeus.iucha.org:976  
 ESTABLISHED
 
http://iucha.net/nfs/21-rc7-nfs3/netstat-client
 
 Active Internet connections (w/o servers)
 Proto Recv-Q Send-Q Local Address   Foreign Address State 
  
 tcp0  0 zeus.iucha.org:976  hermes.iucha.org:nfs
 ESTABLISHED
 tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:56880  
 ESTABLISHED
 tcp0  0 zeus.iucha.org:ssh  hermes.iucha.org:45176  
 ESTABLISHED
 
 Could the port in CLOSE_WAIT state be the culprit?  (FWIW
 the server has been up for 38 days and subjected to
 this nfs test quite a bit without showing any stress).

The CLOSE_WAIT went away as soon as I rebooted the client.  Something
was holding it up...

Regards,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Florin Iucha
On Wed, Apr 18, 2007 at 10:11:46AM -0400, Trond Myklebust wrote:
 On Wed, 2007-04-18 at 08:42 -0500, Florin Iucha wrote:
  Could the port in CLOSE_WAIT state be the culprit?  (FWIW
  the server has been up for 38 days and subjected to
  this nfs test quite a bit without showing any stress).
 
 The port in CLOSE_WAIT shows that a socket was closed down recently, but
 once the connection is re-established, the client should start sending
 data.
 Do you have a copy of wireshark or ethereal on hand? If so, could you
 take a look at whether or not any NFS traffic is going between the
 client and server once the hang happens?
 Note that the timeout value is 60 seconds, so if you see no immediate
 traffic, then let the ethereal/wireshark session keep running for a
 couple more minutes.

Should I run wireshark/ethereal on the client or on the server?

I'll get a trace tonight (10 PM CST) and get back to you.

Thanks,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163


signature.asc
Description: Digital signature


  1   2   >