Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Petr Vandrovec

On  3 Jan 01 at 13:08, Udo A. Steinberg wrote:
> Alexander Viro wrote:
> >
> > In principle, it might be that d_find_alias() is broken. I don't see where
> > it could happen, but then I'm half-asleep right now...  While we are at it,
> > do you have
> 
> > * autofs
> 
> Yes.
> 
> > * knfsd
> > * ncpfs
> 
> No, neither of these two.

I saw oopses in prune_dcache() during umount() of ncpfs circa 6 months
ago. As I was never able to reproduce the problem, and it stopped
happening as unexpectedly as it had appeared, I never reported it. And
about twice I got an endless loop in d_prune_aliases(), where somehow
the d_alias list ended up looking like

1 -> 2 -> 3 -> 4 -> 2 -> 3 -> 4 ... (maybe after pruning d_count = 0
entries...)

so it never stopped :-( But it really happened a long time ago, I think
sometime in June-September 2000, and a good deal of logic has changed
since then in both ncpfs and vfs.
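
A minimal sketch of how such a corrupted alias list could be detected
instead of looping forever. It is a user-space model, not kernel code:
struct node is a hypothetical stand-in for the forward link of
struct list_head, ring_is_broken() is a made-up name, and the check is
Floyd's tortoise-and-hare cycle detection rather than anything
d_prune_aliases() actually does.

```c
#include <stddef.h>

/* Hypothetical stand-in for the forward link of the kernel's list_head. */
struct node {
    struct node *next;
};

/*
 * Returns 0 if a walk starting at head->next eventually comes back to
 * head (a well-formed ring), 1 if it gets trapped in a cycle that never
 * reaches head again, as in the 1 -> 2 -> 3 -> 4 -> 2 ... case.
 * Assumes no NULL links.
 */
static int ring_is_broken(const struct node *head)
{
    const struct node *slow = head->next;
    const struct node *fast = head->next;

    for (;;) {
        if (fast == head || fast->next == head)
            return 0;   /* the walk terminates normally */
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;   /* pointers met inside a cycle that skips head */
    }
}
```

With four nodes linked 1 -> 2 -> 3 -> 4 -> 1 the function returns 0;
after redirecting the last link to node 2 it returns 1.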
Best regards,
Petr Vandrovec
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Udo A. Steinberg

Hi,

Alexander Viro wrote:
>
> In principle, it might be that d_find_alias() is broken. I don't see where
> it could happen, but then I'm half-asleep right now...  While we are at it,
> do you have

> * autofs

Yes.

> * knfsd
> * ncpfs

No, neither of these two.

-Udo.



Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Alexander Viro



On Wed, 3 Jan 2001, Udo A. Steinberg wrote:

> Dan Aloni wrote:
> > 
> > After a bit of code review, it looks like the only code that
> > assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.
> > 
> > Udo, are you using vfat?
> 
> Yes.

In principle, it might be that d_find_alias() is broken. I don't see where
it could happen, but then I'm half-asleep right now...  While we are at it,
do you have
* autofs
* knfsd
* ncpfs




Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Udo A. Steinberg

Dan Aloni wrote:
> 
> After a bit of code review, it looks like the only code that
> assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.
> 
> Udo, are you using vfat?

Yes.

-Udo.



Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Alexander Viro



On Wed, 3 Jan 2001, Dan Aloni wrote:

> After a bit of code review, it looks like the only code that
> assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c. 
> 
> Udo, are you using vfat?

If it was assigned by something that was supposed to set ->d_op
it would not get such value. Whatever had done that had no idea of the
->d_op or struct dentry in the first place.
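
A rough sketch of the reasoning above, under stated assumptions: on
i386 with the default 3G/1G split, kernel data (including any
struct dentry_operations) lives at or above 0xC0000000, so code that
meant to set ->d_op would store either NULL or an address in that
range. The macro and function names are hypothetical, and 0x01000000 is
assumed to be the full corrupted value implied by the "bit 24" remark
elsewhere in the thread.

```c
#include <stdint.h>

/* Assumed start of kernel virtual addresses on i386 (default 3G/1G split). */
#define PAGE_OFFSET_I386 0xC0000000u

/* Returns 1 if a non-NULL 32-bit value could plausibly be a kernel pointer. */
static int looks_like_kernel_pointer(uint32_t value)
{
    return value >= PAGE_OFFSET_I386;
}
```

The code address from the oops, 0xc01419cc, passes this check, while
0x01000000 does not, which fits the point that nothing trying to assign
a real ->d_op would have produced it.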




Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-03 Thread Dan Aloni

On Tue, 2 Jan 2001, Linus Torvalds wrote:

> On Wed, 3 Jan 2001, Udo A. Steinberg wrote:
> > 
> > While under massive disk and cpu load, 2.4.0-prerelease produced
> > the following oops (decode see below)

[..]

> Now, I assume this machine has been historically stable, with no history
> of memory corruption problems.. It's entirely possible (and likely) that
> the one-bit error is due to some wild kernel pointer. Which makes this
> _really_ hard to debug.

After a bit of code review, it looks like the only code that
assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.

Udo, are you using vfat?

-- 
Dan Aloni 
[EMAIL PROTECTED]




Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-02 Thread Udo A. Steinberg

Hi,

Linus Torvalds wrote:
> 
> The strange thing is that 0x01000000 value, which almost certainly should
> just be NULL. A one-bit error.
> 
> Now, I assume this machine has been historically stable, with no history
> of memory corruption problems.. It's entirely possible (and likely) that
> the one-bit error is due to some wild kernel pointer. Which makes this
> _really_ hard to debug.

Yes, the machine is otherwise rock stable, not overclocked and memory timings
are rather conservative. Before the oops the machine had been compiling some
major application for like 5 hours and maybe the excessive stress kicked a
bit somewhere - who knows.

> I'll try to think about it some more, but I'd love to have more reports to
> go on to try to find a pattern..

That's one I can't reproduce. I've just run memtest86 over the entire RAM
and it doesn't show any oddities - which doesn't really rule out an
occasional bit-flip due to neutrino storms though ;-)
If someone else has seen something similar lately, it's time to speak up.

-Udo.



Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-02 Thread Alexander Viro

On Wed, 3 Jan 2001, Udo A. Steinberg wrote:

> 
> Hi Linus et al.
> 
> While under massive disk and cpu load, 2.4.0-prerelease produced
> the following oops (decode see below)
 
> Unable to handle kernel paging request at virtual address 01000014
>
> Code;  c01419cc <prune_dcache+9c/120>   <=
>    0:   8b 40 14          movl   0x14(%eax),%eax   <=
> Code;  c01419cf <prune_dcache+9f/120>
>    3:   85 c0             testl  %eax,%eax
> Code;  c01419d1 <prune_dcache+a1/120>
>    5:   74 09             je     10 <_EIP+0x10> c01419dc <prune_dcache+ac/120>

dentry->d_op == 0x1000000 in dentry_iput(). Odds are 9:1 that you've got
bit 24 flipped (i.e. it was supposed to be NULL and you are seeing the
effect of a hardware problem).
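
The "one flipped bit" claim can be checked with a little arithmetic.
The sketch below is only an illustration: the helper names are made up,
and it assumes the corrupted ->d_op value was 0x01000000, which is what
a flip of bit 24 in a NULL pointer would produce.

```c
#include <stdint.h>

/* Returns 1 if exactly one bit is set, i.e. the value is one flip from 0. */
static int is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Index of the highest set bit; only meaningful for non-zero input. */
static int bit_index(uint32_t v)
{
    int i = 0;

    while (v >>= 1)
        i++;
    return i;
}
```

For 0x01000000, is_single_bit() returns 1 and bit_index() returns 24,
so the value differs from NULL in bit 24 only.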




Re: Oops in prune_dcache (2.4.0-prerelease)

2001-01-02 Thread Linus Torvalds



On Wed, 3 Jan 2001, Udo A. Steinberg wrote:
> 
> While under massive disk and cpu load, 2.4.0-prerelease produced
> the following oops (decode see below)

Hmm.. If I'm not mistaken, this is in dentry_iput() (inline function
called by prune_one_dentry(), which is _also_ an inline function, which
is why it gets reported as being in prune_dcache):

	if (dentry->d_op && dentry->d_op->d_iput)
		dentry->d_op->d_iput(dentry, inode);

and it looks like your dentry->d_op has a value of 0x01000000, so when we
load the d_op->d_iput pointer, we get a page fault.

The strange thing is that 0x01000000 value, which almost certainly should
just be NULL. A one-bit error.
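
As a rough illustration of why the non-NULL check passes and the next
load faults, here is a user-space model of the address arithmetic. It
is a sketch under assumptions: struct dentry_operations32 is a
hypothetical 32-bit mirror of the operations table (six 4-byte function
pointer slots, with d_iput assumed to be the sixth, which matches the
0x14 displacement in the disassembly), and 0x01000000 is assumed to be
the full corrupted ->d_op value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit layout: each slot models a 4-byte function pointer. */
struct dentry_operations32 {
    uint32_t d_revalidate;
    uint32_t d_hash;
    uint32_t d_compare;
    uint32_t d_delete;
    uint32_t d_release;
    uint32_t d_iput;    /* assumed to sit at offset 0x14 */
};

/* Address the CPU reads when evaluating d_op->d_iput on a 32-bit machine. */
static uint32_t d_iput_load_address(uint32_t d_op)
{
    return d_op + (uint32_t)offsetof(struct dentry_operations32, d_iput);
}
```

With d_op == 0x01000000 the computed address is 0x01000014: the
"d_op != NULL" test succeeds because the value is non-zero, and the
following read of d_op->d_iput hits an unmapped address.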

Now, I assume this machine has been historically stable, with no history
of memory corruption problems.. It's entirely possible (and likely) that
the one-bit error is due to some wild kernel pointer. Which makes this
_really_ hard to debug.

I'll try to think about it some more, but I'd love to have more reports to
go on to try to find a pattern..

Linus



