Re: Oops in prune_dcache (2.4.0-prerelease)
On  3 Jan 01 at 13:08, Udo A. Steinberg wrote:
> Alexander Viro wrote:
> >
> > In principle, it might be that d_find_alias() is broken. I don't see where
> > it could happen, but then I'm half-asleep right now... While we are at it,
> > do you have
>
> > 	* autofs
>
> Yes.
>
> > 	* knfsd
> > 	* ncpfs
>
> No, neither of these two.

I saw oopses in prune_dcache() during umount() of ncpfs circa 6 months
ago. As I was never able to reproduce the problem, and it stopped
happening as unexpectedly as it appeared, I never reported it.

And ~2 times I got an endless loop in d_prune_aliases(), where it somehow
happened that the d_alias list looked like

	1 -> 2 -> 3 -> 4 -> 2 -> 3 -> 4 ...

(maybe after pruning d_count = 0 entries...), so it never stopped :-(
But it really happened long, long ago, I think sometime in June-September
2000, and a couple of pieces of logic have changed since then in both
ncpfs and vfs.
					Best regards,
						Petr Vandrovec
						[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
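The endless walk described above (1 -> 2 -> 3 -> 4 -> 2 -> 3 -> 4 ...) is the classic shape of a chain whose tail points back into its middle. As a rough illustration only, here is a minimal user-space sketch of Floyd's two-pointer check for that shape. It assumes a simplified NULL-terminated singly linked list; the kernel's d_alias list is a circular struct list_head that terminates on reaching the head again, so this is not the kernel's code. The names `struct node`, `chain_has_cycle` and `link_like_report` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a list node (not the kernel's struct list_head). */
struct node {
    struct node *next;
};

/* Links n[0..3] as 1 -> 2 -> 3 -> 4 -> 2, the shape described in the message. */
void link_like_report(struct node n[4])
{
    assert(n != NULL);
    n[0].next = &n[1];
    n[1].next = &n[2];
    n[2].next = &n[3];
    n[3].next = &n[1];  /* tail points back at the second node */
}

/*
 * Floyd's cycle check: advance one pointer by one step and another by two.
 * Returns 1 if the two meet (the walk would never reach NULL), 0 otherwise.
 */
int chain_has_cycle(const struct node *head)
{
    const struct node *slow = head;
    const struct node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

A debugging build could run such a check before a list walk and report the corruption instead of spinning; whether that would have caught the case reported here is not known.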
Re: Oops in prune_dcache (2.4.0-prerelease)
Hi,

Alexander Viro wrote:
>
> In principle, it might be that d_find_alias() is broken. I don't see where
> it could happen, but then I'm half-asleep right now... While we are at it,
> do you have

> 	* autofs

Yes.

> 	* knfsd
> 	* ncpfs

No, neither of these two.

-Udo.
Re: Oops in prune_dcache (2.4.0-prerelease)
On Wed, 3 Jan 2001, Udo A. Steinberg wrote:

> Dan Aloni wrote:
> >
> > After a bit of code reviewing, it looks like the only code that
> > assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.
> >
> > Udo, are you using vfat?
>
> Yes.

In principle, it might be that d_find_alias() is broken. I don't see where
it could happen, but then I'm half-asleep right now... While we are at it,
do you have
	* autofs
	* knfsd
	* ncpfs
Re: Oops in prune_dcache (2.4.0-prerelease)
Dan Aloni wrote:
>
> After a bit of code reviewing, it looks like the only code that
> assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.
>
> Udo, are you using vfat?

Yes.

-Udo.
Re: Oops in prune_dcache (2.4.0-prerelease)
On Wed, 3 Jan 2001, Dan Aloni wrote:

> After a bit of code reviewing, it looks like the only code that
> assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.
>
> Udo, are you using vfat?

If it had been assigned by something that was supposed to set ->d_op, it
would not have got such a value. Whatever did that had no idea of ->d_op
or struct dentry in the first place.
Re: Oops in prune_dcache (2.4.0-prerelease)
On Tue, 2 Jan 2001, Linus Torvalds wrote:

> On Wed, 3 Jan 2001, Udo A. Steinberg wrote:
> >
> > While under massive disk and cpu load, 2.4.0-prerelease produced
> > the following oops (decode see below)
[..]
> Now, I assume this machine has been historically stable, with no history
> of memory corruption problems.. It's entirely possible (and likely) that
> the one-bit error is due to some wild kernel pointer. Which makes this
> _really_ hard to debug.

After a bit of code reviewing, it looks like the only code that
assigns stuff to ->d_op in a nonstandard way is in fs/vfat/namei.c.

Udo, are you using vfat?

--
Dan Aloni
[EMAIL PROTECTED]
Re: Oops in prune_dcache (2.4.0-prerelease)
Hi,

Linus Torvalds wrote:
>
> The strange thing is that 0x0100 value, which almost certainly should
> just be NULL. A one-bit error.
>
> Now, I assume this machine has been historically stable, with no history
> of memory corruption problems.. It's entirely possible (and likely) that
> the one-bit error is due to some wild kernel pointer. Which makes this
> _really_ hard to debug.

Yes, the machine is otherwise rock stable, not overclocked, and memory
timings are rather conservative. Before the oops the machine had been
compiling some major application for about 5 hours, and maybe the
excessive stress kicked a bit somewhere - who knows.

> I'll try to think about it some more, but I'd love to have more reports to
> go on to try to find a pattern..

That's one I can't reproduce. I've just run memtest86 over the entire RAM
and it doesn't show any oddities - which doesn't really rule out an
occasional bit-flip due to neutrino storms though ;-)

If someone else has seen something similar lately, it's time to speak up.

-Udo.
Re: Oops in prune_dcache (2.4.0-prerelease)
On Wed, 3 Jan 2001, Udo A. Steinberg wrote:

>
> Hi Linus et al.
>
> While under massive disk and cpu load, 2.4.0-prerelease produced
> the following oops (decode see below)

> Unable to handle kernel paging request at virtual address 0114

> Code;  c01419cc <prune_dcache+9c/120>   <=====
>    0:   8b 40 14                  movl   0x14(%eax),%eax   <=====
> Code;  c01419cf <prune_dcache+9f/120>
>    3:   85 c0                     testl  %eax,%eax
> Code;  c01419d1 <prune_dcache+a1/120>
>    5:   74 09                     je     10 <_EIP+0x10> c01419dc <prune_dcache+ac/120>

dentry->d_op == 0x100 in dentry_iput(). 9:1 that you've got bit 24 flipped
(i.e. it was supposed to be NULL and you are seeing an effect of a
hardware problem).
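To make the "one bit flipped away from NULL" reading concrete, here is a small user-space sketch that tests whether a 32-bit value has exactly one bit set and reports where that bit sits. The helper names are hypothetical. Counted from the least significant end (bit 0), the set bit in 0x100 is bit 8. Counted from the most significant end of a 32-bit word, starting at 1, the same bit is position 24. That may be the numbering behind "bit 24", but this is only a guess.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if exactly one bit is set in v, else 0. */
int is_single_bit(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Index of the single set bit, counting from the least significant bit (bit 0). */
int bit_index_lsb0(uint32_t v)
{
    int idx = 0;

    assert(is_single_bit(v));
    while ((v & 1u) == 0) {
        v >>= 1;
        idx++;
    }
    return idx;
}

/* Same bit counted from the most significant end of a 32-bit word, 1-based. */
int bit_position_msb1(uint32_t v)
{
    return 32 - bit_index_lsb0(v);
}
```

A value such as 0x100 passes the single-bit test, which is what makes a flipped bit in an otherwise NULL pointer a plausible explanation; it does not by itself distinguish a hardware fault from a stray write.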
Re: Oops in prune_dcache (2.4.0-prerelease)
On Wed, 3 Jan 2001, Udo A. Steinberg wrote:
>
> While under massive disk and cpu load, 2.4.0-prerelease produced
> the following oops (decode see below)

Hmm.. If I'm not mistaken, this is in dentry_iput() (inline function
called by prune_one_dentry(), which is _also_ an inline function, which is
why it gets reported as being in prune_dcache):

	if (dentry->d_op && dentry->d_op->d_iput)
		dentry->d_op->d_iput(dentry, inode);

and it looks like your dentry->d_op has a value of 0x0100, so when we load
the d_op->d_iput pointer, we get a page fault.

The strange thing is that 0x0100 value, which almost certainly should
just be NULL. A one-bit error.

Now, I assume this machine has been historically stable, with no history
of memory corruption problems.. It's entirely possible (and likely) that
the one-bit error is due to some wild kernel pointer. Which makes this
_really_ hard to debug.

I'll try to think about it some more, but I'd love to have more reports to
go on to try to find a pattern..

		Linus
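The arithmetic behind the faulting address can be sketched in user space. The oops shows `movl 0x14(%eax),%eax` faulting at address 0114, which fits a d_op of 0x100 plus a d_iput field at offset 0x14. The struct below is a hypothetical 32-bit model in which each function pointer is a 4-byte slot. The field order is recalled from 2.4-era headers and is an assumption to be checked against include/linux/dcache.h, and the names `dentry_operations_model32` and `d_iput_load_address` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a 32-bit struct dentry_operations layout.
 * Each uint32_t stands in for one 4-byte function pointer.
 * The field order is an assumption, not copied from the kernel tree.
 */
struct dentry_operations_model32 {
    uint32_t d_revalidate;
    uint32_t d_hash;
    uint32_t d_compare;
    uint32_t d_delete;
    uint32_t d_release;
    uint32_t d_iput;
};

/* In this model the sixth slot sits at the offset seen in the disassembly. */
static_assert(offsetof(struct dentry_operations_model32, d_iput) == 0x14,
              "d_iput is expected at offset 0x14 in the 32-bit model");

/* Address the CPU would read when evaluating dentry->d_op->d_iput. */
uint32_t d_iput_load_address(uint32_t d_op)
{
    return d_op + (uint32_t)offsetof(struct dentry_operations_model32, d_iput);
}
```

With d_op equal to 0x100 the computed load address is 0x114, matching the address in the oops. With d_op equal to NULL the `dentry->d_op &&` test would have short-circuited and no load would have happened at all.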