Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-21 Thread Justin Piszcz



On Fri, 21 Sep 2007, David Chinner wrote:


On Wed, Sep 19, 2007 at 04:47:38AM -0400, Justin Piszcz wrote:

On Mon, 17 Sep 2007, Justin Piszcz wrote:


Including the XFS mailing list in here too because it may be an XFS bug
looking at the call trace.

System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest
packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and
the process went into D-state and I had to reboot.

The config file is from 2.6.20 but it had been moved to a 2.6.22 directory
for an upgrade, but all of the options have been left unchanged.

Here is the *OOPS I captured via dmesg before I rebooted:




Also,

Not sure if this helps but when this happened, any file that was open()
for read/write seem to have also been corrupted..


Is that all files, or just ones that were being changed?
It is the only one I noticed because another program depended upon it 
being not-corrupt.





$ /usr/sbin/xfs_bmap -v myconfig.txt.orig
myconfig.txt.orig:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64601112..64601119 14 (52040..52047)   8
$ /usr/sbin/xfs_bmap -v myconfig.txt
myconfig.txt:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64625720..64625727 14 (76648..76655)   8
$ md5sum myconfig*
db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
$ file myconfig.txt myconfig.txt.orig
myconfig.txt:  ASCII text
myconfig.txt.orig: data
$

$ strings -a myconfig.txt.orig
$

$ od -c myconfig.txt.orig
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
0003500  \0  \0  \0  \0  \0  \0
0003506

Seems like it was NULL'd out?


A single block of zeros - its possible that the crash occurred between
the allocation transaction and the data write - the allocation gets
replayed (along with the new file size), but the data write does
not (not journalled). This is one of the rarer "NULL files on crash"
failure modes fixed in 6.5.22.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-21 Thread Justin Piszcz



On Fri, 21 Sep 2007, David Chinner wrote:


On Wed, Sep 19, 2007 at 04:47:38AM -0400, Justin Piszcz wrote:

On Mon, 17 Sep 2007, Justin Piszcz wrote:


Including the XFS mailing list in here too because it may be an XFS bug
looking at the call trace.

System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest
packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and
the process went into D-state and I had to reboot.

The config file is from 2.6.20 but it had been moved to a 2.6.22 directory
for an upgrade, but all of the options have been left unchanged.

Here is the *OOPS I captured via dmesg before I rebooted:




Also,

Not sure if this helps but when this happened, any file that was open()
for read/write seem to have also been corrupted..


Is that all files, or just ones that were being changed?
It is the only one I noticed because another program depended upon it 
being not-corrupt.





$ /usr/sbin/xfs_bmap -v myconfig.txt.orig
myconfig.txt.orig:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64601112..64601119 14 (52040..52047)   8
$ /usr/sbin/xfs_bmap -v myconfig.txt
myconfig.txt:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64625720..64625727 14 (76648..76655)   8
$ md5sum myconfig*
db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
$ file myconfig.txt myconfig.txt.orig
myconfig.txt:  ASCII text
myconfig.txt.orig: data
$

$ strings -a myconfig.txt.orig
$

$ od -c myconfig.txt.orig
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
0003500  \0  \0  \0  \0  \0  \0
0003506

Seems like it was NULL'd out?


A single block of zeros - its possible that the crash occurred between
the allocation transaction and the data write - the allocation gets
replayed (along with the new file size), but the data write does
not (not journalled). This is one of the rarer NULL files on crash
failure modes fixed in 6.5.22.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-20 Thread David Chinner
On Wed, Sep 19, 2007 at 04:47:38AM -0400, Justin Piszcz wrote:
> On Mon, 17 Sep 2007, Justin Piszcz wrote:
> 
> >Including the XFS mailing list in here too because it may be an XFS bug 
> >looking at the call trace.
> >
> >System: Debian Testing
> >Kernel: 2.6.20
> >Config: Attached
> >
> >I was running apt-get dist-upgrade as I always do to get the latest 
> >packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
> >the process went into D-state and I had to reboot.
> >
> >The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
> >for an upgrade, but all of the options have been left unchanged.
> >
> >Here is the *OOPS I captured via dmesg before I rebooted:
> >
> >
> 
> Also,
> 
> Not sure if this helps but when this happened, any file that was open()
> for read/write seem to have also been corrupted..

Is that all files, or just ones that were being changed?

> $ /usr/sbin/xfs_bmap -v myconfig.txt.orig
> myconfig.txt.orig:
>  EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
>0: [0..7]:  64601112..64601119 14 (52040..52047)   8
> $ /usr/sbin/xfs_bmap -v myconfig.txt
> myconfig.txt:
>  EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
>0: [0..7]:  64625720..64625727 14 (76648..76655)   8
> $ md5sum myconfig*
> db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
> 09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
> $ file myconfig.txt myconfig.txt.orig
> myconfig.txt:  ASCII text
> myconfig.txt.orig: data
> $
> 
> $ strings -a myconfig.txt.orig
> $
> 
> $ od -c myconfig.txt.orig
> 000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
> 0003500  \0  \0  \0  \0  \0  \0
> 0003506
> 
> Seems like it was NULL'd out?

A single block of zeros - its possible that the crash occurred between
the allocation transaction and the data write - the allocation gets
replayed (along with the new file size), but the data write does
not (not journalled). This is one of the rarer "NULL files on crash"
failure modes fixed in 6.5.22.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-20 Thread David Chinner
On Wed, Sep 19, 2007 at 04:47:38AM -0400, Justin Piszcz wrote:
 On Mon, 17 Sep 2007, Justin Piszcz wrote:
 
 Including the XFS mailing list in here too because it may be an XFS bug 
 looking at the call trace.
 
 System: Debian Testing
 Kernel: 2.6.20
 Config: Attached
 
 I was running apt-get dist-upgrade as I always do to get the latest 
 packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
 the process went into D-state and I had to reboot.
 
 The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
 for an upgrade, but all of the options have been left unchanged.
 
 Here is the *OOPS I captured via dmesg before I rebooted:
 
 
 
 Also,
 
 Not sure if this helps but when this happened, any file that was open()
 for read/write seem to have also been corrupted..

Is that all files, or just ones that were being changed?

 $ /usr/sbin/xfs_bmap -v myconfig.txt.orig
 myconfig.txt.orig:
  EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
0: [0..7]:  64601112..64601119 14 (52040..52047)   8
 $ /usr/sbin/xfs_bmap -v myconfig.txt
 myconfig.txt:
  EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
0: [0..7]:  64625720..64625727 14 (76648..76655)   8
 $ md5sum myconfig*
 db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
 09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
 $ file myconfig.txt myconfig.txt.orig
 myconfig.txt:  ASCII text
 myconfig.txt.orig: data
 $
 
 $ strings -a myconfig.txt.orig
 $
 
 $ od -c myconfig.txt.orig
 000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
 0003500  \0  \0  \0  \0  \0  \0
 0003506
 
 Seems like it was NULL'd out?

A single block of zeros - its possible that the crash occurred between
the allocation transaction and the data write - the allocation gets
replayed (along with the new file size), but the data write does
not (not journalled). This is one of the rarer NULL files on crash
failure modes fixed in 6.5.22.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-19 Thread Justin Piszcz

On Mon, 17 Sep 2007, Justin Piszcz wrote:

Including the XFS mailing list in here too because it may be an XFS bug 
looking at the call trace.


System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest packages 
upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and the process 
went into D-state and I had to reboot.


The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
for an upgrade, but all of the options have been left unchanged.


Here is the *OOPS I captured via dmesg before I rebooted:




Also,

Not sure if this helps but when this happened, any file that was open()
for read/write seem to have also been corrupted..

$ /usr/sbin/xfs_bmap -v myconfig.txt.orig
myconfig.txt.orig:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64601112..64601119 14 (52040..52047)   8
$ /usr/sbin/xfs_bmap -v myconfig.txt
myconfig.txt:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64625720..64625727 14 (76648..76655)   8
$ md5sum myconfig*
db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
$ file myconfig.txt myconfig.txt.orig
myconfig.txt:  ASCII text
myconfig.txt.orig: data
$

$ strings -a myconfig.txt.orig
$

$ od -c myconfig.txt.orig
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
0003500  \0  \0  \0  \0  \0  \0
0003506

Seems like it was NULL'd out?

Justin.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-19 Thread Justin Piszcz

On Mon, 17 Sep 2007, Justin Piszcz wrote:

Including the XFS mailing list in here too because it may be an XFS bug 
looking at the call trace.


System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest packages 
upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and the process 
went into D-state and I had to reboot.


The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
for an upgrade, but all of the options have been left unchanged.


Here is the *OOPS I captured via dmesg before I rebooted:




Also,

Not sure if this helps but when this happened, any file that was open()
for read/write seem to have also been corrupted..

$ /usr/sbin/xfs_bmap -v myconfig.txt.orig
myconfig.txt.orig:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64601112..64601119 14 (52040..52047)   8
$ /usr/sbin/xfs_bmap -v myconfig.txt
myconfig.txt:
 EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSETTOTAL
   0: [0..7]:  64625720..64625727 14 (76648..76655)   8
$ md5sum myconfig*
db8c50ca2c86d2e757ecef1d6b3fcc69  myconfig.txt
09fb630623b3ae614511cef4c7a21063  myconfig.txt.orig
$ file myconfig.txt myconfig.txt.orig
myconfig.txt:  ASCII text
myconfig.txt.orig: data
$

$ strings -a myconfig.txt.orig
$

$ od -c myconfig.txt.orig
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 *
0003500  \0  \0  \0  \0  \0  \0
0003506

Seems like it was NULL'd out?

Justin.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread Christoph Hellwig
On Tue, Sep 18, 2007 at 08:39:16PM +1000, David Chinner wrote:
> The i_sem leak was hitting us on sles9 - 2.6.5 base kernel - and it was fixed
> before the i_sem -> i_mutex conversion in mainline. Some time around 2.6.16,
> IIRC. Given this was a 2.6.20 kernel, there'd be an almighty kaboom if that
> bug still existed after the i_mutex conversion

We didn't fix it before the i_mutex conversion - I saw the mutex debugging
messages after the i_mutex conversion, but didn't see the corelation.
Probably still long before 2.6.22, though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 10:20:13AM +0100, Christoph Hellwig wrote:
> On Tue, Sep 18, 2007 at 11:45:37AM +1000, David Chinner wrote:
> > No idea - it looks like dkpg was trying to remove a directory on the
> > same path the lookup was and both have gone splat in __d_lookup on
> > the same dentry. Something happened in  those 180 days that left a
> > landmine that was tripped over here, I think. I can't see any way of
> > tracking it down from this, but thanks for reporting it anyway,
> 
> This looks a lot like the i_sem leak that Vlad debugged.  Do you remember
> where this was fixed?

The i_sem leak was hitting us on sles9 - 2.6.5 base kernel - and it was fixed
before the i_sem -> i_mutex conversion in mainline. Some time around 2.6.16,
IIRC. Given this was a 2.6.20 kernel, there'd be an almighty kaboom if that
bug still existed after the i_mutex conversion

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread Christoph Hellwig
On Tue, Sep 18, 2007 at 11:45:37AM +1000, David Chinner wrote:
> No idea - it looks like dkpg was trying to remove a directory on the
> same path the lookup was and both have gone splat in __d_lookup on
> the same dentry. Something happened in  those 180 days that left a
> landmine that was tripped over here, I think. I can't see any way of
> tracking it down from this, but thanks for reporting it anyway,

This looks a lot like the i_sem leak that Vlad debugged.  Do you remember
where this was fixed?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread Christoph Hellwig
On Tue, Sep 18, 2007 at 11:45:37AM +1000, David Chinner wrote:
 No idea - it looks like dkpg was trying to remove a directory on the
 same path the lookup was and both have gone splat in __d_lookup on
 the same dentry. Something happened in  those 180 days that left a
 landmine that was tripped over here, I think. I can't see any way of
 tracking it down from this, but thanks for reporting it anyway,

This looks a lot like the i_sem leak that Vlad debugged.  Do you remember
where this was fixed?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread David Chinner
On Tue, Sep 18, 2007 at 10:20:13AM +0100, Christoph Hellwig wrote:
 On Tue, Sep 18, 2007 at 11:45:37AM +1000, David Chinner wrote:
  No idea - it looks like dkpg was trying to remove a directory on the
  same path the lookup was and both have gone splat in __d_lookup on
  the same dentry. Something happened in  those 180 days that left a
  landmine that was tripped over here, I think. I can't see any way of
  tracking it down from this, but thanks for reporting it anyway,
 
 This looks a lot like the i_sem leak that Vlad debugged.  Do you remember
 where this was fixed?

The i_sem leak was hitting us on sles9 - 2.6.5 base kernel - and it was fixed
before the i_sem - i_mutex conversion in mainline. Some time around 2.6.16,
IIRC. Given this was a 2.6.20 kernel, there'd be an almighty kaboom if that
bug still existed after the i_mutex conversion

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-18 Thread Christoph Hellwig
On Tue, Sep 18, 2007 at 08:39:16PM +1000, David Chinner wrote:
 The i_sem leak was hitting us on sles9 - 2.6.5 base kernel - and it was fixed
 before the i_sem - i_mutex conversion in mainline. Some time around 2.6.16,
 IIRC. Given this was a 2.6.20 kernel, there'd be an almighty kaboom if that
 bug still existed after the i_mutex conversion

We didn't fix it before the i_mutex conversion - I saw the mutex debugging
messages after the i_mutex conversion, but didn't see the corelation.
Probably still long before 2.6.22, though.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-17 Thread David Chinner
On Mon, Sep 17, 2007 at 01:20:17PM -0400, Justin Piszcz wrote:
> Including the XFS mailing list in here too because it may be an XFS bug 
> looking at the call trace.
> 
> System: Debian Testing
> Kernel: 2.6.20
> Config: Attached
> 
> I was running apt-get dist-upgrade as I always do to get the latest 
> packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
> the process went into D-state and I had to reboot.
> 
> The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
> for an upgrade, but all of the options have been left unchanged.
> 
> Here is the *OOPS I captured via dmesg before I rebooted:
> 
> [16201055.214559] nfsd: last server has exited
> [16201055.214566] nfsd: unexporting all filesystems
> [17341583.697472] BUG: unable to handle kernel paging request at virtual 
> address 99e00750
> [17341583.697480]  printing eip:
> [17341583.697482] c01531b0
> [17341583.697484] *pde = 
> [17341583.697488] Oops:  [#1]
> [17341583.697491] CPU:0
> [17341583.697493] EIP:0060:[]Not tainted VLI
> [17341583.697494] EFLAGS: 00210286   (2.6.20 #3)
> [17341583.697502] EIP is at __d_lookup+0x5d/0xd6
> [17341583.697505] eax: c8d7c17e   ebx: 99e00750   ecx: 0011   edx: 
> c17f9200
> [17341583.697508] esi: 99e00750   edi: d2a10016   ebp: c7fe2304   esp: 
> dba35d98
> [17341583.697511] ds: 007b   es: 007b   ss: 0068
> [17341583.697514] Process kdm_greet (pid: 22119, ti=dba34000 task=f52d4a70 
> task.ti=dba34000)
> [17341583.697516] Stack: c8d7c17e  dba35e10 f705d478 dba35db8 
> 002c d2a10016 d2a10042 [17341583.697522]dba35e10 dba35f30 
> dba35e10 c014ab6d dba35e1c c18c5240 dba35f04 c021877e [17341583.697528] 
> d2a10042 dba35e10 c8d7c17e dba35f30 c014c38f d2a10016 0101 dba35e48 
> [17341583.697534] Call Trace:
> [17341583.697537]  [] do_lookup+0x1c/0x168
> [17341583.697540]  [] xfs_vn_lookup+0x53/0x77
> [17341583.697547]  [] __link_path_walk+0x6e8/0xb1b
> [17341583.697551]  [] dput+0x18/0x121
> [17341583.697554]  [] link_path_walk+0x43/0xb8
> [17341583.697558]  [] do_path_lookup+0x75/0x181
> [17341583.697561]  [] get_empty_filp+0x2f/0xe5
> [17341583.697566]  [] __path_lookup_intent_open+0x45/0x80
> [17341583.697570]  [] path_lookup_open+0x20/0x25
> [17341583.697573]  [] open_namei+0x66/0x58a
> [17341583.697576]  [] do_filp_open+0x25/0x40
> [17341583.697580]  [] do_sys_open+0x3e/0xc7
> [17341583.697584]  [] sys_open+0x1c/0x20
> [17341583.697587]  [] syscall_call+0x7/0xb
> [17341583.697591]  ===
> [17341583.697593] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
> 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
> 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
> [17341583.697621] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 
> 0068:dba35d98
> [17341583.697626]  <1>BUG: unable to handle kernel paging request at 
> virtual address 99e00750
> [17341648.066740]  printing eip:
> [17341648.066786] c01531b0
> [17341648.066868] *pde = 
> [17341648.066916] Oops:  [#2]
> [17341648.066965] CPU:0
> [17341648.066966] EIP:0060:[]Not tainted VLI
> [17341648.066967] EFLAGS: 00010286   (2.6.20 #3)
> [17341648.067115] EIP is at __d_lookup+0x5d/0xd6
> [17341648.067165] eax: 1efcce0e   ebx: 99e00750   ecx: 0011   edx: 
> c17f9200
> [17341648.067219] esi: 99e00750   edi: cc87901a   ebp: c7fe2304   esp: 
> f7755f04
> [17341648.067271] ds: 007b   es: 007b   ss: 0068
> [17341648.067320] Process dpkg (pid: 24684, ti=f7754000 task=d9846a70 
> task.ti=f7754000)
> [17341648.067371] Stack: 1efcce0e 46dd3a20 f7755f5c e489fe28  
> 0010 cc87901a  [17341648.067715]e489fe28 0001 
> f7755f54 c014b7cb f7755f5c ef0d4098 ffd9 cc879000 [17341648.068056] 
> 0001 f7755f54 c014cf84 f7755f54 e489fe28 c18c5240 1efcce0e 0010 
> [17341648.068397] Call Trace:
> [17341648.068482]  [] __lookup_hash+0x4a/0xef
> [17341648.068563]  [] do_rmdir+0x69/0xbb
> [17341648.068642]  [] syscall_call+0x7/0xb
> [17341648.068724]  ===
> [17341648.068770] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
> 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
> 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
> [17341648.070874] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 
> 0068:f7755f04
> [17341648.070988]
> 
> I doubt I can reproduce it as it has happened after 180 days or so, and I 
> am upgrading to 2.6.22.6 but I was wondering what exactly happened here?

No idea - it looks like dkpg was trying to remove a directory on the
same path the lookup was and both have gone splat in __d_lookup on
the same dentry. Something happened in  those 180 days that left a
landmine that was tripped over here, I think. I can't see any way of
tracking it down from this, but thanks for reporting it anyway,
Justin.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To 

2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-17 Thread Justin Piszcz
Including the XFS mailing list in here too because it may be an XFS bug 
looking at the call trace.


System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest 
packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
the process went into D-state and I had to reboot.


The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
for an upgrade, but all of the options have been left unchanged.


Here is the *OOPS I captured via dmesg before I rebooted:

[16201055.214559] nfsd: last server has exited
[16201055.214566] nfsd: unexporting all filesystems
[17341583.697472] BUG: unable to handle kernel paging request at virtual 
address 99e00750
[17341583.697480]  printing eip:
[17341583.697482] c01531b0
[17341583.697484] *pde = 
[17341583.697488] Oops:  [#1]
[17341583.697491] CPU:0
[17341583.697493] EIP:0060:[]Not tainted VLI
[17341583.697494] EFLAGS: 00210286   (2.6.20 #3)
[17341583.697502] EIP is at __d_lookup+0x5d/0xd6
[17341583.697505] eax: c8d7c17e   ebx: 99e00750   ecx: 0011   edx: c17f9200
[17341583.697508] esi: 99e00750   edi: d2a10016   ebp: c7fe2304   esp: dba35d98
[17341583.697511] ds: 007b   es: 007b   ss: 0068
[17341583.697514] Process kdm_greet (pid: 22119, ti=dba34000 task=f52d4a70 
task.ti=dba34000)
[17341583.697516] Stack: c8d7c17e  dba35e10 f705d478 dba35db8 002c d2a10016 d2a10042 
[17341583.697522]dba35e10 dba35f30 dba35e10 c014ab6d dba35e1c c18c5240 dba35f04 c021877e 
[17341583.697528]d2a10042 dba35e10 c8d7c17e dba35f30 c014c38f d2a10016 0101 dba35e48 
[17341583.697534] Call Trace:

[17341583.697537]  [] do_lookup+0x1c/0x168
[17341583.697540]  [] xfs_vn_lookup+0x53/0x77
[17341583.697547]  [] __link_path_walk+0x6e8/0xb1b
[17341583.697551]  [] dput+0x18/0x121
[17341583.697554]  [] link_path_walk+0x43/0xb8
[17341583.697558]  [] do_path_lookup+0x75/0x181
[17341583.697561]  [] get_empty_filp+0x2f/0xe5
[17341583.697566]  [] __path_lookup_intent_open+0x45/0x80
[17341583.697570]  [] path_lookup_open+0x20/0x25
[17341583.697573]  [] open_namei+0x66/0x58a
[17341583.697576]  [] do_filp_open+0x25/0x40
[17341583.697580]  [] do_sys_open+0x3e/0xc7
[17341583.697584]  [] sys_open+0x1c/0x20
[17341583.697587]  [] syscall_call+0x7/0xb
[17341583.697591]  ===
[17341583.697593] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
[17341583.697621] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 0068:dba35d98

[17341583.697626]  <1>BUG: unable to handle kernel paging request at virtual 
address 99e00750
[17341648.066740]  printing eip:
[17341648.066786] c01531b0
[17341648.066868] *pde = 
[17341648.066916] Oops:  [#2]
[17341648.066965] CPU:0
[17341648.066966] EIP:0060:[]Not tainted VLI
[17341648.066967] EFLAGS: 00010286   (2.6.20 #3)
[17341648.067115] EIP is at __d_lookup+0x5d/0xd6
[17341648.067165] eax: 1efcce0e   ebx: 99e00750   ecx: 0011   edx: c17f9200
[17341648.067219] esi: 99e00750   edi: cc87901a   ebp: c7fe2304   esp: f7755f04
[17341648.067271] ds: 007b   es: 007b   ss: 0068
[17341648.067320] Process dpkg (pid: 24684, ti=f7754000 task=d9846a70 
task.ti=f7754000)
[17341648.067371] Stack: 1efcce0e 46dd3a20 f7755f5c e489fe28  0010 cc87901a  
[17341648.067715]e489fe28 0001 f7755f54 c014b7cb f7755f5c ef0d4098 ffd9 cc879000 
[17341648.068056]0001 f7755f54 c014cf84 f7755f54 e489fe28 c18c5240 1efcce0e 0010 
[17341648.068397] Call Trace:

[17341648.068482]  [] __lookup_hash+0x4a/0xef
[17341648.068563]  [] do_rmdir+0x69/0xbb
[17341648.068642]  [] syscall_call+0x7/0xb
[17341648.068724]  ===
[17341648.068770] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
[17341648.070874] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 0068:f7755f04

[17341648.070988]

I doubt I can reproduce it as it has happened after 180 days or so, and I 
am upgrading to 2.6.22.6 but I was wondering what exactly happened here?


Thanks!

Justin.

config-2.6.22.bz2
Description: Binary data


2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-17 Thread Justin Piszcz
Including the XFS mailing list in here too because it may be an XFS bug 
looking at the call trace.


System: Debian Testing
Kernel: 2.6.20
Config: Attached

I was running apt-get dist-upgrade as I always do to get the latest 
packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
the process went into D-state and I had to reboot.


The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
for an upgrade, but all of the options have been left unchanged.


Here is the *OOPS I captured via dmesg before I rebooted:

[16201055.214559] nfsd: last server has exited
[16201055.214566] nfsd: unexporting all filesystems
[17341583.697472] BUG: unable to handle kernel paging request at virtual 
address 99e00750
[17341583.697480]  printing eip:
[17341583.697482] c01531b0
[17341583.697484] *pde = 
[17341583.697488] Oops:  [#1]
[17341583.697491] CPU:0
[17341583.697493] EIP:0060:[c01531b0]Not tainted VLI
[17341583.697494] EFLAGS: 00210286   (2.6.20 #3)
[17341583.697502] EIP is at __d_lookup+0x5d/0xd6
[17341583.697505] eax: c8d7c17e   ebx: 99e00750   ecx: 0011   edx: c17f9200
[17341583.697508] esi: 99e00750   edi: d2a10016   ebp: c7fe2304   esp: dba35d98
[17341583.697511] ds: 007b   es: 007b   ss: 0068
[17341583.697514] Process kdm_greet (pid: 22119, ti=dba34000 task=f52d4a70 
task.ti=dba34000)
[17341583.697516] Stack: c8d7c17e  dba35e10 f705d478 dba35db8 002c d2a10016 d2a10042 
[17341583.697522]dba35e10 dba35f30 dba35e10 c014ab6d dba35e1c c18c5240 dba35f04 c021877e 
[17341583.697528]d2a10042 dba35e10 c8d7c17e dba35f30 c014c38f d2a10016 0101 dba35e48 
[17341583.697534] Call Trace:

[17341583.697537]  [c014ab6d] do_lookup+0x1c/0x168
[17341583.697540]  [c021877e] xfs_vn_lookup+0x53/0x77
[17341583.697547]  [c014c38f] __link_path_walk+0x6e8/0xb1b
[17341583.697551]  [c0153698] dput+0x18/0x121
[17341583.697554]  [c014c805] link_path_walk+0x43/0xb8
[17341583.697558]  [c014ca0a] do_path_lookup+0x75/0x181
[17341583.697561]  [c0145fda] get_empty_filp+0x2f/0xe5
[17341583.697566]  [c014d468] __path_lookup_intent_open+0x45/0x80
[17341583.697570]  [c014d517] path_lookup_open+0x20/0x25
[17341583.697573]  [c014d5db] open_namei+0x66/0x58a
[17341583.697576]  [c0143c35] do_filp_open+0x25/0x40
[17341583.697580]  [c0143c8e] do_sys_open+0x3e/0xc7
[17341583.697584]  [c0143d52] sys_open+0x1c/0x20
[17341583.697587]  [c0102920] syscall_call+0x7/0xb
[17341583.697591]  ===
[17341583.697593] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 47 8b 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
[17341583.697621] EIP: [c01531b0] __d_lookup+0x5d/0xd6 SS:ESP 0068:dba35d98

[17341583.697626]  1BUG: unable to handle kernel paging request at virtual 
address 99e00750
[17341648.066740]  printing eip:
[17341648.066786] c01531b0
[17341648.066868] *pde = 
[17341648.066916] Oops:  [#2]
[17341648.066965] CPU:0
[17341648.066966] EIP:0060:[c01531b0]Not tainted VLI
[17341648.066967] EFLAGS: 00010286   (2.6.20 #3)
[17341648.067115] EIP is at __d_lookup+0x5d/0xd6
[17341648.067165] eax: 1efcce0e   ebx: 99e00750   ecx: 0011   edx: c17f9200
[17341648.067219] esi: 99e00750   edi: cc87901a   ebp: c7fe2304   esp: f7755f04
[17341648.067271] ds: 007b   es: 007b   ss: 0068
[17341648.067320] Process dpkg (pid: 24684, ti=f7754000 task=d9846a70 
task.ti=f7754000)
[17341648.067371] Stack: 1efcce0e 46dd3a20 f7755f5c e489fe28  0010 cc87901a  
[17341648.067715]e489fe28 0001 f7755f54 c014b7cb f7755f5c ef0d4098 ffd9 cc879000 
[17341648.068056]0001 f7755f54 c014cf84 f7755f54 e489fe28 c18c5240 1efcce0e 0010 
[17341648.068397] Call Trace:

[17341648.068482]  [c014b7cb] __lookup_hash+0x4a/0xef
[17341648.068563]  [c014cf84] do_rmdir+0x69/0xbb
[17341648.068642]  [c0102920] syscall_call+0x7/0xb
[17341648.068724]  ===
[17341648.068770] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 47 8b 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
[17341648.070874] EIP: [c01531b0] __d_lookup+0x5d/0xd6 SS:ESP 0068:f7755f04

[17341648.070988]

I doubt I can reproduce it as it has happened after 180 days or so, and I 
am upgrading to 2.6.22.6 but I was wondering what exactly happened here?


Thanks!

Justin.

config-2.6.22.bz2
Description: Binary data


Re: 2.6.20 (XFS? related) crash after uptime of 180 days during apt-get dist-upgrade on Debian Testing

2007-09-17 Thread David Chinner
On Mon, Sep 17, 2007 at 01:20:17PM -0400, Justin Piszcz wrote:
 Including the XFS mailing list in here too because it may be an XFS bug 
 looking at the call trace.
 
 System: Debian Testing
 Kernel: 2.6.20
 Config: Attached
 
 I was running apt-get dist-upgrade as I always do to get the latest 
 packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
 the process went into D-state and I had to reboot.
 
 The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
 for an upgrade, but all of the options have been left unchanged.
 
 Here is the *OOPS I captured via dmesg before I rebooted:
 
 [16201055.214559] nfsd: last server has exited
 [16201055.214566] nfsd: unexporting all filesystems
 [17341583.697472] BUG: unable to handle kernel paging request at virtual 
 address 99e00750
 [17341583.697480]  printing eip:
 [17341583.697482] c01531b0
 [17341583.697484] *pde = 
 [17341583.697488] Oops:  [#1]
 [17341583.697491] CPU:0
 [17341583.697493] EIP:0060:[c01531b0]Not tainted VLI
 [17341583.697494] EFLAGS: 00210286   (2.6.20 #3)
 [17341583.697502] EIP is at __d_lookup+0x5d/0xd6
 [17341583.697505] eax: c8d7c17e   ebx: 99e00750   ecx: 0011   edx: 
 c17f9200
 [17341583.697508] esi: 99e00750   edi: d2a10016   ebp: c7fe2304   esp: 
 dba35d98
 [17341583.697511] ds: 007b   es: 007b   ss: 0068
 [17341583.697514] Process kdm_greet (pid: 22119, ti=dba34000 task=f52d4a70 
 task.ti=dba34000)
 [17341583.697516] Stack: c8d7c17e  dba35e10 f705d478 dba35db8 
 002c d2a10016 d2a10042 [17341583.697522]dba35e10 dba35f30 
 dba35e10 c014ab6d dba35e1c c18c5240 dba35f04 c021877e [17341583.697528] 
 d2a10042 dba35e10 c8d7c17e dba35f30 c014c38f d2a10016 0101 dba35e48 
 [17341583.697534] Call Trace:
 [17341583.697537]  [c014ab6d] do_lookup+0x1c/0x168
 [17341583.697540]  [c021877e] xfs_vn_lookup+0x53/0x77
 [17341583.697547]  [c014c38f] __link_path_walk+0x6e8/0xb1b
 [17341583.697551]  [c0153698] dput+0x18/0x121
 [17341583.697554]  [c014c805] link_path_walk+0x43/0xb8
 [17341583.697558]  [c014ca0a] do_path_lookup+0x75/0x181
 [17341583.697561]  [c0145fda] get_empty_filp+0x2f/0xe5
 [17341583.697566]  [c014d468] __path_lookup_intent_open+0x45/0x80
 [17341583.697570]  [c014d517] path_lookup_open+0x20/0x25
 [17341583.697573]  [c014d5db] open_namei+0x66/0x58a
 [17341583.697576]  [c0143c35] do_filp_open+0x25/0x40
 [17341583.697580]  [c0143c8e] do_sys_open+0x3e/0xc7
 [17341583.697584]  [c0143d52] sys_open+0x1c/0x20
 [17341583.697587]  [c0102920] syscall_call+0x7/0xb
 [17341583.697591]  ===
 [17341583.697593] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
 47 8b 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
 [17341583.697621] EIP: [c01531b0] __d_lookup+0x5d/0xd6 SS:ESP 
 0068:dba35d98
 [17341583.697626]  1BUG: unable to handle kernel paging request at 
 virtual address 99e00750
 [17341648.066740]  printing eip:
 [17341648.066786] c01531b0
 [17341648.066868] *pde = 
 [17341648.066916] Oops:  [#2]
 [17341648.066965] CPU:0
 [17341648.066966] EIP:0060:[c01531b0]Not tainted VLI
 [17341648.066967] EFLAGS: 00010286   (2.6.20 #3)
 [17341648.067115] EIP is at __d_lookup+0x5d/0xd6
 [17341648.067165] eax: 1efcce0e   ebx: 99e00750   ecx: 0011   edx: 
 c17f9200
 [17341648.067219] esi: 99e00750   edi: cc87901a   ebp: c7fe2304   esp: 
 f7755f04
 [17341648.067271] ds: 007b   es: 007b   ss: 0068
 [17341648.067320] Process dpkg (pid: 24684, ti=f7754000 task=d9846a70 
 task.ti=f7754000)
 [17341648.067371] Stack: 1efcce0e 46dd3a20 f7755f5c e489fe28  
 0010 cc87901a  [17341648.067715]e489fe28 0001 
 f7755f54 c014b7cb f7755f5c ef0d4098 ffd9 cc879000 [17341648.068056] 
 0001 f7755f54 c014cf84 f7755f54 e489fe28 c18c5240 1efcce0e 0010 
 [17341648.068397] Call Trace:
 [17341648.068482]  [c014b7cb] __lookup_hash+0x4a/0xef
 [17341648.068563]  [c014cf84] do_rmdir+0x69/0xbb
 [17341648.068642]  [c0102920] syscall_call+0x7/0xb
 [17341648.068724]  ===
 [17341648.068770] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
 47 8b 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
 [17341648.070874] EIP: [c01531b0] __d_lookup+0x5d/0xd6 SS:ESP 
 0068:f7755f04
 [17341648.070988]
 
 I doubt I can reproduce it as it has happened after 180 days or so, and I 
 am upgrading to 2.6.22.6 but I was wondering what exactly happened here?

No idea - it looks like dkpg was trying to remove a directory on the
same path the lookup was and both have gone splat in __d_lookup on
the same dentry. Something happened in  those 180 days that left a
landmine that was tripped over here, I think. I can't see any way of
tracking it down from this, but thanks for reporting it anyway,
Justin.

Cheers,

Dave.
--