Bug#292290: kernel-image-2.6.8-1-k7: XFS filesystem corruption: Input/output error

Loïc Minier Thu, 03 Feb 2005 07:38:11 -0800

        Hi,

 This is a follow-up to Debian bug <http://bugs.debian.org/292290>.


Joost van Baal <[EMAIL PROTECTED]> - Wed, Jan 26, 2005:
>   `./lib/modules/2.6.10-1-k7/kernel/drivers/atm/zatm.ko': Unknown error 990
> I've heard of one other victim of this problem with this kernel.
Wessel Dankers <[EMAIL PROTECTED]> - Thu, Jan 27, 2005:
> I myself have been a victim of this too, so I thought I'd join in.

 Well, me too.

> - the kernel was Debian's 2.6.8;
> - the filesystem in question was XFS;
> - software raid1 (mirroring) was used.
> XFS complained about corrupted in-memory structures in some of the cases.
> However, it is very unlikely that all three machines have bad RAM, and
> memtest86+ reports no problems.

 I am also using Debian's kernel-image-2.6.8-2-686 in Version 2.6.8-13.

 First of all, I'm using a Pentium 4, so this isn't K7-specific.  I am
 NOT using RAID 1 or LVM, just plain XFS.

 The first corruption appeared in my "mail/debian-project/" folder,
 specifically in the "tmp/" subdirectory.  The second appeared today, on
 ./usr/share/doc/texmf/help/Catalogue/entries/romannum.html:
 dpkg: error processing
 /var/cache/apt/archives/tetex-doc_2.0.2c-6_all.deb (--unpack):
  unable to stat
  `./usr/share/doc/texmf/help/Catalogue/entries/romannum.html' (which I
  was about to install): Unknown error 990

 This seems to be a really serious XFS problem.

 To understand the problem, I tried stracing:
 bee% LC_ALL=C strace -f ls debian-project-fucked/tmp 2>&1
 ...
 rt_sigprocmask(SIG_UNBLOCK, [RTMIN], NULL, 8) = 0
 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) =
 0
 brk(0)                                  = 0x805b000
 brk(0x807c000)                          = 0x807c000
 brk(0)                                  = 0x807c000
 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
 ...}) = 0
 ioctl(1, TIOCGWINSZ, {ws_row=24, ws_col=80, ws_xpixel=644,
 ws_ypixel=388}) = 0
 stat64("debian-project-fucked/tmp", {st_mode=0, st_size=0, ...}) = -990
 write(2, "ls: ", 4ls: )                     = 4
 write(2, "debian-project-fucked/tmp", 25debian-project-fucked/tmp) = 25
 write(2, ": Unknown error 990", 19: Unknown error 990)     = 19
 write(2, "\n", 1
 )                       = 1
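
 strace normally prints a failed call as "-1 ENAME (description)"; in
 the trace above the return value is printed as a bare "-990" instead.
 A minimal sketch for picking such lines out of a saved trace follows.
 The function name and the regular expression are illustrative
 assumptions, and lines prefixed with "[pid N]" (as produced by
 strace -f for child processes) are not handled.

```python
import re

# Matches a strace line whose return value is a bare negative number, e.g.
#   stat64("path", {...}) = -990
# Ordinary failures ("= -1 ENOENT (...)") are deliberately not matched.
RAW_NEGATIVE = re.compile(r'^\s*(?P<name>\w+)\(.*\)\s*=\s*(?P<ret>-\d+)\s*$')


def raw_negative_returns(lines):
    """Yield (syscall_name, return_value) for lines with a bare negative return."""
    for line in lines:
        match = RAW_NEGATIVE.match(line)
        if match:
            yield match.group("name"), int(match.group("ret"))


# Three lines copied from the trace above.
sample = [
    ' brk(0)                                  = 0x805b000',
    ' stat64("debian-project-fucked/tmp", {st_mode=0, st_size=0, ...}) = -990',
    ' write(2, "ls: ", 4ls: )                     = 4',
]

print(list(raw_negative_returns(sample)))  # [('stat64', -990)]
```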

 The problem seems to occur in the stat64() syscall, but I couldn't
 find out what error 990 is supposed to be in the /usr/include headers,
 so I moved on to the kernel source and looked at the various syscall
 implementations.  I also tried to work out which syscalls could
 trigger the problem:
   I checked with:
    bee% LC_ALL=C strace zsh -e -c "cd debian-project-fucked/tmp; ls"
 and got the error with a chdir() too, and hence looked at sys_chdir().
   Then I checked whether this was directory specific, and tried:
 bee% LC_ALL=C strace -f ls -i \
     /usr/share/doc/texmf/help/Catalogue/entries/ 2>&1
 I got errors on a bunch of files, in lstat64().
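
 The same check can be done without reading strace output.  The sketch
 below lists a directory and calls lstat() on every entry, collecting
 the ones that fail together with the errno reported.  The function
 name is hypothetical, and this only illustrates the check; it is not
 a tool that was used for this report.

```python
import os


def find_unstatable(directory):
    """Return a list of (path, errno) for entries whose lstat() fails."""
    failures = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            os.lstat(path)  # same family of call that failed in the trace
        except OSError as exc:
            failures.append((path, exc.errno))
    return failures


# Example: check the current directory and print any failing entries.
for path, err in find_unstatable("."):
    print(f"{path}: errno {err}")
```

 On an affected directory, the errno column would be expected to show
 990 for the damaged entries.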

 Then I looked upstream, first at bugme.osdl.org, and found:
 http://bugme.osdl.org/show_bug.cgi?id=3224 (still open)

 Finally, I looked at SGI's bugzilla, and found a first bug bubble:
 http://oss.sgi.com/bugzilla/show_bug.cgi?id=197
 The problem also seems to appear in a comment of:
 http://oss.sgi.com/bugzilla/show_bug.cgi?id=383

 197 is really worth reading, and using MD / LVM devices seems to help
 trigger the bug.

 These are dups of the above:
 http://oss.sgi.com/bugzilla/show_bug.cgi?id=204
 http://oss.sgi.com/bugzilla/show_bug.cgi?id=207

 The final patch attached to the bug report is:
 http://oss.sgi.com/bugzilla/attachment.cgi?id=59&action=view

 I couldn't find an applied version of it in the kernel: the code looks
 too different, although xfs_finish_reclaim_all() is there...

 2.6.8 was released in August 2004, and the patch mentioned dates from
 January 2003, so I can only think we are facing a different bug.


 Then I went thoroughly through the bugzilla and found another bug which
 might be related:
 http://oss.sgi.com/bugzilla/show_bug.cgi?id=338 is on a 2.4 kernel


 When I found out error 990 means EFSCORRUPTED, I thought I wouldn't be
 able to track down the problem any further...
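
 The "Unknown error 990" message follows from 990 not being in the
 standard errno table.  A small sketch of a lookup that falls back to
 an extra table is shown below.  The extra table holds only what is
 stated above (990 is EFSCORRUPTED); the helper name and the table name
 are assumptions made for illustration.

```python
import errno
import os

# Assumption: this holds only the one value named in this report.
# It is not part of the standard errno list.
XFS_ERRNOS = {990: "EFSCORRUPTED"}


def describe_errno(code):
    """Return a readable name for an errno value, standard or from XFS_ERRNOS."""
    if code in errno.errorcode:
        return f"{errno.errorcode[code]} ({os.strerror(code)})"
    if code in XFS_ERRNOS:
        return f"{XFS_ERRNOS[code]} (not in the standard errno table)"
    return f"unknown errno {code}"


print(describe_errno(990))
print(describe_errno(errno.EIO))
```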

 So I'm about to get a fresh xfsprogs or a live CD and xfs_repair my FS
 to get a log and send it upstream.

   Regards,

-- 
Loïc Minier <[EMAIL PROTECTED]>
"Neutral President: I have no strong feelings one way or the other."
