Re: Major SATA / EXT3 Issue?
Hi all! (Please Cc)

I also have to report a very similar incident. Debian/sid, kernel 2.6.22. I was giving the disk some hard work (svn up of two big repositories, some copying of files, etc.). Suddenly the PC froze. Nothing worked; I had to reboot. But then:

- The BIOS didn't detect the disks, or rather, detection took extremely long.
- Booting into Linux gave the timeout messages already mentioned (I am away and cannot give you details until Sunday).
- Booting into Windows froze Windows when accessing the second hard disk (from which stuff was copied).
- Resetting the computer didn't help, but physically turning it off and on again did the trick, followed by some fsck-ing.
- Booting into Windows needed chkdsk from Windows, and several files were destroyed.

Both the disks and the computer are quite new and are NOT heavily used, only now and then. AFAIR it uses the nv SATA driver. I could repeat these problems with big copying actions.

The problem with logging is that the computer freezes hard and nothing remains in the log files.

Best wishes

Norbert

---
Dr. Norbert Preining <[EMAIL PROTECTED]>    Vienna University of Technology
Debian Developer <[EMAIL PROTECTED]>    Debian TeX Group
gpg DSA: 0x09C5B094    fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
---
JARROW (adj.)
An agricultural device which, when towed behind a tractor, enables the farmer to spread his dung evenly across the width of the road.
    --- Douglas Adams, The Meaning of Liff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Major SATA / EXT3 Issue?
First, let me thank you for your response, and apologize for taking so long to get back to you. I have not been able to "cut loose" from my day job for the past several days, as I have been completely tied up loading and configuring a dozen Thinkpads scheduled to be sacrificed next week.

With this failure there was no data to be collected from the time of the failure. As previously stated, the system "appeared" to lock up. I could not switch to an alternative tty, nor could I ssh into the system from one of my other boxes. Nothing showed in the logs at the time of the failure.

On subsequent reboots I received "scrolling" errors on both of the devices in question, of the type previously reported. This continued through every attempt to reboot.

Attempts to clean things up with an fsck always resulted in the following message:

fsck.ext2: No such file or directory while trying to open /dev/sdb1

(or /dev/sda1, depending on which device was being used)

I attempted to run gparted to see what, if any, file system the system thought might be lurking on these drives, but gparted would always hang at "scanning devices".

This morning I plugged the drives back in and brought the system up on a 2.6.24-rc1-git4 kernel and shazam! Both drives were seen by the system, and the only errors seen by fsck were a few unused / orphan inodes.

I am not sure where to go from here. I have been in the game long enough to be doubtful about multiple INDEPENDENT hardware failures, which is why, after "blowing off" the first dead SATA drive, I took another look when I had a second one fail in the same manner within a few days.

Both of the previously failed drives are up and running at this time. Understandably I am not ready to put "real data" back on them, but I will button up the case, bring the system back up, and come up with some way to flow "test data" between these two devices, looking for another failure.

Once again, thanks for your time.

Chris

On Sat, 2007-10-27 at 13:47 +0100, Alan Cox wrote:
> On Fri, 26 Oct 2007 21:21:52 -0500
> Chris Holvenstot <[EMAIL PROTECTED]> wrote:
>
> > My SATA controller is integrated on my MSI motherboard and sports four
> > ports. It is implemented using the Nvidia CK804 chipset. My processor
> > is an AMD64 X2 4600+ running the 32 bit version of Linux.
> >
> > I have had these drives up and running for about six months.
>
> You don't provide enough information to even guess. The whole of the
> relevant part of the ata messages would be a lot more useful than the
> cutting/pasting of bits you've provided.
>
> The BIOS finding the drive is indicating it responds to identify and the
> basic commands (and usually means your cabling is fine), but really the
> rest of the trace is needed to see what occurs next. If the drive has
> gone then the partition table read will fail and you won't
> get /dev/sdb[something] for it.
Re: Major SATA / EXT3 Issue?
On Wed, Oct 31, 2007 at 03:25:54AM -0400, Theodore Tso wrote:
> On Tue, Oct 30, 2007 at 03:59:24PM +0200, Heikki Orsila wrote:
> > 3. fsck -p on boot failed
> >
> > (it is very probable not many files were corrupted at this stage)
>
> Maybe...

The system wouldn't have worked as it did if there had been so many broken files before fsck. For example, the system booted properly before fsck. After fsck, grub was broken. So most of the files that were corrupted were unrelated to writes that happened that night. For example, many files in the .deb cache under /var were corrupted, but it has been a long time since those were touched.

> > 4. I ran fsck.ext3 -y
> >
> > => that corrupted lots and lots of files. This went
> > into a loop, the fsck.ext3 restarted checking over and over again.
>
> It's possible that e2fsck corrupted the files, but they also could
> have been corrupted earlier by the earlier I/O errors.

At least there were no I/O errors in the log files before that night. The only I/O error that happened put the filesystem into RO state.

> There are some
> relatively rare filesystem corruptions which e2fsck doesn't handle as
> gracefully as it should, though, I will admit. I would need to see
> the e2fsck transcript to be sure.

Unfortunately, I do not have it.

> Were there any messages about needing to relocate inode tables by any
> chance?

I don't recall. I recall seeing messages about "too many blocks inside inode" and "clearing inodes". There were hundreds or thousands of these messages.

--
Heikki Orsila        Barbie's law:
[EMAIL PROTECTED]    "Math is hard, let's go shopping!"
http://www.iki.fi/shd
Re: Major SATA / EXT3 Issue?
On Tue, Oct 30, 2007 at 03:59:24PM +0200, Heikki Orsila wrote:
> 3. fsck -p on boot failed
>
> (it is very probable not many files were corrupted at this stage)

Maybe...

> 4. I ran fsck.ext3 -y
>
> => that corrupted lots and lots of files. This went
> into a loop, the fsck.ext3 restarted checking over and over again.

It's possible that e2fsck corrupted the files, but they also could have been corrupted earlier by the earlier I/O errors. There are some relatively rare filesystem corruptions which e2fsck doesn't handle as gracefully as it should, though, I will admit. I would need to see the e2fsck transcript to be sure. Were there any messages about needing to relocate inode tables by any chance?

- Ted
Re: Major SATA / EXT3 Issue?
On Fri, Oct 26, 2007 at 09:21:52PM -0500, Chris Holvenstot wrote:
> In each case the failure mode appears to have been the same - the system
> appears to lock up. When rebooted I get a long string of messages like:
>
> Oct 26 20:07:37 localhost kernel: [ 101.581091] ata2: timeout waiting for ADMA IDLE, stat=0x440
> Oct 26 20:07:37 localhost kernel: [ 101.581096] sd 1:0:0:0: [sda] Write Protect is off
> Oct 26 20:07:37 localhost kernel: [ 101.581174] res 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)
> Oct 26 20:07:37 localhost kernel: [ 101.644992] ata2.00: configured for UDMA/33
> Oct 26 20:07:37 localhost kernel: [ 101.644994] ata2: EH complete
> Oct 26 20:07:37 localhost kernel: [ 101.645006] sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

As it turns out, our organization's version control server just blew up last night. The filesystem is rather corrupt at the moment; fortunately we have backups.

We are running 2.6.22.5 (vanilla) on Debian 4.0. The machine has two SATA devices in a Linux software RAID mirror (mdadm) on an Intel chipset using ata_piix. The filesystem is EXT3.

1. This happened during nightly backup:

init_special_inode: bogus i_mode (17)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting port
ata1.00: revalidation failed (errno=-2)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
init_special_inode: bogus i_mode (17)
init_special_inode: bogus i_mode (17)
init_special_inode: bogus i_mode (17)
journal_bmap: journal block not found at offset 5133 on md2
Aborting journal on device md2.
ext3_abort called.
EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data

2. I rebooted

3. fsck -p on boot failed

(it is very probable not many files were corrupted at this stage)

4. I ran fsck.ext3 -y

=> that corrupted lots and lots of files. This went into a loop, the fsck.ext3 restarted checking over and over again.

Heikki Orsila
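The "res" lines in the traces above pack the device's status and error registers into the first two hex bytes, followed by the Emask value and a short reason in parentheses. As a reading aid, here is a minimal Python sketch that pulls those fields out of such a line and names the status bits. The helper names (`decode_status`, `parse_res_line`) are made up for illustration, and the bit names follow the classic ATA status register layout; newer revisions of the spec rename or obsolete some of the low bits, so treat the labels as an assumption rather than a definitive decode.

```python
import re

# Classic ATA status register bit names (assumption: older ATA naming;
# newer spec revisions rename or obsolete some of the low bits).
STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

# Matches e.g. "res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)"
RES_RE = re.compile(
    r"res\s+(?P<status>[0-9a-f]{2})/(?P<error>[0-9a-f]{2}):\S*\s+"
    r"Emask\s+(?P<emask>0x[0-9a-f]+)\s+\((?P<reason>[^)]*)\)",
    re.IGNORECASE,
)


def decode_status(value):
    """Return the names of the bits set in a status byte, highest bit first."""
    return [name for bit, name in STATUS_BITS.items() if value & bit]


def parse_res_line(line):
    """Extract status, error, Emask and reason from a 'res' line, or None."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    return {
        "status": status,
        "status_bits": decode_status(status),
        "error": int(match.group("error"), 16),
        "emask": int(match.group("emask"), 16),
        "reason": match.group("reason"),
    }


if __name__ == "__main__":
    for sample in (
        "res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
        "res 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)",
    ):
        print(parse_res_line(sample))
```

Read this way, the timeout case reports only DRDY with no error bits, while the "device error" case has ERR and DF set in the status byte and 0x04 in the error register. The sketch only reports what the line contains; it does not say why the device ended up in that state.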
Re: Major SATA / EXT3 Issue?
Chris Holvenstot wrote:
> I am curious if anyone else has had major problems with SATA drives on
> the current series of kernels.
>
> I have (or rather had) two SATA drives on my system - the first was a
> Maxtor MaxLine 500 and the second was a Maxtor MaxLine 250. Both of
> these drives were plugged to the 1.5 Gigabit / second mode.
>
> My SATA controller is integrated on my MSI motherboard and sports four
> ports. It is implemented using the Nvidia CK804 chipset. My processor
> is an AMD64 X2 4600+ running the 32 bit version of Linux.
>
> I have had these drives up and running for about six months.
>
> The first drive "failed" about 10 days ago - and unfortunately I focused
> on hardware error, and after several attempts to get the drive back
> online I physically pulled it from the system. This drive was used for
> backups and thus was not critical to day-to-day operations.
>
> However, tonight I "lost" a second SATA drive; this one I use on a daily
> basis for my kernel build and test processes. It failed in the same
> manner as the first, which makes me a little suspicious.
>
> The first drive "failed" while I was running a modified Ubuntu 7.04
> system. Because I focused on hardware as the reason for the failure I
> did not collect specific information about the version of the kernel
> being used, but it was likely to be 2.6.24-git8.
>
> The second drive "failed" tonight on what is, except for the kernel, a
> fairly standard Ubuntu 7.10 system (the same hardware - I upgraded my OS
> this past week) - the kernel in use tonight at the time of the second
> failure was 2.6.24-rc1-git1
>
> In each case the failure mode appears to have been the same - the system
> appears to lock up. When rebooted I get a long string of messages like:
>
> Oct 26 20:07:37 localhost kernel: [ 101.581091] ata2: timeout waiting for ADMA IDLE, stat=0x440
> Oct 26 20:07:37 localhost kernel: [ 101.581096] sd 1:0:0:0: [sda] Write Protect is off
> Oct 26 20:07:37 localhost kernel: [ 101.581174] res 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)
> Oct 26 20:07:37 localhost kernel: [ 101.644992] ata2.00: configured for UDMA/33
> Oct 26 20:07:37 localhost kernel: [ 101.644994] ata2: EH complete
> Oct 26 20:07:37 localhost kernel: [ 101.645006] sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

You should try to get some output from dmesg and not from the messages log, as the log daemon seems to have a nasty habit of discarding critical output from these errors. In this case the failing command is missing and the message ordering even seems off.

> The hardware appears to be correctly identified by the BIOS during the
> power up sequence.
>
> Not much is seen in the dmesg log except for:
>
> [ 43.649673] scsi0 : sata_nv
> [ 43.649722] scsi1 : sata_nv
> [ 43.649776] ata1: SATA max UDMA/133 cmd 0x9f0 ctl 0xbf0 bmdma 0xcc00 irq 19
> [ 43.649778] ata2: SATA max UDMA/133 cmd 0x970 ctl 0xb70 bmdma 0xcc08 irq 19

There should be more than this at the very least. As above, please try to get output from dmesg itself.

> When I try to run a file system check on these devices I get:
>
> e2fsck 1.40.2 (12-Jul-2007)
> fsck.ext2: No such file or directory while trying to open /dev/sdb1
>
> The superblock could not be read or does not describe a correct ext2
> filesystem. If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>     e2fsck -b 8193
>
> I have a gut feeling that when the system appears to lock up what is
> really going on is that the contents of the drive are being trashed.
> But I have no proof of that.

I don't think that is the case; it is more likely that the drives have not been detected at all. If this happens after a reboot when they were working before, that most likely points to some kind of hardware issue.

> When I try to do a parted to see what the system thinks is on the drive
> I get the error message:
>
> Error: Error opening /dev/sdb: No medium found
>
> I am not having any problems with my EXT3 file systems located on
> "standard" IDE / PATA drives.
>
> My config file, which has not changed in months beyond taking the
> defaults during make oldconfig looks like:

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
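Once the full dmesg output has been captured to a file, it helps to cut it down to the storage-related lines before posting. The following is a minimal Python sketch of such a filter, assuming the output was saved as text beforehand; the function name `storage_lines` and the patterns it looks for (ataN, sd H:C:T:L, scsiN and the "res" continuation lines seen in this thread) are illustrative choices, not an exhaustive list of what libata can print.

```python
import re

# Prefixes seen in the traces posted in this thread (illustrative, not exhaustive).
STORAGE_RE = re.compile(
    r"(\bata\d+(\.\d+)?: |\bsd \d+:\d+:\d+:\d+: |\bscsi\d+ : |\bres [0-9a-f]{2}/)"
)


def storage_lines(dmesg_text):
    """Return the lines of a saved dmesg dump that mention ATA/SCSI devices."""
    return [line for line in dmesg_text.splitlines() if STORAGE_RE.search(line)]


SAMPLE = """\
[ 12.000000] usb 1-1: new high speed USB device
[ 43.649673] scsi0 : sata_nv
[ 43.649776] ata1: SATA max UDMA/133 cmd 0x9f0 ctl 0xbf0 bmdma 0xcc00 irq 19
[ 101.581091] ata2: timeout waiting for ADMA IDLE, stat=0x440
[ 101.581096] sd 1:0:0:0: [sda] Write Protect is off
[ 101.581174] res 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)
"""

if __name__ == "__main__":
    for line in storage_lines(SAMPLE):
        print(line)
```

The USB line in the sample is made up to show something being filtered out; the remaining lines are taken from the messages quoted above. A filter like this is only for skimming. When posting to the list, the whole relevant section of the trace is still what is being asked for.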
Re: Major SATA / EXT3 Issue?
Chris Holvenstot wrote:
> I am curious if anyone else has had major problems with SATA drives on
> the current series of kernels.
>
> I have (or rather had) two SATA drives on my system - the first was a
> Maxtor MaxLine 500 and the second was a Maxtor MaxLine 250.
> [...]

If hard disks get too hot, they die pretty quickly. Current hard disks usually support SMART. Ask the disks what they think is wrong:
http://smartmontools.sourceforge.net/

Then check the disks with different hardware... I have seen several issues with bad cables and broken hardware.

If you think that the problem is kernel related, you can boot some other distro or change the kernel to see whether the problem persists.

Please post the dmesg output if you are sure the problem is kernel related...

Regards,
--
Clemens Koller
___
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Str. 45/1
81379 Muenchen
Germany
http://www.anagramm-technology.com
Phone: +49-89-741518-50
Fax: +49-89-741518-19
Re: Major SATA / EXT3 Issue?
On Fri, 26 Oct 2007 21:21:52 -0500
Chris Holvenstot <[EMAIL PROTECTED]> wrote:

> My SATA controller is integrated on my MSI motherboard and sports four
> ports. It is implemented using the Nvidia CK804 chipset. My processor
> is an AMD64 X2 4600+ running the 32 bit version of Linux.
>
> I have had these drives up and running for about six months.

You don't provide enough information to even guess. The whole of the relevant part of the ata messages would be a lot more useful than the cutting/pasting of bits you've provided.

The BIOS finding the drive is indicating it responds to identify and the basic commands (and usually means your cabling is fine), but really the rest of the trace is needed to see what occurs next. If the drive has gone then the partition table read will fail and you won't get /dev/sdb[something] for it.
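The last point, that a drive which has gone away yields no /dev/sdb[something] nodes, can be checked against the kernel's own view in /proc/partitions. Below is a minimal Python sketch that lists the partitions the kernel has registered for a given disk, working on the text of that file. The function name `partitions_of` and the sample table (including its block counts) are made up for illustration.

```python
import re


def partitions_of(proc_partitions_text, disk):
    """Return partition names (e.g. sdb1) listed for `disk` in /proc/partitions text."""
    pattern = re.compile(r"^%s\d+$" % re.escape(disk))
    found = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        if len(fields) == 4 and fields[0].isdigit() and pattern.match(fields[3]):
            found.append(fields[3])
    return found


# Hypothetical table: sda has two partitions, sdb shows up with none.
SAMPLE = """\
major minor  #blocks  name

   8        0  244198584 sda
   8        1   20480000 sda1
   8        2  223718000 sda2
   8       16  488386584 sdb
"""

if __name__ == "__main__":
    print(partitions_of(SAMPLE, "sda"))
    print(partitions_of(SAMPLE, "sdb"))
```

On a live system the text would come from reading /proc/partitions. An empty result for a disk that used to have partitions fits the "No such file or directory while trying to open /dev/sdb1" error reported earlier in the thread, but on its own it does not distinguish a dead drive from a detection problem.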