Re: Major SATA / EXT3 Issue?

2007-11-01 Thread Norbert Preining

Hi all!

(Please Cc)

I also have to report a very similar incident. Debian/sid, kernel 2.6.22.
The disk was doing some hard work (svn up of two big repositories, some
copying of files, etc.).

Suddenly the PC froze. Nothing worked, so I had to reboot. But then:
- The BIOS didn't detect the disks, or rather, detection took extremely long.
- Booting into Linux gave the timeout messages already mentioned (I
  am away and cannot give you details until Sunday).
- Booting into Windows froze Windows when accessing the second hard disk
  (from which stuff was copied).
- Resetting the computer didn't help, but physically turning it off and
  on again did the trick, after some fsck-ing.
- Booting into Windows then needed chkdsk, and several files were
  destroyed.

Both the disks and the computer are quite new, and are NOT heavily used,
only now and then.

AFAIR the machine uses the nv SATA driver.

I could reproduce these problems with big copy operations.

The problem with logging is that the computer freezes hard and nothing
remains in the log files.

Best wishes

Norbert

---
Dr. Norbert Preining <[EMAIL PROTECTED]> Vienna University of Technology
Debian Developer <[EMAIL PROTECTED]> Debian TeX Group
gpg DSA: 0x09C5B094  fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094
---
JARROW (adj.)
An agricultural device which, when towed behind a tractor, enables the
farmer to spread his dung evenly across the width of the road.
--- Douglas Adams, The Meaning of Liff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Major SATA / EXT3 Issue?

2007-10-31 Thread Chris Holvenstot

First, let me thank you for your response, and apologize for taking so
long to get back to you.  I have not been able to "cut loose" from my
day job for the past several days as I have been completely tied up
loading and configuring a dozen Thinkpads scheduled to be sacrificed
next week.

With this failure there was no data to be collected from the time of the
failure - as previously stated, the system "appeared" to lock up.  I
could not switch to an alternative tty nor could I ssh into the system
from one of my other boxes.

Nothing showed in the logs at the time of the failure.

On subsequent reboots I received "scrolling" errors on both of the
devices in question of the type previously reported.  This continued
through every attempt to reboot.

Attempts to clean things up with an fsck always resulted in the
following message

fsck.ext2: No such file or directory while trying to open /dev/sdb1

(or /dev/sda1 dependent on which device was being used)

I attempted to run gparted to see what if any file system the system
thought might be lurking on these drives, but gparted would always hang
up "scanning devices"

This morning I plugged the drives back in and brought the system up on a
2.6.24-rc1-git4 kernel and shazamm!  Both drives were seen by the system
and the only errors seen by fsck were a few unused / orphan inodes.  

I am not sure where to go from here - I have been in the game long
enough to be doubtful about multiple INDEPENDENT hardware failures -
which is why after "blowing off" the first dead SATA drive, I took
another look when I had a second one fail in the same manner within a
few days.

Both of the previously failed drives are up and running at this time.
Understandably I am not ready to put "real data" back on them but I will
button up the case, bring the system back up, and come up with some way
to flow "test data" between these two devices looking for another
failure.
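
One possible way to flow "test data" between the two drives is to write
pseudo-random files on one, copy them to the other, and compare checksums.
The Python sketch below is only an illustration: the function names and the
two directory arguments (assumed to be mount points on the two drives) are
assumptions, not something described in this thread.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir, dst_dir, file_count=4, file_size=1 << 20):
    """Write random files under src_dir, copy each one to dst_dir, and
    return the names of the files whose checksums differ after the copy."""
    os.makedirs(src_dir, exist_ok=True)
    os.makedirs(dst_dir, exist_ok=True)
    mismatches = []
    for index in range(file_count):
        name = "testdata_%04d.bin" % index
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        with open(src, "wb") as handle:
            handle.write(os.urandom(file_size))
            handle.flush()
            os.fsync(handle.fileno())  # push the data out to the drive
        shutil.copyfile(src, dst)
        if sha256_of(src) != sha256_of(dst):
            mismatches.append(name)
    return mismatches
```

Reads made right after a write may be served from the page cache, so a run
with a total data size larger than RAM would exercise the drives more
honestly than a few small files.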

Once again thanks for your time.

Chris

On Sat, 2007-10-27 at 13:47 +0100, Alan Cox wrote:
> On Fri, 26 Oct 2007 21:21:52 -0500
> Chris Holvenstot <[EMAIL PROTECTED]> wrote:
> 
> > My SATA controller is integrated on my MSI motherboard and sports four
> > ports.  It is implemented using the Nvidia CK804 chipset.  My processor
> > is an AMD64 X2 4600+ running the 32 bit version of Linux.  
> > 
> > I have had these drives up and running for about six months.
> 
> You don't provide enough information to even guess. The whole of the
> relevant part of the ata messages would be a lot more useful than the
> cutting/pasting of bits you've provided.

> The BIOS finding the drive is indicating it responds to identify and the
> basic commands (and usually means your cabling is fine), but really the
> rest of the trace is needed to see what occurs next. If the drive has
> gone then the partition table read will fail and you won't
> get /dev/sdb[something] for it.
> 


Re: Major SATA / EXT3 Issue?

2007-10-31 Thread Heikki Orsila

On Wed, Oct 31, 2007 at 03:25:54AM -0400, Theodore Tso wrote:
> On Tue, Oct 30, 2007 at 03:59:24PM +0200, Heikki Orsila wrote:
> > 3. fsck -p on boot failed
> > 
> > (it is very probable not many files were corrupted at this stage)
> 
> Maybe...

The system wouldn't have worked as it did if so many files had been
broken before fsck. For example, the system booted properly before
fsck. After fsck, grub was broken. So most of the files that were
corrupted were unrelated to writes that happened that night. For
example, many files in the .deb cache under /var were corrupted, but it
has been a long time since those were touched.

> > 4. I ran fsck.ext3 -y
> > 
> > => that corrupted lots and lots of files. This went 
> > into a loop, the fsck.ext3 restarted checking over and over again.
> 
> It's possible that e2fsck corrupted the files, but they also could
> have been corrupted earlier by the earlier I/O errors.

At least there were no I/O errors in the log files before that night. The
only I/O error that happened put the filesystem into read-only state.

> There are some
> relatively rare filesystem corruptions which e2fsck doesn't handle as
> gracefully as it should, though, I will admit.  I would need to see
> the e2fsck transcript to be sure. 

Unfortunately, I do not have it.

> Were there any messages about needing to relocate inode tables by any
> chance?

I don't recall. I recall seeing messages about "too many blocks 
inside inode" and "clearing inodes". There were hundreds or 
thousands of these messages.

-- 
Heikki Orsila   Barbie's law:
[EMAIL PROTECTED]   "Math is hard, let's go shopping!"
http://www.iki.fi/shd

Re: Major SATA / EXT3 Issue?

2007-10-30 Thread Theodore Tso

On Tue, Oct 30, 2007 at 03:59:24PM +0200, Heikki Orsila wrote:
> 3. fsck -p on boot failed
> 
> (it is very probable not many files were corrupted at this stage)

Maybe...
> 
> 4. I ran fsck.ext3 -y
> 
> => that corrupted lots and lots of files. This went 
> into a loop, the fsck.ext3 restarted checking over and over again.

It's possible that e2fsck corrupted the files, but they also could
have been corrupted earlier by the earlier I/O errors.  There are some
relatively rare filesystem corruptions which e2fsck doesn't handle as
gracefully as it should, though, I will admit.  I would need to see
the e2fsck transcript to be sure. 

Were there any messages about needing to relocate inode tables by any
chance?

- Ted

Re: Major SATA / EXT3 Issue?

2007-10-30 Thread Heikki Orsila

On Fri, Oct 26, 2007 at 09:21:52PM -0500, Chris Holvenstot wrote:
> In each case the failure mode appears to have been the same - the system
> appears to lock up. When rebooted I get a long string of messages like:
> 
> 
> Oct 26 20:07:37 localhost kernel: [ 101.581091] ata2: timeout waiting
> for ADMA IDLE, stat=0x440
> 
> Oct 26 20:07:37 localhost kernel: [ 101.581096] sd 1:0:0:0: [sda] Write
> Protect is off
> 
> Oct 26 20:07:37 localhost kernel: [ 101.581174] res
> 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)
> 
> Oct 26 20:07:37 localhost kernel: [ 101.644992] ata2.00: configured for
> UDMA/33
> 
> Oct 26 20:07:37 localhost kernel: [ 101.644994] ata2: EH complete
> 
> Oct 26 20:07:37 localhost kernel: [ 101.645006] sd 1:0:0:0: [sda] Write
> cache: disabled, read cache: enabled, doesn't support DPO or FUA

As it turns out, our organization's version control server just blew up
last night. The filesystem is rather corrupt at the moment; fortunately
we have backups.

We are running 2.6.22.5 (vanilla) on Debian 4.0. It has two SATA devices
in a Linux software RAID mirror (mdadm) on an Intel chipset (ata_piix
driver). The filesystem is EXT3.

1. This happened during nightly backup:

init_special_inode: bogus i_mode (17)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting port
ata1.00: revalidation failed (errno=-2)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
init_special_inode: bogus i_mode (17)
init_special_inode: bogus i_mode (17)
init_special_inode: bogus i_mode (17)
journal_bmap: journal block not found at offset 5133 on md2
Aborting journal on device md2.
ext3_abort called.
EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data

2. I rebooted

3. fsck -p on boot failed

(it is very probable not many files were corrupted at this stage)

4. I ran fsck.ext3 -y

=> that corrupted lots and lots of files. This went 
into a loop, the fsck.ext3 restarted checking over and over again.
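
For anyone reading "res" lines like the ones in this thread: in libata's
error output the result taskfile begins with the ATA Status register,
followed by the Error register. The Python sketch below decodes those two
bytes using the classic ATA bit names. The function names are made up for
illustration and this is not part of any kernel tool.

```python
import re

# Classic ATA Status register bits (bit 4 is DSC, "seek complete", on older drives).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
               (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR")]

# Classic ATA Error register bits.
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
              (0x08, "MCR"), (0x04, "ABRT"), (0x02, "NM"), (0x01, "AMNF")]

# "res SS/EE:..." where SS is the Status byte and EE the Error byte.
RES_PATTERN = re.compile(r"res\s+([0-9a-fA-F]{2})/([0-9a-fA-F]{2}):")


def bit_names(value, table):
    """Return the names of all bits from table that are set in value."""
    return [name for mask, name in table if value & mask]


def decode_res(line):
    """Decode the Status and Error bytes of a libata 'res' line.

    Returns (status_bit_names, error_bit_names), or None if the line
    contains no 'res' taskfile.
    """
    match = RES_PATTERN.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return bit_names(status, STATUS_BITS), bit_names(error, ERROR_BITS)
```

For the "res 40/00:..." line in step 1 this yields DRDY with no error bits
set. For the "res 71/04:..." line quoted from Chris's report it yields DRDY,
DF, DSC and ERR in the Status byte and ABRT in the Error byte.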

Heikki Orsila

Re: Major SATA / EXT3 Issue?

2007-10-28 Thread Robert Hancock


Chris Holvenstot wrote:

> I am curious if anyone else has had major problems with SATA drives on
> the current series of kernels.  I have (or rather had) two SATA drives
> on my system - the first was a Maxtor MaxLine 500 and the second was a
> Maxtor MaxLine 250.
>
> Both of these drives were set to the 1.5 Gigabit / second mode.
>
> My SATA controller is integrated on my MSI motherboard and sports four
> ports.  It is implemented using the Nvidia CK804 chipset.  My processor
> is an AMD64 X2 4600+ running the 32 bit version of Linux.
>
> I have had these drives up and running for about six months.
>
> The first drive "failed" about 10 days ago - and unfortunately I focused
> on hardware error and after several attempts to get the drive back
> online I physically pulled it from the system.  This drive was used for
> backups and thus was not critical to day-to-day operations.
>
> However, tonight I "lost" a second SATA drive, this one I use on a daily
> basis for my kernel build and test processes.  It failed in the same
> manner as the first, which makes me a little suspicious.
>
> The first drive “failed” while I was running a modified Ubuntu 7.04
> system. Because I focused on hardware as the reason for the failure I
> did not collect specific information about the version of the kernel
> being used, but it was likely to be 2.6.24-git8.
>
> The second drive “failed” tonight on what is, except for the kernel, a
> fairly standard Ubuntu 7.10 system (the same hardware - I upgraded my OS
> this past week) – the kernel in use tonight at the time of the second
> failure was 2.6.24-rc1-git1
>
> In each case the failure mode appears to have been the same – the system
> appears to lock up. When rebooted I get a long string of messages like:
>
> Oct 26 20:07:37 localhost kernel: [ 101.581091] ata2: timeout waiting
> for ADMA IDLE, stat=0x440
>
> Oct 26 20:07:37 localhost kernel: [ 101.581096] sd 1:0:0:0: [sda] Write
> Protect is off
>
> Oct 26 20:07:37 localhost kernel: [ 101.581174] res
> 71/04:08:00:00:00/04:00:1d:00:00/e0 Emask 0x1 (device error)
>
> Oct 26 20:07:37 localhost kernel: [ 101.644992] ata2.00: configured for
> UDMA/33
>
> Oct 26 20:07:37 localhost kernel: [ 101.644994] ata2: EH complete
>
> Oct 26 20:07:37 localhost kernel: [ 101.645006] sd 1:0:0:0: [sda] Write
> cache: disabled, read cache: enabled, doesn't support DPO or FUA


You should try and get some output from dmesg and not from the messages 
log, as the log daemon seems to have a nasty habit of discarding 
critical output from these errors. In this case the failing command is 
missing and the message ordering even seems off.
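
A minimal sketch of how such output could be collected, assuming the text of
dmesg has already been saved to a string (for example from "dmesg >
dmesg.txt"). The helper name and the choice of patterns are assumptions made
for illustration. The idea is to keep every libata and sd line in its
original order instead of cutting out pieces.

```python
import re

# Lines worth keeping: libata ports and devices ("ata2:", "ata2.00:"),
# SCSI disk messages ("sd 1:0:0:0:"), host registration ("scsi0 : sata_nv"),
# and the indented "cmd"/"res" continuation lines of an error report.
ATA_LINE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?"  # optional "[  101.581091] " timestamp
    r"(?:ata\d+(?:\.\d+)?:|sd \d+:\d+:\d+:\d+:|scsi\d+ :|\s*(?:cmd|res) )"
)


def ata_messages(dmesg_text):
    """Return the ATA/SCSI-disk related lines of a dmesg dump, in order."""
    return [line for line in dmesg_text.splitlines() if ATA_LINE.match(line)]
```

Posting the unfiltered dmesg output is still the safer choice when in doubt;
a filter like this only helps to spot the relevant section quickly.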





> The hardware appears to be correctly identified by the BIOS during the
> power up sequence.
>
> Not much is seen in the dmesg log except for:
>
> [ 43.649673] scsi0 : sata_nv
> [ 43.649722] scsi1 : sata_nv
> [ 43.649776] ata1: SATA max UDMA/133 cmd 0x9f0 ctl 0xbf0 bmdma 0xcc00 irq 19
> [ 43.649778] ata2: SATA max UDMA/133 cmd 0x970 ctl 0xb70 bmdma 0xcc08 irq 19


There should be more than this at the very least. As above, please try
to get output from dmesg itself.





> When I try to run a file system check on these devices I get:
>
> e2fsck 1.40.2 (12-Jul-2007)
> fsck.ext2: No such file or directory while trying to open /dev/sdb1
>
> The superblock could not be read or does not describe a correct ext2
> filesystem. If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate
> superblock:
>
> e2fsck -b 8193
>
> I have a gut feeling that when the system appears to lock up what is
> really going on is that the contents of the drive are being trashed. But
> I have no proof of that.


I don't think that is the case; more likely the drives have not been
detected at all. If this happens after a reboot when they were working
before, that most likely points to some kind of hardware issue.





> When I try to do a parted to see what the system thinks is on the drive
> I get the error message:
>
> Error: Error opening /dev/sdb: No medium found
>
> I am not having any problems with my EXT3 file systems located on
> “standard” IDE / PATA drives.
>
> My config file, which has not changed in months beyond taking the
> defaults during make oldconfig looks like:


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: Major SATA / EXT3 Issue?

2007-10-27 Thread Clemens Koller


Chris Holvenstot wrote:

> I am curious if anyone else has had major problems with SATA drives on
> the current series of kernels.  I have (or rather had) two SATA drives
> on my system - the first was a Maxtor MaxLine 500 and the second was a
> Maxtor MaxLine 250.
> [...]


If hard disks get too hot they die pretty quickly.
Current hard disks usually support SMART. Ask the disks what
they think is wrong:
http://smartmontools.sourceforge.net/
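
As a rough illustration of asking a disk for its own verdict, the Python
sketch below runs "smartctl -H" from smartmontools and picks out the overall
health line that ATA drives report. The function names are assumptions, the
device path is whatever your system uses, and running smartctl normally
requires root.

```python
import re
import subprocess

# The line "smartctl -H" prints for ATA drives, e.g.
# "SMART overall-health self-assessment test result: PASSED"
HEALTH_LINE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)


def parse_health(smartctl_output):
    """Return the health verdict (e.g. 'PASSED') or None if it is absent."""
    match = HEALTH_LINE.search(smartctl_output)
    return match.group(1) if match else None


def query_health(device):
    """Run 'smartctl -H <device>' and return the parsed verdict.

    Assumes smartctl is installed and the caller has sufficient privileges.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return parse_health(result.stdout)
```

A call such as query_health("/dev/sda") returns None when the verdict line
is missing, for instance when the drive did not answer at all. For the heat
question, "smartctl -A" prints the attribute table, which on many drives
includes a temperature reading.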

Then, check the disks with different hardware... I have seen
several issues with bad cables and broken hardware.

If you think that the problem is kernel related, you can boot
some other distro or change the kernel to see if this problem
still persists. Please post the dmesg output if you are sure
the problem is kernel related...

Regards,
--
Clemens Koller
___
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Str. 45/1
81379 Muenchen
Germany

http://www.anagramm-technology.com
Phone: +49-89-741518-50
Fax: +49-89-741518-19

Re: Major SATA / EXT3 Issue?

2007-10-27 Thread Alan Cox

On Fri, 26 Oct 2007 21:21:52 -0500
Chris Holvenstot <[EMAIL PROTECTED]> wrote:

> My SATA controller is integrated on my MSI motherboard and sports four
> ports.  It is implemented using the Nvidia CK804 chipset.  My processor
> is an AMD64 X2 4600+ running the 32 bit version of Linux.  
> 
> I have had these drives up and running for about six months.

You don't provide enough information to even guess. The whole of the
relevant part of the ata messages would be a lot more useful than the
cutting/pasting of bits you've provided.

The BIOS finding the drive is indicating it responds to identify and the
basic commands (and usually means your cabling is fine), but really the
rest of the trace is needed to see what occurs next. If the drive has
gone then the partition table read will fail and you won't
get /dev/sdb[something] for it.
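
Whether the kernel registered any partitions for a disk can be checked in
/proc/partitions, which lists every block device it knows about. The Python
sketch below is a hypothetical helper (the names are assumptions) that
returns the partitions found for one disk; an empty result for "sdb" would
match the situation described above.

```python
def partitions_of(disk, proc_partitions_text):
    """Return the partition names the kernel registered for disk (e.g. 'sdb')."""
    names = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        # The header row and blank lines are skipped here.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        name = fields[3]
        suffix = name[len(disk):]
        if name.startswith(disk) and suffix.isdigit():
            names.append(name)
    return names


def read_proc_partitions(path="/proc/partitions"):
    """Read the kernel's partition list as text."""
    with open(path) as handle:
        return handle.read()
```

On a live system, partitions_of("sdb", read_proc_partitions()) gives the
list for the second SCSI/SATA disk.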
