Folks,

I'm at a loss for my next plan of attack.  I've got a MythTV backend machine
that's been giving me fits for a week or two: spontaneous reboots at first,
and then consistent hangs.  The RAM checked out okay, and I've reseated all
the PCI cards, etc.

I was in recovery mode when I first saw this message (repeated over and over):

Feb 12 08:36:21 backend kernel: [43638.975058] ata4.00: status: { DRDY ERR }
Feb 12 08:36:21 backend kernel: [43638.975061] ata4.00: error: { UNC }
Feb 12 08:36:21 backend kernel: [43639.082514] ata4.00: configured for UDMA/133
Feb 12 08:36:21 backend kernel: [43639.082533] ata4: EH complete
Feb 12 08:36:24 backend kernel: [43641.901713] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 08:36:24 backend kernel: [43641.901722] ata4.00: BMDMA stat 0x4
Feb 12 08:36:24 backend kernel: [43641.901730] ata4.00: cmd c8/00:08:30:62:8c/00:00:00:00:00/e5 tag 0 dma 4096 in
Feb 12 08:36:24 backend kernel: [43641.901732]          res 51/40:00:34:62:8c/00:00:00:00:00/05 Emask 0x9 (media error)
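
For reference, the cmd and res fields in that log can be decoded by hand to
see which sector the drive is complaining about.  The following is only a
minimal sketch: it assumes the usual libata field order
(command-or-status/feature-or-error:nsect:lbal:lbam:lbah/hob bytes/device)
and a 28-bit LBA, which fits the c8 (READ DMA) command shown.  The function
and table names are made up for illustration.

```python
# Bit names as the kernel prints them in "status: { ... }" and "error: { ... }".
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flags(value, table):
    """Return the names of the bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_taskfile(tf):
    """Decode a libata 'cmd' or 'res' taskfile string (LBA28 only).

    Assumed layout: head/feature:nsect:lbal:lbam:lbah/<5 hob bytes>/device,
    where head is the command (cmd) or status (res) register and feature is
    the error register for res.  The hob bytes are ignored because they are
    only meaningful for 48-bit commands.
    """
    head, low, _hob, device = tf.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    dev = int(device, 16)
    # LBA28: the top four address bits live in the low nibble of the device register.
    lba28 = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {"head": int(head, 16), "feature": feature, "nsect": nsect, "lba28": lba28}


if __name__ == "__main__":
    cmd = decode_taskfile("c8/00:08:30:62:8c/00:00:00:00:00/e5")
    res = decode_taskfile("51/40:00:34:62:8c/00:00:00:00:00/05")
    print("requested:", cmd["nsect"], "sectors starting at LBA", cmd["lba28"])
    print("failed at LBA", res["lba28"])
    print("status:", flags(res["head"], STATUS_BITS))
    print("error:", flags(res["feature"], ERROR_BITS))
```

Under those assumptions the request is 8 sectors starting at LBA 93086256,
and the failure is reported at LBA 93086260, with status DRDY ERR and error
UNC (uncorrectable read).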

So I removed the drive from /etc/fstab and the machine stabilized very
nicely.  No problems yet, but I'm keeping my fingers crossed.

The drive is a Seagate 7200.11 500 GB (if that doesn't throw up red flags)
that is about 12-18 months old.  I was told it has one of the good firmware
versions, so it shouldn't be affected by the known firmware issues.

I checked the health of the drive with smartctl -H and it passed.  So then I
did some reading and decided to run some self-tests.  Here's the log:

r...@backend:~# smartctl -l selftest /dev/sdc
...
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     17123         93086260
# 2  Extended offline    Completed: read failure       90%     17120         93086260
# 3  Short offline       Completed: read failure       90%     17120         93086260

Not a good sign: 10% into the check and it fails, at the same LBA every
time.  But now I'm stuck.  From my reading of the situation, the drive could
just have corrupted data, and a low-level format could remap the sectors as
"good".  I don't necessarily need to get all my episodes of Fox's "24" off
this drive, so I could wipe it and not lose sleep (that's option B, though).
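
To put the failing LBA in perspective, here is a rough sketch that converts
it to a byte offset and shows how one sector could be read back directly.  It
assumes 512-byte logical sectors and the nominal 500 GB (500 * 10**9 bytes)
capacity, neither of which has been verified against this drive, and the
helper names are hypothetical.  It only reads.  As far as I understand it,
drives generally reallocate a bad sector when it is written, not when it is
read, so a read like this can confirm the problem but will not fix it.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of the start of sector `lba` from the start of the device."""
    return lba * sector_size


def read_sector(path, lba, sector_size=SECTOR_SIZE):
    """Read one sector from a device or image file.

    On a real disk, an unreadable sector should surface as OSError (EIO).
    Note that this goes through the page cache, so a sector that was read
    successfully earlier may be served from memory.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, sector_size, lba_to_offset(lba, sector_size))
    finally:
        os.close(fd)


if __name__ == "__main__":
    lba = 93086260                 # LBA_of_first_error from the self-test log
    nominal_capacity = 500 * 10**9  # assumption: nominal size, not the exact one
    offset = lba_to_offset(lba)
    print(offset, "bytes in,", f"{offset / nominal_capacity:.1%} of the drive")
    # Hypothetical usage against the real device (needs root):
    #   read_sector("/dev/sdc", lba)
```

With those assumptions the bad sector sits 47660165120 bytes in, roughly
9.5% of the way through the drive, which is at least consistent with the
self-tests stopping at 90% remaining.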

The drive is formatted xfs, but xfs_check reports no problems.

So which is it: do I have bad data or a bad drive?  Do I need more
information before I can diagnose it?  I plan on burning a SeaTools CD this
afternoon to see whether it can diagnose the drive hardware.

From what I can tell, this could just be a side effect of the spontaneous
reboots, but the system is very stable right now with the drive unmounted,
so I really think the drive was the cause of the reboots, not an effect.

Brian

--------------------
BYU Unix Users Group 
http://uug.byu.edu/ 

The opinions expressed in this message are the responsibility of their
author.  They are not endorsed by BYU, the BYU CS Department or BYU-UUG. 
___________________________________________________________________
List Info (unsubscribe here): http://uug.byu.edu/mailman/listinfo/uug-list
