Re: [PLUG] disk problem

Tim Mon, 23 Feb 2009 10:39:55 -0800

Denis,

> dmsg shows Ext-3 fs error (stb1); remounting file system read only


This is a pretty clear indication that either your drive is dying or you
did something very bad to your filesystem (many hard reboots,
accidental writes to the raw disk device, etc).  I'm thinking the most
likely is that your drive is dying, since EXT3 is pretty good about
recovering from hard reboots.

To help verify this, check through your dmesg/kernel logs for more
drive failure information.  For instance, the Linux kernel typically
prints out something similar to the following when there are signs of
hardware failures:

hdd: set_multmode: status=0x61 { DriveReady DeviceFault Error }
hdd: set_multmode: error=0x04 { DriveStatusError }
hdd: recal_intr: status=0x61 { DriveReady DeviceFault Error }
hdd: recal_intr: error=0x04 { DriveStatusError }
ide1: reset: success 

If you see these, then surely the kernel is having trouble with the
hardware.  The two main explanations are either that the drive is flat
out going bad and is returning CRC errors on sector reads, OR your
cabling is bad and there are CRC errors in messages between the drive
and the controller.  (I'm sure there are other possible explanations,
but I know how to check for these two fairly common issues easily.)
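
If you have a lot of log output to wade through, something like the
following Python sketch can pull out the suspicious lines for you.  The
marker strings are just the ones from my sample output above; the
function name and the marker list are my own illustrative assumptions,
not a complete catalogue of kernel error messages.

```python
import re

# Substrings taken from the sample kernel messages above. This list is an
# assumption for illustration; real failures can print other messages.
ERROR_MARKERS = ("DriveStatusError", "DeviceFault")

# Optional "[ 12.345678]" timestamp, then the device name before the colon.
DEVICE_RE = re.compile(r"\s*(?:\[[^\]]*\]\s*)?(\w+):")


def find_drive_errors(log_text):
    """Return (device, line) pairs for log lines containing a known marker."""
    hits = []
    for line in log_text.splitlines():
        if any(marker in line for marker in ERROR_MARKERS):
            match = DEVICE_RE.match(line)
            device = match.group(1) if match else "unknown"
            hits.append((device, line.strip()))
    return hits


SAMPLE = """\
hdd: set_multmode: status=0x61 { DriveReady DeviceFault Error }
hdd: set_multmode: error=0x04 { DriveStatusError }
ide1: reset: success
"""

for device, line in find_drive_errors(SAMPLE):
    print(device, "->", line)
```

To use it on a real machine, pass it the text of your saved dmesg
output instead of SAMPLE.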

To get more information about such errors, I suggest you use smartctl
(a part of the smartmontools package on Debian) on the drive.  For
instance, on my drive I get:

# smartctl -a /dev/sda
<... snip ...>
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       7497130
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       427
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       41336965
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6663
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       419
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   051   045    Old_age   Always       -       39 (Lifetime Min/Max 21/39)
194 Temperature_Celsius     0x0022   039   049   000    Old_age   Always       -       39 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   079   057   000    Old_age   Always       -       85764379
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
<... /snip ...>

Each drive will output slightly different values.  I don't know what
many of these mean, and on my drive there are a lot of questionably
high values...  However, the two that are most important in my output
are Reallocated_Sector_Ct and UDMA_CRC_Error_Count.  Anything above 0 in
the former may be indicative of a drive being on its last legs
(according to a Google research paper).  If you have a large number of
UDMA CRC errors, that may be indicative of a bad cable.  Other
attributes may be useful if you do the research on them.
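
If you'd rather not eyeball the table, here is a rough Python sketch
that checks just those two attributes.  It assumes the attribute table
as smartctl prints it in a terminal (one attribute per line, raw value
in the tenth column).  The function names and the crc_limit of 100 are
my own assumptions; I don't know of an official cutoff for what counts
as "a large number" of CRC errors.

```python
def parse_smart_attributes(text):
    """Map attribute name -> leading integer of RAW_VALUE from smartctl's table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # anything else (headers, snipped sections) is skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def check_drive(attrs, crc_limit=100):
    """Return warnings for the two attributes discussed above.

    crc_limit is an arbitrary illustrative threshold, not a documented one.
    """
    problems = []
    if attrs.get("Reallocated_Sector_Ct", 0) > 0:
        problems.append("reallocated sectors present: drive may be failing")
    if attrs.get("UDMA_CRC_Error_Count", 0) > crc_limit:
        problems.append("many UDMA CRC errors: suspect the cable")
    return problems


# Three rows copied from the smartctl output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   051   045    Old_age   Always       -       39 (Lifetime Min/Max 21/39)
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

attrs = parse_smart_attributes(SAMPLE)
print(check_drive(attrs) or "nothing alarming in the two attributes checked")
```

An empty result only means these two counters look fine; it says
nothing about the other attributes.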

For smartctl and possibly for reading your kernel logs, you will need to
have root privileges.


> The drive shows one locked folder (lost+found), but it has a modified date of 
> 19June2008. I have been unable to look inside.
> 
> Is lost+found where the error/recovery resides?
> Is this a warning that the drive may give up the ghost real soon now?

The lost+found folder is where orphaned files are placed after a fsck.
Orphaned files are those which were not deleted normally, but were
unlinked from their parent directory... in other words, files that are
still on disk but were forgotten about.  You should look through these
files to
see if anything of importance is there.  Commonly they'll just be
temporary files (perhaps generated by firefox and the like).  I
encourage you to use the "file" command on them to determine the file
content type, then use an appropriate tool to open the ones you can.
Note that you must have root privileges to look at files in this
directory.
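
If there are a lot of entries, a small Python sketch like this one can
run "file" over the whole directory for you.  It assumes the file(1)
command is installed and on your PATH; the function name
describe_orphans is just something I made up for illustration.

```python
import os
import subprocess


def describe_orphans(directory, file_cmd="file"):
    """Return (path, description) pairs by running file(1) on each entry."""
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # --brief prints only the detected type, without repeating the name.
        proc = subprocess.run(
            [file_cmd, "--brief", path],
            capture_output=True,
            text=True,
            check=False,
        )
        results.append((path, proc.stdout.strip()))
    return results
```

Run as root, you would call describe_orphans() with the path to the
lost+found directory on the affected filesystem and print each pair it
returns.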

Good luck,
tim
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug
