Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread STeve Andre'
On Tuesday 02 March 2010 13:35:23 J.C. Roberts wrote:
> On Tue, 02 Mar 2010 11:06:46 -0500 and...@msu.edu wrote:
> > Quoting "J.C. Roberts" :
> > > And I thought I was expected to be inconsistent. ;)
> > >
> > > Anyhow, I was upgrading from the Feb 2, to the most recent
> > > snapshot, and fsck is coming up with a problem on one of my
> > > partitions. I can probably get it working ("fix" is such a strong
> > > word) with `fsck -fy` but my real concern is if the drive is
> > > failing?
> > >
> > > atactl tells me everything is just fine?
> > >
> > > I have a nearly identical system, with the same type of disk, which
> > > reports similar atactl attributes... but then again, I don't really
> > > trust SATA/PATA drives very much or their supposedly "smart"
> > > monitoring.
> > >
> > > The data on the system is not only backed up, but it's also easily
> > > replaced since the machine is only used for src and ports builds. I
> > > think I might lose a total of a few newly downloaded distfiles
> > > since the last backup.
> > >
> > > What I really want to do here is understand *why* some portion of
> > > the disk has become unreadable?
> >
> > I've seen the smart system report errors and have had them become
> > true a few times, but far more often I've seen the damn things report
> > "No problem, Boss" and then die a little later...
> >
> > You could have any number of things going on giving you those read
> > errors.  With the right test jig from the manufacturer you'd likely
> > know. My guess would be an op amp for one of the heads isn't quite
> > working right.  The other possibility is that part of the disk wasn't
> > coated right and you have a weak spot, magnetically speaking.
> >
> > But you'll never really know.  At least you now have a new target, for
> > practice with.  Sabot rounds are great for little disks...
>
> Where's Nick and his nail gun when we need him?
>
> I got to work with some people from the "disk industry" and know how
> secretive they must be about how stuff actually works due to NDAs. I'd
> have better odds as a snowball in hell than getting the needed test
> equipment and docs from the vendor.

Very likely true, which is why I say that you have an interesting target!

--STeve Andre'



Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread J.C. Roberts
On Tue, 02 Mar 2010 11:06:46 -0500 and...@msu.edu wrote:

> Quoting "J.C. Roberts" :
> 
> > And I thought I was expected to be inconsistent. ;)
> >
> > Anyhow, I was upgrading from the Feb 2, to the most recent
> > snapshot, and fsck is coming up with a problem on one of my
> > partitions. I can probably get it working ("fix" is such a strong
> > word) with `fsck -fy` but my real concern is if the drive is
> > failing?
> >
> > atactl tells me everything is just fine?
> >
> > I have a nearly identical system, with the same type of disk, which
> > reports similar atactl attributes... but then again, I don't really
> > trust SATA/PATA drives very much or their supposedly "smart"
> > monitoring.
> >
> > The data on the system is not only backed up, but it's also easily
> > replaced since the machine is only used for src and ports builds. I
> > think I might lose a total of a few newly downloaded distfiles
> > since the last backup.
> >
> > What I really want to do here is understand *why* some portion of
> > the disk has become unreadable?
> 
> I've seen the smart system report errors and have had them become
> true a few times, but far more often I've seen the damn things report
> "No problem, Boss" and then die a little later...
> 
> You could have any number of things going on giving you those read
> errors.  With the right test jig from the manufacturer you'd likely
> know. My guess would be an op amp for one of the heads isn't quite
> working right.  The other possibility is that part of the disk wasn't
> coated right and you have a weak spot, magnetically speaking.
> 
> But you'll never really know.  At least you now have a new target, for
> practice with.  Sabot rounds are great for little disks...
> 

Where's Nick and his nail gun when we need him?

I got to work with some people from the "disk industry" and know how
secretive they must be about how stuff actually works due to NDAs. I'd
have better odds as a snowball in hell than getting the needed test
equipment and docs from the vendor.
-- 



Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread Philip Guenther
On Tue, Mar 2, 2010 at 8:06 AM,   wrote:
...
> I've seen the smart system report errors and have had them become
> true a few times, but far more often I've seen the damn things report
> "No problem, Boss" and then die a little later...

I seem to recall a USENIX paper from Google (perhaps for the FAST
conference?) in which they analyzed the failure statistics for their
server farms against things like the SMART stats and a bunch of other
stats they collected.  IIRC, they found SMART error reports to be
useful in predicting failure and that while there were some
correlations in other stats, the false positive rates would make using
them uneconomical.  Or something like that.  If you're concerned about
this disk failure then you should hunt up and read the actual
paper...and continue your research beyond that.


Philip Guenther
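
For readers who want to look past the overall "No SMART threshold exceeded" verdict, here is a minimal sketch of scanning `atactl wd0 readattr` style output for two things: a normalized value at or below its threshold, and a nonzero raw count on the sector-health attributes (5, 196, 197, 198). The function name `check_smart` and the choice of attribute IDs are assumptions made for illustration, and raw values are vendor-specific, so a warning here is a hint rather than a diagnosis. It assumes each data line ends with three fields: threshold, value, raw.

```shell
#!/bin/sh
# check_smart: hypothetical helper, not part of atactl.
# Reads attribute lines on stdin; each line is expected to look like
#   ID  Attribute name  Threshold  Value  Raw
# where the last three whitespace-separated fields are threshold, value, raw.
check_smart() {
    awk '
    $1 ~ /^[0-9]+$/ && NF >= 5 {
        id = $1 + 0; raw = $NF; value = $(NF-1); thresh = $(NF-2)
        # Rebuild the attribute name from the fields between ID and threshold.
        name = ""
        for (i = 2; i <= NF - 3; i++) name = name (i > 2 ? " " : "") $i
        if (value + 0 <= thresh + 0)
            printf "FAIL %s (%s): value %d <= threshold %d\n", id, name, value, thresh
        else if ((id == 5 || id == 196 || id == 197 || id == 198) && raw !~ /^0x0*$/)
            printf "WARN %s (%s): raw %s is nonzero\n", id, name, raw
    }'
}

# Sample input uses values taken from the readattr output quoted in this thread.
check_smart <<'EOF'
  3  Spin Up Time                   63  180  0x46f2
  5  Reallocated Sector Count       63  253  0x0007
197  Current Pending Sector Count    0  253  0x0001
EOF
```

On the sample lines this flags attributes 5 and 197, which fits the symptom being discussed: the drive reports no threshold exceeded, yet it appears to have already reallocated some sectors and has one still pending.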



Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread J.C. Roberts
On Tue, 02 Mar 2010 11:27:50 -0500 "Brad Tilley" 
wrote:

> > What I really want to do here is understand *why* some portion of
> > the disk has become unreadable?
> 
> 
> cd /bad_partition && dd if=/dev/zero of=big_file.zero bs=512 conv=sync,noerror
> 
> Let it run until it finishes. That won't explain why the sectors are
> bad, but it may give a good indication of the problem area and answer
> the failing drive question. If dd reports IO issues, you may want to
> replace the drive.
> 
> Brad

Thanks Brad. If it was an unnecessary partition, I'd do a destructive
overwrite to see what it does. Unfortunately, it's /usr/.

I'm going to toss a new disk in the box, and do a fresh install on the
new disk, so I can reliably play with the old one.
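
Since the partition in question is /usr, a read-only pass is one non-destructive way to see where the disk stops being readable. The sketch below is only that, a sketch: the function name `read_scan` is made up for illustration, and the device path in the usage comment (`/dev/rwd0c`, the raw whole-disk device for wd0) is an assumption about this particular machine. It reads every block and throws the data away, so nothing on the disk is modified.

```shell
#!/bin/sh
# read_scan: hypothetical helper for a read-only surface scan.
# Usage (as root, ideally in single user mode):
#   read_scan /dev/rwd0c
read_scan() {
    dev=$1
    # conv=noerror tells dd to keep going after a read error instead of
    # stopping; each error is reported on stderr along with block counts.
    # Output goes to /dev/null, so the source device is never written to.
    dd if="$dev" of=/dev/null bs=64k conv=noerror
}
```

Any read errors dd reports, together with whatever the kernel logs for wd0 at the same time, should give a rough idea of which region is affected without touching the data.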

-- 



Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread andres
Quoting "J.C. Roberts" :

> And I thought I was expected to be inconsistent. ;)
>
> Anyhow, I was upgrading from the Feb 2, to the most recent snapshot, and
> fsck is coming up with a problem on one of my partitions. I can probably
> get it working ("fix" is such a strong word) with `fsck -fy` but my real
> concern is if the drive is failing?
>
> atactl tells me everything is just fine?
>
> I have a nearly identical system, with the same type of disk, which
> reports similar atactl attributes... but then again, I don't really trust
> SATA/PATA drives very much or their supposedly "smart" monitoring.
>
> The data on the system is not only backed up, but it's also easily
> replaced since the machine is only used for src and ports builds. I think
> I might lose a total of a few newly downloaded distfiles since the last
> backup.
>
> What I really want to do here is understand *why* some portion of the
> disk has become unreadable?

I've seen the smart system report errors and have had them become
true a few times, but far more often I've seen the damn things report
"No problem, Boss" and then die a little later...

You could have any number of things going on giving you those read
errors.  With the right test jig from the manufacturer you'd likely know.
My guess would be an op amp for one of the heads isn't quite working
right.  The other possibility is that part of the disk wasn't coated right
and you have a weak spot, magnetically speaking.

But you'll never really know.  At least you now have a new target, for
practice with.  Sabot rounds are great for little disks...

--STeve Andre'



Re: fsck UNEXPECTED INCONSISTENCY

2010-03-02 Thread Brad Tilley
On Tue, 02 Mar 2010 07:50 -0800, "J.C. Roberts"
 wrote:
> And I thought I was expected to be inconsistent. ;)
> 
> Anyhow, I was upgrading from the Feb 2, to the most recent snapshot, and
> fsck is coming up with a problem on one of my partitions. I can probably
> get it working ("fix" is such a strong word) with `fsck -fy` but my real
> concern is if the drive is failing?
> 
> atactl tells me everything is just fine?
> 
> I have a nearly identical system, with the same type of disk, which
> reports similar atactl attributes... but then again, I don't really trust
> SATA/PATA drives very much or their supposedly "smart" monitoring.
> 
> The data on the system is not only backed up, but it's also easily
> replaced since the machine is only used for src and ports builds. I think
> I might lose a total of a few newly downloaded distfiles since the last
> backup.
> 
> What I really want to do here is understand *why* some portion of the
> disk has become unreadable?


cd /bad_partition && dd if=/dev/zero of=big_file.zero bs=512 conv=sync,noerror

Let it run until it finishes. That won't explain why the sectors are
bad, but it may give a good indication of the problem area and answer
the failing drive question. If dd reports IO issues, you may want to
replace the drive.

Brad

 
> All of the below were done in single user mode over serial.
> (sorry about the width)
> 
> 
> # atactl wd0 smartenable
> # atactl wd0 readattr
> Attributes table revision: 16
> ID   Attribute name                    Threshold  Value  Raw
>   3  Spin Up Time                             63    180  0x46f2
>   4  Start/Stop Count                          0    253  0x00d2
>   5  Reallocated Sector Count                 63    253  0x0007
>   6  Read Channel Margin                     100    253  0x
>   7  Seek Error Rate                           0    253  0x
>   8  Seek Time Performance                   187    253  0x9edb
>   9  Power-On Hours Count                      0    235  0xee5c
>  10  Spin Retry Count                        157    253  0x
>  11  Calibration Retry Count                 223    253  0x
>  12  Device Power Cycle Count                  0    253  0x00f0
> 192  Power-Off Retract Count                   0    253  0x
> 193  Load Cycle Count                          0    253  0x
> 194  Temperature                               0    253  0x000f
> 195  Hardware ECC Recovered                    0    253  0x170d
> 196  Reallocation Event Count                  0    253  0x
> 197  Current Pending Sector Count              0    253  0x0001
> 198  Off-Line Scan Uncorrectable Sect          0    253  0x
> 199  Ultra DMA CRC Error Count                 0    199  0x
> 200  Write Error Rate                          0    253  0x
> 201  Soft Read Error Rate                      0    253  0x
> 202  Data Address Mark Errors                  0    253  0x
> 203  Run Out Cancel                          180    253  0x0001
> 204  Soft ECC Correction                       0    253  0x
> 205  Thermal Asperity Check                    0    253  0x
> 207  Spin High Current                         0    253  0x
> 208  Spin Buzz                                 0    253  0x
> 209  Offline Seek Performance                  0    253  0x
>  99  Unknown                                   0    253  0x
> 100  Unknown                                   0    253  0x
> 101  Unknown                                   0    253  0x
> #
> 
> 
> # atactl wd0 smartstatus
> No SMART threshold exceeded
> # 
> 
> 
> # atactl wd0 identify
> Model:6Y250L6, Rev: YAR41BW0, Serial #: 
> Device type: ATA, fixed
> Cylinders: 16383, heads: 16, sec/track: 63, total sectors: 490234752
> Device capabilities:
> ATA standby timer values
> IORDY operation
> IORDY disabling
> Device supports the following standards:
> ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7 
> Device supports the following command sets:
> NOP command
> READ BUFFER command
> WRITE BUFFER command
> Host Protected Area feature set
> Read look-ahead
> Write cache
> Power Management feature set
> SMART feature set
> Flush Cache Ext command
> Flush Cache command
> Device Configuration Overlay feature set
> 48bit address feature set
> Automatic Acoustic Management feature set
> Set Max security extension commands
> Advanced Power Management feature set
> DOWNLOAD MICROCODE command
> SMART self-t