Weird disk problem

2014-06-05 Thread Christian Weisgerber
I have a 3TB disk here...

sd1 at scsibus1 targ 1 lun 0:  SCSI3 0/direct 
fixed naa.5000cca225c5fbeb
sd1: 2861588MB, 512 bytes/sector, 5860533168 sectors

... that's serving as a general media dump with a single FFS2 file
system on it.

Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/sd1d      2.7T    2.5T   63.7G    98%    /export

Yesterday, I experienced the odd effect that reading some files,
or parts of files, from that disk became excruciatingly slow.  We're
talking a few kB/s here.  Other files were fine.  There were no
kernel errors/warnings whatsoever.  There were no read errors, the
disk was just 100% busy and appeared to be returning data drip by
drip.

# atactl sd1 smartstatus
No SMART threshold exceeded

No change on reboot.  dd(1) from the raw device was initially fast,
then slowed to a crawl as it progressed.  I eventually "fixed" it
all by powering off the machine, jiggling the SATA connectors (all
fine), and powering the machine back up.

Tonight the problem is back.  Something is very wrong.  Given that
dd if=/dev/rsd1c also seems affected, the filesystem layer can be
excluded.  I won't cry too much over a dying disk, but why the heck
are there no error indications of any kind?

Any other ideas?
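
A minimal sketch of one way to narrow this down, assuming the slowdown is tied to particular regions of the disk: read the device sequentially in fixed-size chunks, time each read, and note the offsets where throughput collapses. The function name `profile_reads` and its parameters are made up for illustration; the 1 MB/s threshold is an arbitrary choice, not a known limit for this drive.

```python
import time


def profile_reads(path, chunk_size=1 << 20, max_chunks=None, slow_below=1_000_000):
    """Read `path` sequentially and time every chunk.

    Returns (total_bytes_read, slow_chunks), where slow_chunks is a list of
    (offset, bytes_per_second) for chunks slower than `slow_below` bytes/s.
    """
    slow = []
    offset = 0
    count = 0
    # buffering=0 disables Python-level buffering. The OS may still cache a
    # regular file, so a raw device is the more meaningful target.
    with open(path, "rb", buffering=0) as dev:
        while max_chunks is None or count < max_chunks:
            start = time.monotonic()
            data = dev.read(chunk_size)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of device or file
            rate = len(data) / elapsed if elapsed > 0 else float("inf")
            if rate < slow_below:
                slow.append((offset, rate))
            offset += len(data)
            count += 1
    return offset, slow
```

Called as `profile_reads("/dev/rsd1c")`, this would walk the whole raw device, which takes as long as the dd(1) run does. If the slow offsets cluster in the same places on every run, that points at the media rather than at cabling or the controller.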

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: Weird disk problem

2014-06-05 Thread David Vasek

On Thu, 5 Jun 2014, Christian Weisgerber wrote:


> I have a 3TB disk here...
>
> sd1 at scsibus1 targ 1 lun 0:  SCSI3 0/direct 
> fixed naa.5000cca225c5fbeb
> sd1: 2861588MB, 512 bytes/sector, 5860533168 sectors
>
> ... that's serving as a general media dump with a single FFS2 file
> system on it.
>
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/sd1d      2.7T    2.5T   63.7G    98%    /export
>
> Yesterday, I experienced the odd effect that reading some files,
> or parts of files, from that disk became excruciatingly slow.  We're
> talking a few kB/s here.  Other files were fine.  There were no
> kernel errors/warnings whatsoever.  There were no read errors, the
> disk was just 100% busy and appeared to be returning data drip by
> drip.
>
> # atactl sd1 smartstatus
> No SMART threshold exceeded
>
> No change on reboot.  dd(1) from the raw device was initially fast,
> then slowed to a crawl as it progressed.  I eventually "fixed" it
> all by powering off the machine, jiggling the SATA connectors (all
> fine), and powering the machine back up.
>
> Tonight the problem is back.  Something is very wrong.  Given that
> dd if=/dev/rsd1c also seems affected, the filesystem layer can be
> excluded.  I won't cry too much over a dying disk, but why the heck
> are there no error indications of any kind?
>
> Any other ideas?


Did you try smartctl from smartmontools for a more detailed report?

My favourites are:

smartctl -a /dev/sd1c
smartctl -l scttemp /dev/sd1c

smartctl -t short /dev/sd1c
smartctl -t long /dev/sd1c (will take several hours!!!)

smartctl -a /dev/sd1c (again after each of the tests)


Regards,
David



Re: Weird disk problem

2014-06-05 Thread STeve Andre'

On 06/05/14 17:38, Christian Weisgerber wrote:

> I have a 3TB disk here...
>
> sd1 at scsibus1 targ 1 lun 0:  SCSI3 0/direct 
> fixed naa.5000cca225c5fbeb
> sd1: 2861588MB, 512 bytes/sector, 5860533168 sectors
>
> ... that's serving as a general media dump with a single FFS2 file
> system on it.
>
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/sd1d      2.7T    2.5T   63.7G    98%    /export
>
> Yesterday, I experienced the odd effect that reading some files,
> or parts of files, from that disk became excruciatingly slow.  We're
> talking a few kB/s here.  Other files were fine.  There were no
> kernel errors/warnings whatsoever.  There were no read errors, the
> disk was just 100% busy and appeared to be returning data drip by
> drip.
>
> # atactl sd1 smartstatus
> No SMART threshold exceeded
>
> No change on reboot.  dd(1) from the raw device was initially fast,
> then slowed to a crawl as it progressed.  I eventually "fixed" it
> all by powering off the machine, jiggling the SATA connectors (all
> fine), and powering the machine back up.
>
> Tonight the problem is back.  Something is very wrong.  Given that
> dd if=/dev/rsd1c also seems affected, the filesystem layer can be
> excluded.  I won't cry too much over a dying disk, but why the heck
> are there no error indications of any kind?
>
> Any other ideas?



I think you are relying on the SMART system too much.  Certainly try
what David said, but it's obvious that the disk is sick despite what
SMART may say.

I've had about seven disk failures in the last several years.  For three
or four of them the SMART system was absolutely correct, with the others
being less informative.  I've also had a false notice that a disk was bad,
yet it worked for several years, until it got too small for its task.

SMART is good, but it has its limitations.  It deals best with gradual
errors, not fast catastrophic ones.

--STeve Andre'



Re: Weird disk problem

2014-06-05 Thread Shawn K. Quinn
On Thu, Jun 5, 2014, at 05:24 PM, STeve Andre' wrote:
> On 06/05/14 17:38, Christian Weisgerber wrote:
> > I have a 3TB disk here...
> >
> > sd1 at scsibus1 targ 1 lun 0:  SCSI3 0/direct 
> > fixed naa.5000cca225c5fbeb
> > sd1: 2861588MB, 512 bytes/sector, 5860533168 sectors
> >
> > ... that's serving as a general media dump with a single FFS2 file
> > system on it.
> >
> > Filesystem     Size    Used   Avail Capacity  Mounted on
> > /dev/sd1d      2.7T    2.5T   63.7G    98%    /export
> >
> > Yesterday, I experienced the odd effect that reading some files,
> > or parts of files, from that disk became excruciatingly slow.  We're
> > talking a few kB/s here.  Other files were fine.  There were no
> > kernel errors/warnings whatsoever.  There were no read errors, the
> > disk was just 100% busy and appeared to be returning data drip by
> > drip.
> >
> > # atactl sd1 smartstatus
> > No SMART threshold exceeded
> >
> > No change on reboot.  dd(1) from the raw device was initially fast,
> > then slowed to a crawl as it progressed.  I eventually "fixed" it
> > all by powering off the machine, jiggling the SATA connectors (all
> > fine), and powering the machine back up.
> >
> > Tonight the problem is back.  Something is very wrong.  Given that
> > dd if=/dev/rsd1c also seems affected, the filesystem layer can be
> > excluded.  I won't cry too much over a dying disk, but why the heck
> > are there no error indications of any kind?
> >
> > Any other ideas?

Anything in dmesg/kernel log about operations timing out?
 
> I think you are relying on the SMART system too much.  Certainly try
> what David said, but it's obvious that the disk is sick despite what
> SMART may say.
> 
> I've had about seven disk failures in the last several years.  For three
> or four of them the SMART system was absolutely correct, with the others
> being less informative.  I've also had a false notice that a disk was bad,
> yet it worked for several years, until it got too small for its task.
> 
> SMART is good, but it has its limitations.  It deals best with gradual
> errors, not fast catastrophic ones.

Running smartmontools should give you enough information to determine if
you have a sick disk, though it may require looking at the values and
seeing if you have a rise in e.g. the number of sectors remapped; I
would not trust "atactl sd# smartstatus" by itself. Failing that, there
are more time-honored empirical tests, such as assuming the worst for
the disk's health if it is making weird noises when it slows to a crawl.

It could also be either the SATA cabling or the SATA controller that is
having trouble after warming up (with specific bit patterns, or just in
general). I know that sounds weird, but SATA cables aren't that
expensive to replace and it's quite possible the OP got a dud.

-- 
  Shawn K. Quinn
  skqu...@rushpost.com



Re: Weird disk problem

2014-06-08 Thread Christian Weisgerber
On 2014-06-05, David Vasek wrote:

> Did you try smartctl from smartmontools for a more detailed report?

I assume there is a 1000-page SMART spec somewhere that would come
in handy for interpreting the responses?

> My favourite are:
>
> smartctl -a /dev/sd1c
> smartctl -l scttemp /dev/sd1c

Temperature is fine, never exceeded the limits.

> smartctl -t short /dev/sd1c

Not supported, it seems.

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: Weird disk problem

2014-06-08 Thread Christian Weisgerber
On 2014-06-05, "STeve Andre'" wrote:

> I think you are relying on the SMART system too much.

Not at all, but I knew people would immediately direct me to it.

> Certainly try what David said, but it's obvious that the disk is
> sick despite what SMART may say.

I got a replacement disk and I'm now trying to get the data off the
old one.  (Nothing really important.)  That is proceeding fitfully.
There are spurts of 65 MB/s and then there are stretches of XXX
kB/s, XX kB/s, down to 5 kB/s.  At the current average rate it will
be going for five or six days, assuming the disk survives that long.
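
As a rough cross-check of that estimate, here is the average rate that five or six days would imply. This is only a sketch: it assumes the whole device is being read (the sector count from the dmesg line above) rather than just the 2.5T of files, and `implied_rate` is a made-up helper name.

```python
SECTORS = 5860533168   # sector count reported for sd1
SECTOR_SIZE = 512      # bytes per sector
TOTAL_BYTES = SECTORS * SECTOR_SIZE  # 3000592982016 bytes, about 3 TB


def implied_rate(days, total_bytes=TOTAL_BYTES):
    """Average bytes per second needed to read total_bytes in `days` days."""
    return total_bytes / (days * 86400)


for days in (5, 6):
    print(f"{days} days -> {implied_rate(days) / 1e6:.1f} MB/s average")
```

Under those assumptions the average works out to roughly 6 to 7 MB/s, about a tenth of the 65 MB/s spurts.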

Whatever's wrong with it, it's a tenacious little bugger.  There
still hasn't been a single hard read error.  Anyway, I guess we can
close the topic.

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: Weird disk problem

2014-06-10 Thread David Vasek

On Sun, 8 Jun 2014, Christian Weisgerber wrote:


> On 2014-06-05, David Vasek wrote:
>
>> Did you try smartctl from smartmontools for a more detailed report?
>
> I assume there is a 1000-page SMART spec somewhere that would come
> in handy for interpreting the responses?


I'm not an expert. But I believe there are some reading this mailing list.

There is a description of the interface available, but I don't think it 
can help you to interpret the numbers.


ftp://ftp.t10.org/t13/docs2004/D1699-ATA8-ACS.pdf
http://www.hgst.com/tech/techlib.nsf/techdocs/EF593BD721D5D2768825782D000B8111/$file/DS7K3000_US7K3000_SATA_OEMSpecRev1.3.pdf
(beware of the $ character in the url)

What I usually care about are attributes like Reallocated_Sector_Ct, 
Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable, 
Spin_Retry_Count, UDMA_CRC_Error_Count. I monitor my drives in the long 
term and watch if any of these values rises. And of course, the SMART 
Error Log is important.
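
A minimal sketch of that kind of long-term monitoring, assuming the attribute table comes from saved `smartctl -A` output with the usual ten columns (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The helper names `parse_attributes` and `risen` are made up for illustration, and only plain integer raw values are handled.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Spin_Retry_Count",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # header lines and other text are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer
    return attrs


def risen(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched attributes whose raw value rose."""
    changes = {}
    for name in watched:
        old = previous.get(name, 0)
        new = current.get(name)
        if new is not None and new > old:
            changes[name] = (old, new)
    return changes
```

Saving the output of `smartctl -A /dev/sd1c` at intervals and passing two snapshots through `parse_attributes` and `risen` would show which of the watched counters moved between them.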


As for the other attributes such as Raw_Read_Error_Rate, 
Throughput_Performance and Seek_Error_Rate, every vendor seems to use 
them in a different way.


Btw, the model of Hitachi drive you have problems with is said to be one 
of the most reliable hard drives.


http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/
http://www.hgst.com/tech/techlib.nsf/techdocs/EC6D440C3F64DBCC8825782300026498/$file/US7K3000_ds.pdf
http://www.hgst.com/tech/techlib.nsf/products/Ultrastar_7K3000


>> smartctl -t short /dev/sd1c
>
> Not supported, it seems.


That is surprising; all the Hitachi hard drives I have support the short 
test. If it isn't a secret, could I get the 'smartctl -a' output from your 
drive for comparison? Thanks.


Regards,
David