Re: impending disk failure?

2015-10-20 Thread Miles Fidelman

See below

Tony van der Hoff wrote:

On 17/10/15 17:47, Miles Fidelman wrote:

Dominique Dumont wrote:

On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:

Can anyone please explain what it means, and whether I should be
worried?

You should check the drive with smartctl.

See http://www.smartmontools.org/

HTH


Yes.. and be sure to go beyond the basic tests.

First off, make sure it's running:
smartctl -s on -A /dev/disk0   # for each drive, using the appropriate /dev/..

Then, after it's accumulated some stats:
smartctl -A /dev/disk0

For a lot of drives, the first line (raw read errors) can be very
telling: anything other than 0 suggests your disk is failing.
Start-up time can be telling too, if it's increasing.

The thing is that most drives, except those designed for use in RAID
arrays, mask impending disk failures by re-reading blocks multiple
times. They often get the data eventually, but your machine keeps
getting slower and slower.
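The two-step check above produces a fixed-width table from `smartctl -A`. As a minimal sketch, assuming the usual ten-column attribute layout, here is one way to pull out the raw values and flag the attributes discussed in this thread (the sample table is made up for illustration, not real drive data):

```python
# Sketch: parse `smartctl -A`-style output and flag a few attributes
# mentioned in this thread. The sample text below is illustrative only.

def parse_smart_attributes(text):
    """Map attribute name -> raw value from the 10-column smartctl -A table."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs

def worrying(attrs):
    """Return the checked attribute names whose raw value is nonzero."""
    checked = ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct",
               "Current_Pending_Sector", "Offline_Uncorrectable")
    return [name for name in checked if attrs.get(name, 0) != 0]

SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       117797450
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

attrs = parse_smart_attributes(SAMPLE)
print(worrying(attrs))  # -> ['Raw_Read_Error_Rate', 'Current_Pending_Sector']
```

As later replies in this thread point out, a nonzero Raw_Read_Error_Rate is vendor-specific (often meaningless on Seagate drives), so a script like this is a rough first filter, not a verdict.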




Thanks Miles, and tomás, for your helpful replies.

I apologise for the delay in replying, but I've been away from my desk 
a few days.


I have, however, been doing some extensive googling, and it would appear 
that the raw read error count is something of a red herring, 
especially when applied to Seagate drives, as these are. Both my 
drives have quite high RREC values (in the millions), numbers which are 
precisely matched by the Hardware ECC Recovered counts, suggesting 
that the RREC is merely an artifact of HDDs being essentially 
mechanical devices pushed to their limits by clever technology. 
The SMART extended tests reveal no problems.


The Wikipedia entry https://en.wikipedia.org/wiki/S.M.A.R.T. is 
particularly informative on the relative importance of these error 
counts; the RREC can be safely ignored, as somebody else here recently 
suggested.


You're missing the point.

As the Wikipedia article also points out:
"Mechanical failures account for about 60% of all drive failures." and "Further, 36% 
of drives failed without recording any S.M.A.R.T. error at all, except 
the temperature, meaning that S.M.A.R.T. data alone was of limited 
usefulness in anticipating failures."


Today's disk drives are designed to PROTECT DATA, AND MAINTAIN ACCESS TO 
DATA, until the very moment before the drive fails catastrophically.  
The "Hardware ECC Recovered Count" indicates that:
- there are likely problems with the underlying media that the ECC 
is recovering from, and they will only get worse over time
- the recovery takes time, hence the reason your system is slowing down; 
the more underlying errors, the more time it takes to recover


I've never found SMART extended tests to be indicative of anything 
until a disk is nearly dead, though 
http://www.z-a-recovery.com/manual/smart.aspx gives a good list of other 
SMART variables that might indicate mechanical failures.


If your drives are a couple of years old and your machine is getting 
slower, don't engage in wishful thinking: back up and get new drives.


Miles

--
In theory, there is no difference between theory and practice.
In practice, there is.    Yogi Berra



Re: impending disk failure?

2015-10-20 Thread Mirko Parthey
On Sat, Oct 17, 2015 at 02:15:52PM +0100, Tony van der Hoff wrote:
> Hi,
> 
> I'm occasionally getting this message in syslog on my jessie box:
> 
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600489] ata3.00: exception Emask 0x10
> SAct 0x10 SErr 0x40 action 0x6 frozen
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600501] ata3.00: irq_stat 0x0800,
> interface fatal error
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600509] ata3: SError: { Handshk }
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600517] ata3.00: failed command:
> WRITE FPDMA QUEUED
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600531] ata3.00: cmd
> 61/c0:20:90:6e:eb/01:00:22:00:00/40 tag 4 ncq 229376 out
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600531]  res
> 40/00:20:90:6e:eb/00:00:22:00:00/40 Emask 0x10 (ATA bus error)
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600538] ata3.00: status: { DRDY }
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600547] ata3: hard resetting link
> Oct 17 12:00:20 tony-lx kernel: [ 8839.092521] ata3: SATA link up 6.0 Gbps
> (SStatus 133 SControl 300)
> Oct 17 12:00:20 tony-lx kernel: [ 8839.099092] ata3.00: configured for
> UDMA/133
> Oct 17 12:00:20 tony-lx kernel: [ 8839.099121] ata3: EH complete
> 
> 
> Can anyone please explain what it means, and whether I should be worried?

To find out whether this error is caused by the hard disk or by other components,
I'd suggest moving the disk to a different computer.

If the errors remain, your disk is failing.
Otherwise, swap components one by one.
A good next candidate would be the power supply unit.
Also, check the mainboard capacitors for a bulged top or leakage.

Regards,
Mirko



Re: impending disk failure?

2015-10-20 Thread Tony van der Hoff

On 20/10/15 11:22, Ondřej Grover wrote:

Does this error occur by any chance after resuming from a suspended state?


No, this machine doesn't suspend.

But thanks, anyway.


--
Tony van der Hoff| mailto:t...@vanderhoff.org
Buckinghamshire, England |



Re: impending disk failure?

2015-10-20 Thread Ondřej Grover
Does this error occur by any chance after resuming from a suspended state? I
had a similar problem because of some faulty drivers; setting
echo 0 > /sys/power/pm_async
makes sure that drivers do not resume asynchronously, which might fix the
problem.
Or can it be correlated with any other system events? It might help to attach
the syslog entries with more context before and after the errors.

Kind regards,
Ondřej Grover

On Tue, Oct 20, 2015 at 12:02 PM, Tony van der Hoff 
wrote:

> On 17/10/15 17:47, Miles Fidelman wrote:
>
>> Dominique Dumont wrote:
>>
>>> On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:
>>>
 Can anyone please explain what it means, and whether I should be
 worried?

>>> You should check the drive with smartctl.
>>>
>>> See http://www.smartmontools.org/
>>>
>>> HTH
>>>
>>> Yes.. and be sure to go beyond the basic tests.
>>
>> First off, make sure it's running:
>> smartctl -s on -A /dev/disk0   # for each drive, using the appropriate /dev/..
>>
>> Then, after it's accumulated some stats:
>> smartctl -A /dev/disk0
>>
>> For a lot of drives, the first line (raw read errors) can be very
>> telling: anything other than 0 suggests your disk is failing.
>> Start-up time can be telling too, if it's increasing.
>>
>> The thing is that most drives, except those designed for use in RAID
>> arrays, mask impending disk failures by re-reading blocks multiple
>> times. They often get the data eventually, but your machine keeps
>> getting slower and slower.
>>
>>
>
> Thanks Miles, and tomás, for your helpful replies.
>
> I apologise for the delay in replying, but I've been away from my desk a
> few days.
>
> I have, however, been doing some extensive googling, and it would appear
> that the raw read error count is something of a red herring, especially
> when applied to Seagate drives, as these are. Both my drives have quite
> high RREC values (in the millions), numbers which are precisely matched by the
> Hardware ECC Recovered counts, suggesting that the RREC is merely an
> artifact of HDDs being essentially mechanical devices pushed to their
> limits by clever technology. The SMART extended tests reveal no problems.
>
> The Wikipedia entry https://en.wikipedia.org/wiki/S.M.A.R.T. is
> particularly informative on the relative importance of these error counts;
> the RREC can be safely ignored, as somebody else here recently suggested.
>
> So, back to the original problem; I think tomás hit the nail on the head.
> I've re-plugged the SATA cables, to no great effect; I have now ordered a
> couple of new cables, and will see whether that helps.
>
> Thanks again to all.
>
>
>
>
> --
> Tony van der Hoff| mailto:t...@vanderhoff.org
> Buckinghamshire, England |
>
>


Re: impending disk failure?

2015-10-20 Thread Tony van der Hoff

On 17/10/15 17:47, Miles Fidelman wrote:

Dominique Dumont wrote:

On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:

Can anyone please explain what it means, and whether I should be
worried?

You should check the drive with smartctl.

See http://www.smartmontools.org/

HTH


Yes.. and be sure to go beyond the basic tests.

First off, make sure it's running:
smartctl -s on -A /dev/disk0   # for each drive, using the appropriate /dev/..

Then, after it's accumulated some stats:
smartctl -A /dev/disk0

For a lot of drives, the first line (raw read errors) can be very
telling: anything other than 0 suggests your disk is failing.
Start-up time can be telling too, if it's increasing.

The thing is that most drives, except those designed for use in RAID
arrays, mask impending disk failures by re-reading blocks multiple
times. They often get the data eventually, but your machine keeps
getting slower and slower.




Thanks Miles, and tomás, for your helpful replies.

I apologise for the delay in replying, but I've been away from my desk a 
few days.


I have, however, been doing some extensive googling, and it would appear 
that the raw read error count is something of a red herring, especially 
when applied to Seagate drives, as these are. Both my drives have quite 
high RREC values (in the millions), numbers which are precisely matched by 
the Hardware ECC Recovered counts, suggesting that the RREC is merely an 
artifact of HDDs being essentially mechanical devices pushed to 
their limits by clever technology. The SMART extended tests reveal no 
problems.


The Wikipedia entry https://en.wikipedia.org/wiki/S.M.A.R.T. is 
particularly informative on the relative importance of these error 
counts; the RREC can be safely ignored, as somebody else here recently 
suggested.


So, back to the original problem; I think tomás hit the nail on the 
head. I've re-plugged the SATA cables, to no great effect; I have now 
ordered a couple of new cables, and will see whether that helps.


Thanks again to all.



--
Tony van der Hoff| mailto:t...@vanderhoff.org
Buckinghamshire, England |



Re: impending disk failure?

2015-10-18 Thread Himanshu Shekhar
Hi!
You can use Disk Utility (gnome-disks / udisks) to check your hard drive
for errors or bad sectors.
That seems simple enough!

Regards
Himanshu Shekhar
IIIT-Allahabad
IRM2015006


Re: impending disk failure?

2015-10-18 Thread Dominique Dumont
On Saturday 17 October 2015 12:47:36 Miles Fidelman wrote:
> For a lot of drives, the first line - raw read errors, can be very telling -
> anything other than 0, and your disk is failing.

Sorry, the FAQ [1] on smartmontools.org does not agree with your statement:

* What details can be interpreted from Raw read error rate?
  If no documentation is available, the RAW value of attribute 1 is typically useless.
  The 48-bit field might encode several values; try -v 1,hex48 to check.
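To see what such a packed field might look like, here is a hypothetical decoding sketch: some drives are reported to pack more than one counter into the 48-bit raw value (e.g. a 16-bit error count in the upper bits and a 32-bit operation count below it). The 16/32-bit split chosen here is an assumption for illustration, not vendor documentation; only `-v 1,hex48` plus the vendor's datasheet can tell you the real layout.

```python
# Hypothetical sketch: split a 48-bit SMART raw value into two sub-fields.
# The 16/32-bit split is an assumed layout for illustration only; real
# drives may encode the field differently (check with `smartctl -v 1,hex48`).

def split_hex48(raw):
    """Interpret a 48-bit raw value as (upper 16 bits, lower 32 bits)."""
    assert 0 <= raw < 1 << 48, "SMART raw values are 48 bits wide"
    upper16 = (raw >> 32) & 0xFFFF
    lower32 = raw & 0xFFFFFFFF
    return upper16, lower32

# A huge-looking raw value may really be two small counters side by side:
raw = (3 << 32) | 117797450          # pack 3 "errors" above ~118M "operations"
print(split_hex48(raw))              # -> (3, 117797450)
print(f"{raw:012x}")                 # -> 00030705724a
```

This is why a raw value "in the millions" need not mean millions of errors: the interesting part may be a few high-order bits.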

All the best

[1] 
http://www.smartmontools.org/wiki/FAQ#WhatdetailscanbeinterpretedfromRawreaderrorrate
-- 
 https://github.com/dod38fr/   -o- http://search.cpan.org/~ddumont/
http://ddumont.wordpress.com/  -o-   irc: dod at irc.debian.org



Re: impending disk failure?

2015-10-17 Thread Miles Fidelman

Dominique Dumont wrote:

On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:

Can anyone please explain what it means, and whether I should be worried?

You should check the drive with smartctl.

See http://www.smartmontools.org/

HTH


Yes.. and be sure to go beyond the basic tests.

First off, make sure it's running:
smartctl -s on -A /dev/disk0   # for each drive, using the appropriate /dev/..


Then, after it's accumulated some stats:
smartctl -A /dev/disk0

For a lot of drives, the first line (raw read errors) can be very telling: 
anything other than 0 suggests your disk is failing.
Start-up time can be telling too, if it's increasing.

The thing is that most drives, except those designed for use in RAID arrays, 
mask impending disk failures by re-reading blocks multiple times. 
They often get the data eventually, but your machine keeps getting 
slower and slower.

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.    Yogi Berra



Re: impending disk failure?

2015-10-17 Thread Dominique Dumont
On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:
> Can anyone please explain what it means, and whether I should be worried?

You should check the drive with smartctl.

See http://www.smartmontools.org/

HTH

-- 
 https://github.com/dod38fr/   -o- http://search.cpan.org/~ddumont/
http://ddumont.wordpress.com/  -o-   irc: dod at irc.debian.org



Re: impending disk failure?

2015-10-17 Thread tomas
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Sat, Oct 17, 2015 at 02:15:52PM +0100, Tony van der Hoff wrote:
> Hi,
> 
> I'm occasionally getting this message in syslog on my jessie box:
> 
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600489] ata3.00: exception
> Emask 0x10 SAct 0x10 SErr 0x40 action 0x6 frozen
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600501] ata3.00: irq_stat
> 0x0800, interface fatal error
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600509] ata3: SError: { Handshk }
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600517] ata3.00: failed
> command: WRITE FPDMA QUEUED
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600531] ata3.00: cmd
> 61/c0:20:90:6e:eb/01:00:22:00:00/40 tag 4 ncq 229376 out
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600531]  res
> 40/00:20:90:6e:eb/00:00:22:00:00/40 Emask 0x10 (ATA bus error)
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600538] ata3.00: status: { DRDY }
> Oct 17 12:00:19 tony-lx kernel: [ 8838.600547] ata3: hard resetting link
> Oct 17 12:00:20 tony-lx kernel: [ 8839.092521] ata3: SATA link up
> 6.0 Gbps (SStatus 133 SControl 300)
> Oct 17 12:00:20 tony-lx kernel: [ 8839.099092] ata3.00: configured
> for UDMA/133
> Oct 17 12:00:20 tony-lx kernel: [ 8839.099121] ata3: EH complete

Hmpf. Could be anything. But as a first shot I'd try unseating and
re-seating the disk's SATA cable (or exchanging it). And before and
above everything: make frequent backups!

regards
- -- tomás
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlYiV+MACgkQBcgs9XrR2kZ5pwCfdopAWMpZYyEUzxD0Bq6gYMmi
SdQAnRfbuxXBXZl+r+pExmSEzb4bykeZ
=jdUX
-END PGP SIGNATURE-



impending disk failure?

2015-10-17 Thread Tony van der Hoff

Hi,

I'm occasionally getting this message in syslog on my jessie box:

Oct 17 12:00:19 tony-lx kernel: [ 8838.600489] ata3.00: exception Emask 
0x10 SAct 0x10 SErr 0x40 action 0x6 frozen
Oct 17 12:00:19 tony-lx kernel: [ 8838.600501] ata3.00: irq_stat 
0x0800, interface fatal error

Oct 17 12:00:19 tony-lx kernel: [ 8838.600509] ata3: SError: { Handshk }
Oct 17 12:00:19 tony-lx kernel: [ 8838.600517] ata3.00: failed command: 
WRITE FPDMA QUEUED
Oct 17 12:00:19 tony-lx kernel: [ 8838.600531] ata3.00: cmd 
61/c0:20:90:6e:eb/01:00:22:00:00/40 tag 4 ncq 229376 out
Oct 17 12:00:19 tony-lx kernel: [ 8838.600531]  res 
40/00:20:90:6e:eb/00:00:22:00:00/40 Emask 0x10 (ATA bus error)

Oct 17 12:00:19 tony-lx kernel: [ 8838.600538] ata3.00: status: { DRDY }
Oct 17 12:00:19 tony-lx kernel: [ 8838.600547] ata3: hard resetting link
Oct 17 12:00:20 tony-lx kernel: [ 8839.092521] ata3: SATA link up 6.0 
Gbps (SStatus 133 SControl 300)
Oct 17 12:00:20 tony-lx kernel: [ 8839.099092] ata3.00: configured for 
UDMA/133

Oct 17 12:00:20 tony-lx kernel: [ 8839.099121] ata3: EH complete


Can anyone please explain what it means, and whether I should be worried?

Thanks,
--
Tony van der Hoff| mailto:t...@vanderhoff.org
Buckinghamshire, England |



Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-12 Thread David Christensen

On 06/12/2015 12:45 AM, Peter Viskup wrote:

Always consider using ddrescue [1] instead of dd, especially when you are
not sure about the state of the drive.
ddrescue takes a 'dd'-style image of the drive, but will skip any
areas where a read returns an error. Standard 'dd' will try to
continuously re-read such an area, which could cause more damage.


Yes, that sounds like just the tool for the job!  :-)


David





--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: https://lists.debian.org/557b86a1.70...@holgerdanske.com



Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-12 Thread Peter Viskup
Always consider using ddrescue [1] instead of dd, especially when you are
not sure about the state of the drive.
ddrescue takes a 'dd'-style image of the drive, but will skip any
areas where a read returns an error. Standard 'dd' will try to
continuously re-read such an area, which could cause more damage.
Have fun! ;-)

[1] http://www.gnu.org/software/ddrescue/
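The behaviour described above (read what you can, record and skip what errors out, instead of hammering on a bad area) can be sketched in a few lines of Python. `read_block` stands in for a device read, and the list of bad blocks is a much-simplified version of ddrescue's mapfile:

```python
# Sketch of ddrescue's first-pass strategy: copy readable blocks, skip
# (and record) blocks whose read fails, never retry on the first pass.
# `read_block` is a stand-in for reading from a real device.

BLOCK_SIZE = 512

def rescue_copy(read_block, n_blocks):
    """Return (image bytes, list of bad block numbers)."""
    image = bytearray(n_blocks * BLOCK_SIZE)   # skipped blocks stay zeroed
    bad = []
    for i in range(n_blocks):
        try:
            image[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] = read_block(i)
        except IOError:
            bad.append(i)          # note it and move on -- no re-reads
    return bytes(image), bad

def fake_read(i):
    """Simulated device: block 3 is unreadable."""
    if i == 3:
        raise IOError("media error")
    return bytes([65]) * BLOCK_SIZE

image, bad = rescue_copy(fake_read, 5)
print(bad)          # -> [3]
print(len(image))   # -> 2560
```

Real ddrescue then makes further passes over the recorded bad areas; the point stands either way: don't let retries on dying media block the copy of the data that is still good.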

-- 
Peter

On Fri, Jun 12, 2015 at 1:20 AM, David Christensen <
dpchr...@holgerdanske.com> wrote:

> On 06/11/2015 12:32 AM, Alejandro Exojo wrote:
>
>> Yesterday I found out that my extra disk shut down. I don't know what
>> steps to
>> follow from now on. I'm searching online about the error as I found in the
>> logs, and I don't know what steps to follow.
>>
> ...
>
>> I don't know where to proceed from here. The error seems hardware, but
>> I'm not
>> totally sure. After that, what should I try to do to recover as much as
>> possible? I'm reading about ddrescue now.
>> I don't have space in the other partitions to hold all the data in the
>> failed
>> disk, but I'm only interested in recovering some parts of it as safely as
>> possible. Should I just buy a new disk, try to replicate the original one
>> there, and find out which files are damaged? Or should I create an image
>> as a
>> file stored somewhere else?
>>
>
> 1.  Buy a large disk that you can use for backups.  I use 3 TB Seagate
> ST3000DM001 because they have the best gigabyte/dollar ratio that I am
> aware of.
>
> 2.  Try to mount the file system and backup your files.  If you can't get
> the filesystem mounted, copy the raw disk image to a file using 'dd'.  You
> might have to get the image in pieces using the 'skip' and 'seek' options.
>
> 3.  Download the disk drive manufacturer's diagnostic toolset and run it.
> For example:
>
> http://www.seagate.com/support/downloads/seatools/
>
>
> David
>
>
>
>


Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-11 Thread Dominique Dumont
On Thursday 11 June 2015 23:46:02 Alejandro Exojo wrote:
> This is the whole smartctl output:
> 
> http://paste.debian.net/220687/
> 
> Can I understand the following line as that the disk might be fine?
> 
> SMART overall-health self-assessment test result: PASSED

Details show that no bad sectors were found (Current_Pending_Sector and 
Reallocated_Sector_Ct are both 0). 

UDMA_CRC_Error_Count is also 0, which means that no errors were detected during 
transfer on the SATA cable.

HTH

-- 
 https://github.com/dod38fr/   -o- http://search.cpan.org/~ddumont/
http://ddumont.wordpress.com/  -o-   irc: dod at irc.debian.org





Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-11 Thread David Christensen

On 06/11/2015 12:32 AM, Alejandro Exojo wrote:

Yesterday I found out that my extra disk shut down. I don't know what steps to
follow from now on. I'm searching online about the error as I found in the
logs, and I don't know what steps to follow.

...

I don't know where to proceed from here. The error seems hardware, but I'm not
totally sure. After that, what should I try to do to recover as much as
possible? I'm reading about ddrescue now.
I don't have space in the other partitions to hold all the data in the failed
disk, but I'm only interested in recovering some parts of it as safely as
possible. Should I just buy a new disk, try to replicate the original one
there, and find out which files are damaged? Or should I create an image as a
file stored somewhere else?


1.  Buy a large disk that you can use for backups.  I use 3 TB Seagate 
ST3000DM001 because they have the best gigabyte/dollar ratio that I am 
aware of.


2.  Try to mount the file system and backup your files.  If you can't 
get the filesystem mounted, copy the raw disk image to a file using 
'dd'.  You might have to get the image in pieces using the 'skip' and 
'seek' options.
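The "get the image in pieces" suggestion maps dd's `skip`/`seek` options onto plain file offsets. Here is a minimal Python sketch of the same idea; the 1 MiB piece size and file names are arbitrary, and a scratch file stands in for the failing device:

```python
# Sketch of imaging a disk in pieces, mirroring dd's skip=/seek= options:
# read `count` blocks starting `skip` blocks into the source, and write
# them `seek` blocks into the destination. A scratch file stands in for
# the failing device (/dev/sdX in real use).
import os

BS = 1024 * 1024   # 1 MiB block size, like dd's bs=1M

def copy_piece(src_path, dst_path, count, skip=0, seek=0):
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(skip * BS)
        dst.seek(seek * BS)
        for _ in range(count):
            dst.write(src.read(BS))

# Build a 4 MiB stand-in "device" and an empty image file.
with open("fake_disk.img", "wb") as f:
    f.write(b"\x01" * (4 * BS))
open("rescued.img", "wb").close()

copy_piece("fake_disk.img", "rescued.img", count=2)                  # first 2 MiB
copy_piece("fake_disk.img", "rescued.img", count=2, skip=2, seek=2)  # next 2 MiB

print(os.path.getsize("rescued.img"))   # -> 4194304
```

With real dd you would do the equivalent with `skip=`/`seek=` (and `conv=noerror,sync,notrunc`) so a piece that fails can be retried or narrowed without restarting the whole copy.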


3.  Download the disk drive manufacturer's diagnostic toolset and run 
it.  For example:


http://www.seagate.com/support/downloads/seatools/


David





Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-11 Thread Alejandro Exojo
El Thursday 11 June 2015, Ric Moore escribió:
> On 06/11/2015 03:32 AM, Alejandro Exojo wrote:
>   Or should I create an image as a
> 
> > file stored somewhere else?
> 
> Just for grins, unplug the connector(s) from the drive AND at the
> motherboard, both. Plug it all back in again. That has worked for me
> more than once, and I replaced those cables after. Ric

I've been trying pretty much that!

The disk was in a slot within the case, with a certain "box" and a connector 
that adapted power and data. I changed its position in the box, so the cable 
is now less twisted, and I've been able to start recovering data. At one point 
it printed some errors in the log:

http://paste.debian.net/220673/

That said, I still don't understand at all whether the disk can be considered 
problematic or not. The smartctl output is quite incomprehensible to me. :-/

This is the whole smartctl output:

http://paste.debian.net/220687/

Can I understand the following line as that the disk might be fine?

SMART overall-health self-assessment test result: PASSED


Thank you.

-- 
Alex (a.k.a. suy) | GPG ID 0x0B8B0BC2
http://barnacity.net/ | http://disperso.net





Re: Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-11 Thread Ric Moore

On 06/11/2015 03:32 AM, Alejandro Exojo wrote:
 Or should I create an image as a

file stored somewhere else?


Just for grins, unplug the connector(s) from the drive AND at the 
motherboard, both. Plug it all back in again. That has worked for me 
more than once, and I replaced those cables after. Ric



--
My father, Victor Moore (Vic) used to say:
"There are two Great Sins in the world...
..the Sin of Ignorance, and the Sin of Stupidity.
Only the former may be overcome." R.I.P. Dad.
http://linuxcounter.net/user/44256.html





Disk failure, XFS shutting down, trying to recover as much as possible

2015-06-11 Thread Alejandro Exojo
Hello.

Yesterday I found out that my extra disk shut down. I'm searching online about 
the error I found in the logs, but I don't know what steps to follow from here.

This is the log (I trimmed what I think was irrelevant):

http://paste.debian.net/220272/

In particular, I think this is the most significant:

end_request: I/O error, dev sdc, sector 3297507911
XFS (sdc1): metadata I/O error: block 0x7477cb18 ("xlog_iodone") error 5 
numblks 64
XFS (sdc1): xfs_do_force_shutdown(0x2) called from line 1172 of file 
/build/linux-cLkxwy/linux-3.16.7-ckt9/fs/xfs/xfs_l
XFS (sdc1): Log I/O Error Detected.  Shutting down filesystem
XFS (sdc1): Please umount the filesystem and rectify the problem(s)
XFS (sdc1): metadata I/O error: block 0xc48bfa08 ("xfs_trans_read_buf_map") 
error 5 numblks 16
XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (sdc1): xfs_log_force: error 5 returned.


I've run smartctl, and the output doesn't look promising at all:


walt:~# smartctl -T permissive -x /dev/sdc
Vendor:   /2:0:0:0
Product:  
User Capacity:600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
Log Sense failed, IE page [scsi response fails sanity test]

Error Counter logging not supported
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
Device does not support Self Test logging
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [scsi response fails sanity test]

walt:~# smartctl -T permissive -x /dev/sdc1
Short INQUIRY response, skip product id
SMART Health Status: OK
Read defect list: asked for grown list but didn't get it

Error Counter logging not supported
Device does not support Self Test logging
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [scsi response fails sanity test]



I don't know where to proceed from here. The error seems hardware, but I'm not 
totally sure. After that, what should I try to do to recover as much as 
possible? I'm reading about ddrescue now.

I don't have space in the other partitions to hold all the data in the failed 
disk, but I'm only interested in recovering some parts of it as safely as 
possible. Should I just buy a new disk, try to replicate the original one 
there, and find out which files are damaged? Or should I create an image as a 
file stored somewhere else?

Thank you.

-- 
Alex (a.k.a. suy) | GPG ID 0x0B8B0BC2
http://barnacity.net/ | http://disperso.net





raid1 issue after disk failure: both disks of the array are still active

2012-09-12 Thread Niccolò Belli

Hi,
I have a raid1 array with two disks, distro is Squeeze amd64. /dev/sda 
is slowly dying, here is a snippet of "smartctl -a /dev/sda":


197 Current_Pending_Sector  0x0012   100   100   000Old_age   Always 
  -   2
198 Offline_Uncorrectable   0x0030   100   100   000Old_age 
Offline  -   1


The bad sector is in the second half-MB of the disk, in fact with "dd 
if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1" I get this output in 
/var/log/syslog:


root@asterisk:~# dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1
0+1 records in
0+1 records out
430140 bytes (430 kB) copied, 11.7265 s, 36.7 kB/s

Sep 12 22:15:02 asterisk kernel: [ 8921.561978] dd: sending ioctl 
80306d02 to a partition!
Sep 12 22:15:02 asterisk kernel: [ 8921.561986] dd: sending ioctl 
80306d02 to a partition!
Sep 12 22:15:03 asterisk kernel: [ 8922.529099] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:03 asterisk kernel: [ 8922.531774] ata3.00: BMDMA stat 0x44
Sep 12 22:15:03 asterisk kernel: [ 8922.533547] ata3.00: failed command: 
READ DMA
Sep 12 22:15:03 asterisk kernel: [ 8922.535313] ata3.00: cmd 
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:03 asterisk kernel: [ 8922.535316]  res 
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:03 asterisk kernel: [ 8922.538891] ata3.00: status: { DRDY 
ERR }

Sep 12 22:15:03 asterisk kernel: [ 8922.540675] ata3.00: error: { UNC }
Sep 12 22:15:04 asterisk kernel: [ 8923.508206] ata3.00: configured for 
UDMA/133

Sep 12 22:15:04 asterisk kernel: [ 8923.508220] ata3: EH complete
Sep 12 22:15:05 asterisk kernel: [ 8924.469512] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:05 asterisk kernel: [ 8924.472323] ata3.00: BMDMA stat 0x44
Sep 12 22:15:05 asterisk kernel: [ 8924.475260] ata3.00: failed command: 
READ DMA
Sep 12 22:15:05 asterisk kernel: [ 8924.477023] ata3.00: cmd 
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:05 asterisk kernel: [ 8924.477025]  res 
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:05 asterisk kernel: [ 8924.480595] ata3.00: status: { DRDY 
ERR }

Sep 12 22:15:05 asterisk kernel: [ 8924.482370] ata3.00: error: { UNC }
Sep 12 22:15:06 asterisk kernel: [ 8925.452209] ata3.00: configured for 
UDMA/133

Sep 12 22:15:06 asterisk kernel: [ 8925.452224] ata3: EH complete
Sep 12 22:15:07 asterisk kernel: [ 8926.418504] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:07 asterisk kernel: [ 8926.420741] ata3.00: BMDMA stat 0x44
Sep 12 22:15:07 asterisk kernel: [ 8926.422486] ata3.00: failed command: 
READ DMA
Sep 12 22:15:07 asterisk kernel: [ 8926.424279] ata3.00: cmd 
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:07 asterisk kernel: [ 8926.424281]  res 
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:07 asterisk kernel: [ 8926.427861] ata3.00: status: { DRDY 
ERR }

Sep 12 22:15:07 asterisk kernel: [ 8926.429660] ata3.00: error: { UNC }
Sep 12 22:15:08 asterisk kernel: [ 8927.396270] ata3.00: configured for 
UDMA/133

Sep 12 22:15:08 asterisk kernel: [ 8927.396285] ata3: EH complete
Sep 12 22:15:09 asterisk kernel: [ 8928.359173] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:09 asterisk kernel: [ 8928.361647] ata3.00: BMDMA stat 0x44
Sep 12 22:15:09 asterisk kernel: [ 8928.364273] ata3.00: failed command: 
READ DMA
Sep 12 22:15:09 asterisk kernel: [ 8928.366028] ata3.00: cmd 
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:09 asterisk kernel: [ 8928.366030]  res 
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:09 asterisk kernel: [ 8928.369643] ata3.00: status: { DRDY 
ERR }

Sep 12 22:15:09 asterisk kernel: [ 8928.371420] ata3.00: error: { UNC }
Sep 12 22:15:10 asterisk kernel: [ 8929.340218] ata3.00: configured for 
UDMA/133

Sep 12 22:15:10 asterisk kernel: [ 8929.340233] ata3: EH complete
Sep 12 22:15:11 asterisk kernel: [ 8930.332648] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:11 asterisk kernel: [ 8930.334453] ata3.00: BMDMA stat 0x44
Sep 12 22:15:11 asterisk kernel: [ 8930.336245] ata3.00: failed command: 
READ DMA
Sep 12 22:15:11 asterisk kernel: [ 8930.337995] ata3.00: cmd 
c8/00:08:48:0f:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Sep 12 22:15:11 asterisk kernel: [ 8930.337998]  res 
51/40:00:48:0f:00/40:00:00:00:00/e0 Emask 0x9 (media error)
Sep 12 22:15:11 asterisk kernel: [ 8930.341583] ata3.00: status: { DRDY 
ERR }

Sep 12 22:15:11 asterisk kernel: [ 8930.343360] ata3.00: error: { UNC }
Sep 12 22:15:12 asterisk kernel: [ 8931.344205] ata3.00: configured for 
UDMA/133

Sep 12 22:15:12 asterisk kernel: [ 8931.344220] ata3: EH complete
Sep 12 22:15:13 asterisk kernel: [ 8932.306376] ata3.00: exception Emask 
0x0 SAct 0x0 SErr 0x0 action 0x0

Sep 12 22:15:13 asterisk kernel: [ 8932.308201] a

Re: is this hard disk failure?

2011-06-09 Thread Aenn Seidhe Priest
Looks like controller failure or a broken pin/wire in the cable (more
likely).

On 09.06.2011 at 20:14 lee wrote:

>surreal  writes:
>
>> Since this morning I have been getting strange system messages when
>> starting the computer.
>>
>> I typed dmesg and found these messages
>>
>> [  304.694936] ata4.00: status: { DRDY ERR }
>> [  304.694939] ata4.00: error: { ICRC ABRT }
>> [  304.694954] ata4: soft resetting link
>> [  304.938280] ata4.00: configured for UDMA/33
>> [  304.938293] ata4: EH complete
>> [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0
action 0x6
>> [  304.970873] ata4.00: BMDMA stat 0x26
>> [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0
>> dma 28672 in
>> [  304.970887]          res 51/84:18:16:2b:94/84:00:15:00:00/e0
>> Emask 0x30 (host bus error)
>>
>> What do these messages mean? What is the solution to prevent these
>messages from appearing? Help!
>
>This doesn´t look like the usual hardware error from a broken
hard disk:
>When a disk is broken, you usually get messages about sector errors.
>
>I would check all the connections (power and SATA) and try new
>cables. If the problem doesn´t go away, it can be anything, like
the
>firmware of the drive, a problem with your mainboard, a problem with
>your power supply: Backup the data, replace the drive and see if the
new
>one also shows errors like the above.
>
>
>--
>To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
>with a subject of "unsubscribe". Trouble? Contact
>listmas...@lists.debian.org
>Archive: http://lists.debian.org/87mxhq98ju@yun.yagibdah.de




--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/201106091817170078.1d84a...@portafi.com



Re: is this hard disk failure?

2011-06-09 Thread Nico Kadel-Garcia
On Tue, Jun 7, 2011 at 9:02 AM, Miles Fidelman
 wrote:
> Ralf Mardorf wrote:
>>
>> For me a hard disc never gets broken without click-click-click noise
>> before it failed, but it's very common that cables and connections fail.
>>
>>
>
> By the time a disk gets to the click-click-click phase, there has been LOTS
> of warning - it's just that today's disks include lots of internal
> fault-recovery mechanisms that hide things from you, unless you run SMART
> diagnostics (and not just the basic "smart status" either).

This is not borne out by my experience, or by Google's 2007 white paper
on the subject. See this study:

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf

The upshot is that "smart" monitoring is nowhere near 100% reliable;
you're lucky if it catches even half of your drive failures in time to
do anything besides rely on backups or on the rest of your RAID.


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: 
http://lists.debian.org/banlktikv12enqg6iud-vyhhew6jx9u6...@mail.gmail.com



Re: is this hard disk failure?

2011-06-09 Thread lee
surreal  writes:

>>From today morning i am getting strange kind of system messages on starting 
>>the computer..
>
> I typed dmesg and found these messages
>
> [  304.694936] ata4.00: status: { DRDY ERR }
> [  304.694939] ata4.00: error: { ICRC ABRT }
> [  304.694954] ata4: soft resetting link
> [  304.938280] ata4.00: configured for UDMA/33
> [  304.938293] ata4: EH complete
> [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [  304.970873] ata4.00: BMDMA stat 0x26
> [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma 
> 28672 in
> [  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 
> (host bus error)
>
> What do these messages mean? What is the solution to prevent these messages 
> from appearing? Help!

This doesn´t look like the usual hardware error from a broken hard disk:
When a disk is broken, you usually get messages about sector errors.

I would check all the connections (power and SATA) and try new
cables. If the problem doesn´t go away, it can be anything, like the
firmware of the drive, a problem with your mainboard, a problem with
your power supply: Backup the data, replace the drive and see if the new
one also shows errors like the above.


--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/87mxhq98ju@yun.yagibdah.de



Re: is this hard disk failure?

2011-06-09 Thread Scott Ferguson
On 09/06/11 13:46, Ron Johnson wrote:
> On 06/07/2011 08:02 AM, Miles Fidelman wrote:
> [snip]
>>
>> - install SMART utilities and run "smartctl -A /dev/ -- the
>> first line is usually the "raw read error" rate -- if the value (last
>> entry on the line) is anything except 0, that's the sign that your drive
>> is failing, if it's in the 1000s, failure is imminent, it's just that
>> your drive's internal software is hiding it from you - replace it!
>>
> 
> Then why does smartctl give my disk a green light?
> 
> http://members.cox.net/ron.l.johnson/smart_window.png
> 

Is that a TravelStar?
Try running the extended tests and setting it for offline data
collection. I've got two "factory refurbished" ones that show 0 where
yours shows a scary 589825. That mine had to be refurbished means they
were sent back...and I've heard stories of hundreds sent back to the
factory when a rollout of Ipex boxes found 1 in 5 were dying during the
initial imaging.

What is the raw value of the Reallocated Event Count?

Cheers

-- 
Tuttle? His name's Buttle.
There must be some mistake.
Mistake? [Chuckles]
We don't make mistakes.


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4df07948.9080...@gmail.com



Re: is this hard disk failure?

2011-06-08 Thread Miles Fidelman

Ron Johnson wrote:

On 06/07/2011 08:02 AM, Miles Fidelman wrote:
[snip]


- install SMART utilities and run "smartctl -A /dev/ -- the
first line is usually the "raw read error" rate -- if the value (last
entry on the line) is anything except 0, that's the sign that your drive
is failing, if it's in the 1000s, failure is imminent, it's just that
your drive's internal software is hiding it from you - replace it!



Then why does smartctl give my disk a green light?

http://members.cox.net/ron.l.johnson/smart_window.png


Well... smartctl isn't giving you the green light; it's your GUI that's 
interpreting the numbers as a "green light".


Personally, that raw read error rate would scare me, particularly in 
such a young drive.


Miles




--
In theory, there is no difference between theory and practice.
In  practice, there is.    Yogi Berra



--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: http://lists.debian.org/4df05d4a.4050...@meetinghouse.net



Re: is this hard disk failure?

2011-06-08 Thread Ron Johnson

On 06/07/2011 08:02 AM, Miles Fidelman wrote:
[snip]


- install SMART utilities and run "smartctl -A /dev/ -- the
first line is usually the "raw read error" rate -- if the value (last
entry on the line) is anything except 0, that's the sign that your drive
is failing, if it's in the 1000s, failure is imminent, it's just that
your drive's internal software is hiding it from you - replace it!



Then why does smartctl give my disk a green light?

http://members.cox.net/ron.l.johnson/smart_window.png

--
"Neither the wisest constitution nor the wisest laws will secure
the liberty and happiness of a people whose manners are universally
corrupt."
Samuel Adams, essay in The Public Advertiser, 1749


--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: http://lists.debian.org/4df04225.4090...@cox.net



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Henrique de Moraes Holschuh wrote

Re. tuning:  How?  I've tried to find ways to get md to track
timeouts, and never been able to find any relevant parameters.
 

It is not in md.  It is in the libata/scsi layer.  Just tune the per-device
parameters, e.g. in /sys/block/sda/device/*

AFAIK, if libata doesn't time out the device, md won't drop it off the
array.

   


Ahhh Thanks!

--
In theory, there is no difference between theory and practice.
In  practice, there is.    Yogi Berra



--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: http://lists.debian.org/4dee53ae.2050...@meetinghouse.net



Re: is this hard disk failure?

2011-06-07 Thread Henrique de Moraes Holschuh
On Tue, 07 Jun 2011, Miles Fidelman wrote:
> >Linux software raid is much more forgiving by default (and it can tune
> >the timeout for each component device separately), and will just slow
> >down most of the time instead of kicking component devices off the array
> >until dataloss happens.  Could be useful if you got duped by the vendor
> >and sold a defective drive that can only operate safely out-of-spec, but
> >can still be useful to you.
> 
> Not necessarily the best strategy if you have enough drives to
> survive 2 drive failures.  Sometimes better to have a drive drop out
> of the array and trigger an alarm than to have a system slow to a
> crawl precipitously (particularly as that makes it hard to run
> diagnostics to figure out which drive is bad).

YMMV.  I'd never do that in a RAID array with important data in it.

External events that cause non-ERC disks to time out CAN and DO happen to
the entire set of disks in the same enclosure (such as impact vibrations
from nearby equipment or from the floor).  It is a known problem in
datacenters, but it can happen at home as well when a large truck passes
close by, or someone bumps into the shelf/table/rack :-)

If enough of those devices go over the timeout threshold because of the
external event (a threshold which is rather spartan by default on most
hardware RAID cards), the array goes offline and data loss can happen.

Worse, rebuilding a degraded array will exercise the array at the time it
is most vulnerable; it is not a safe operation unless you're rebuilding an
already redundant array (which is one of the reasons why RAID6 or anything
N+2 or above is a good idea).  This is why you have to regularly scrub the
array at off-peak hours or as a background operation.

> Re. tuning:  How?  I've tried to find ways to get md to track
> timeouts, and never been able to find any relevant parameters.

It is not in md.  It is in the libata/scsi layer.  Just tune the per-device
parameters, e.g. in /sys/block/sda/device/*

AFAIK, if libata doesn't time out the device, md won't drop it off the
array.
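A minimal sketch of that per-device tuning (the device name "sda" and the
7-second value are illustrative; the path is the standard libata/scsi
sysfs layout, and writing it needs root):

```shell
# set_ata_timeout ROOT DEV SECONDS -- shorten the SCSI command timeout
# for one device. ROOT is normally /sys; it is a parameter only so the
# logic can be exercised against a fake sysfs tree.
set_ata_timeout() {
    root=$1; dev=$2; secs=$3
    t="$root/block/$dev/device/timeout"
    [ -w "$t" ] || { echo "cannot write $t (not root, or no such device?)" >&2; return 1; }
    echo "old timeout: $(cat "$t")s"
    echo "$secs" > "$t"          # kernel fails the command after this many seconds
    echo "new timeout: $(cat "$t")s"
}

# On a real box, as root:  set_ata_timeout /sys sda 7
```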

> Queries to the linux-raid list have yielded some fairly definitive
> sounding statements, from folks who should know, that md doesn't
> have any such timeouts.  If they're there, please.. more
> information!

md doesn't track performance (much, if at all), and it does not do even a
decent job of scheduling reads/writes over multiple md devices that have
components sharing the same physical device.   It is quite simple (but not
to the point of being brain-dead like dm-raid).

OTOH, md really is a separate layer on top of the component devices. You can
smart-test and performance-test the component devices, change their
libata/scsi layer parameters...

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110607160627.gd1...@khazad-dum.debian.net



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Henrique de Moraes Holschuh wrote:

On Tue, 07 Jun 2011, Miles Fidelman wrote:
   

b. you're running RAID - instead of the drive dropping out of the
array, the entire array slows down as it waits for the failing drive
to (eventually) respond
 



Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately), and will just slow
down most of the time instead of kicking component devices off the array
until dataloss happens.  Could be useful if you got duped by the vendor
and sold a defective drive that can only operate safely out-of-spec, but
can still be useful to you.
   


Not necessarily the best strategy if you have enough drives to survive 2 
drive failures.  Sometimes better to have a drive drop out of the array 
and trigger an alarm than to have a system slow to a crawl precipitously 
(particularly as that makes it hard to run diagnostics to figure out 
which drive is bad).


Re. tuning:  How?  I've tried to find ways to get md to track timeouts, 
and never been able to find any relevant parameters.  Queries to the 
linux-raid list have yielded some fairly definitive sounding statements, 
from folks who should know, that md doesn't have any such timeouts.  If 
they're there, please.. more information!







--
In theory, there is no difference between theory and practice.
In  practice, there is.    Yogi Berra



--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: http://lists.debian.org/4dee4643.3060...@meetinghouse.net



Re: is this hard disk failure?

2011-06-07 Thread Henrique de Moraes Holschuh
On Tue, 07 Jun 2011, Miles Fidelman wrote:
> b. you're running RAID - instead of the drive dropping out of the
> array, the entire array slows down as it waits for the failing drive
> to (eventually) respond

Eh, it is worse.

A failing drive _will_ drop out of the array sooner or later, and it can
be very bad if it does so 'sooner' for any reason other than an
imminent unit failure:  there is a high probability of other device(s)
deciding to also time out while the array is degraded or rebuilding, and
that results in service downtime (and usually data loss).

You never want discs dropping off the array due to
non-immediate-failure-related performance problems, the chance of
multiple drops causing an array failure is too high.  You want to know
the disk is slow, and to replace it in controlled conditions.

This problem is *common*.  Don't do hardware RAID on regular consumer
crap without SCT ERC support (aka TLER/CCTL/ERC), and don't buy
expensive crap with buggy firmware that the vendor refuses to issue a
public fix for to save face (but which you can get from your RAID card
vendor if you are very lucky).  Linux smartctl gives you access to the
drive's SCT ERC page if it is supported.
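For reference, a sketch of what that SCT ERC check looks like (the
smartctl output below is a captured sample, not generated live; the
`-l scterc` syntax is smartmontools'):

```shell
# Sample output of: smartctl -l scterc /dev/sda
# (ERC values are reported in tenths of a second)
scterc_report='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)'

# To set a 7s limit on a drive that supports it:
#   smartctl -l scterc,70,70 /dev/sda
echo "$scterc_report" | awk '/Read:/  {print "read ERC limit: "  $2/10 "s"}
                             /Write:/ {print "write ERC limit: " $2/10 "s"}'
```

A drive that does not support SCT ERC at all is the "regular consumer
crap" case above: it may retry internally far longer than a RAID
controller's timeout.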

Also, any device model (not a SPECIFIC device) for which firmware
updates are available that reduce the effective throughput should be
avoided like the plague, as that indicates they have shipped models with
manufacturing or component issues, and you can never be sure of what
you'll get when you buy a new one.

If you already have bought such a device with known high design or
manufacturing defects/weakness ratio, it depends on your luck whether
you got something good or a lemon.  If SMART finds *NO* issues (no
increasing high fly writes, no reallocated sectors grow), and throughput
tests show the expected response, you have a good one: be happy.

If either test shows any such issues, remove it from production.
Secure-erase it, apply any firmware updates if you want to use it as
throw-away backup media (make sure the data is encrypted), or send it
for recycling.

Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately), and will just slow
down most of the time instead of kicking component devices off the array
until dataloss happens.  Could be useful if you got duped by the vendor
and sold a defective drive that can only operate safely out-of-spec, but
can still be useful to you.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110607152700.gb1...@khazad-dum.debian.net



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 09:02 -0400, Miles Fidelman wrote:
> Ralf Mardorf wrote:
> > For me a hard disc never gets broken without click-click-click noise
> > before it failed, but it's very common that cables and connections fail.
> >
> >
> 
> By the time a disk gets to the click-click-click phase,

A phase everybody knows for modern HDDs :D, but it's possible to get data
even from a disk that won't release the heads anymore [1].
For the Atari I've got a 42MB SCSI disk connected to a Lacom adaptor; it
sometimes needs several boots, but it's unbreakable.

>  there has been 
> LOTS of warning - it's just that today's disks include lots of internal 
> fault-recovery mechanisms that hide things from you, unless you run 
> SMART diagnostics (and not just the basic "smart status" either).
> 
> For example, if you have a machine that's suddenly running VERY slowly

Correct! Likewise, if voodoo seems to be affecting your machine, it is
seldom voodoo, but a broken HDD.

>  - 
> it's good sign that a drive is experiencing internal read errors (unless 
> it's a laptop - a shorted battery is a good suspect).  Both are lessons 
> learned the hard way, and not forgotten.
> 
> Turns out that modern drives have onboard processors that retry reads 
> multiple times - good for protecting data if you only have the one copy 
> on that drive, at the expense of reduced disk access times.  Not so good if:
> 
> a. you don't notice that it's happening (the disk will eventually fail 
> hard), or,
> 
> b. you're running RAID - instead of the drive dropping out of the array, 
> the entire array slows down as it waits for the failing drive to 
> (eventually) respond
> 
> In either case, you'll tear your hair out trying to figure out why your 
> machine is running slowly  (is it a virus, a file lock that didn't 
> release, etc., etc., etc.).
> 
> Lessons learned:
> 
> - if your machine is running really slowly, try a reboot -- if it 
> reboots properly, but takes 2 times as long (or longer) to shutdown and 
> then come back up -- get very suspicious (if your patience lasts that long)
> 
> - if it's a laptop - pull the battery and try again - if everything is 
> normal, buy yourself a new battery
> 
> - if it's a server - try booting from a liveCD (if you can, first 
> disconnect the hard drive entirely) - if normal then you could well have 
> a hard drive problem (or you could have a virus)
> 
> - install SMART utilities and run "smartctl -A /dev/ -- the 
> first line is usually the "raw read error" rate -- if the value (last 
> entry on the line) is anything except 0, that's the sign that your drive 
> is failing, if it's in the 1000s, failure is imminent, it's just that 
> your drive's internal software is hiding it from you - replace it!
> 
> - if you're running RAID, be sure to purchase "enterprise" drives (where 
> "desktop" try very hard to read a sector, despite the delay; enterprise 
> drives give up quickly as they expect failure recovery to be handled by 
> RAID)
> 
> - you would expect software raid (md) to detect slow drives, mark them 
> bad, and drop them from an array -- nope, md does not keep track of delay
> 
> and, not really relevant for Debian, but a direct offshoot of learning 
> the above lessons:
> 
> - if you're running a Mac or Windows, you're system may be reporting 
> "smart status good" - but it's not really true - it's not looking at raw 
> read errors
> 
> - there seems to be a bug in the smart utilities for Mac (as available 
> through Macports and Fink) -- the smart daemon will fail periodically, 
> with the only symptom being that every few minutes, you're machine will 
> slow to a crawl (spinning beachball everywhere) for 30 seconds or so, 
> then recover --- a really good example of taking a pre-emptive measure 
> that causes a new problem (I can't tell you how long it took to track 
> this one down - what with downloading every performance tracking tool I 
> could find.)
> 
> 
> Miles Fidelman
> 
> -- 
> In theory, there is no difference between theory and practice.
> In  practice, there is.    Yogi Berra

My Samsung SATA drives have so far been without failure for a
suspiciously long time :). I turn the computer off and on very often.
The only weak point is the SATA connectors; a friend has already planned
to solder new SATA connectors onto his mobo. Note! Nobody without
experience in soldering multi-layer boards should attempt this. I
planned to do it too.

[1] When the heads aren't released anymore after the final click, there
is still a chance to get them working again.

- Remove the HDD from the case, keeping the power and data cables
connected.
- With a rubber-headed mallet or something similar, knock against the HDD
from several angles while rebooting again and again.
- If it doesn't work, repeat this after the HDD has rested for a week.
Dunno why this helps, but it does; perhaps different room temperatures
work like gnomes.

-- Ralf



Re: is this hard disk failure?

2011-06-07 Thread Miles Fidelman

Ralf Mardorf wrote:

For me a hard disc never gets broken without click-click-click noise
before it failed, but it's very common that cables and connections fail.

   


By the time a disk gets to the click-click-click phase, there has been 
LOTS of warning - it's just that today's disks include lots of internal 
fault-recovery mechanisms that hide things from you, unless you run 
SMART diagnostics (and not just the basic "smart status" either).


For example, if you have a machine that's suddenly running VERY slowly - 
it's a good sign that a drive is experiencing internal read errors (unless 
it's a laptop - a shorted battery is a good suspect).  Both are lessons 
learned the hard way, and not forgotten.


Turns out that modern drives have onboard processors that retry reads 
multiple times - good for protecting data if you only have the one copy 
on that drive, at the expense of reduced disk access times.  Not so good if:


a. you don't notice that it's happening (the disk will eventually fail 
hard), or,


b. you're running RAID - instead of the drive dropping out of the array, 
the entire array slows down as it waits for the failing drive to 
(eventually) respond


In either case, you'll tear your hair out trying to figure out why your 
machine is running slowly  (is it a virus, a file lock that didn't 
release, etc., etc., etc.).


Lessons learned:

- if your machine is running really slowly, try a reboot -- if it 
reboots properly, but takes 2 times as long (or longer) to shutdown and 
then come back up -- get very suspicious (if your patience lasts that long)


- if it's a laptop - pull the battery and try again - if everything is 
normal, buy yourself a new battery


- if it's a server - try booting from a liveCD (if you can, first 
disconnect the hard drive entirely) - if normal then you could well have 
a hard drive problem (or you could have a virus)


- install SMART utilities and run "smartctl -A /dev/ -- the 
first line is usually the "raw read error" rate -- if the value (last 
entry on the line) is anything except 0, that's the sign that your drive 
is failing, if it's in the 1000s, failure is imminent, it's just that 
your drive's internal software is hiding it from you - replace it!
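As a sketch, here is one way to pull that raw value out of the attribute
table (the table below is a shortened sample of `smartctl -A` output with
made-up raw values, not live data):

```shell
# Shortened sample of: smartctl -A /dev/sda  -- the last column is RAW_VALUE.
smart_table='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       1274
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'

# Real usage:  smartctl -A /dev/sda | awk '$2 == "Raw_Read_Error_Rate" {print $NF}'
echo "$smart_table" | awk '$2 == "Raw_Read_Error_Rate" {print $NF}'
```

(As noted elsewhere in this thread, a non-zero raw value is not a
reliable failure signal on every vendor's drives; treat it as a prompt to
run the extended tests, not as a verdict.)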


- if you're running RAID, be sure to purchase "enterprise" drives (where 
"desktop" try very hard to read a sector, despite the delay; enterprise 
drives give up quickly as they expect failure recovery to be handled by 
RAID)


- you would expect software raid (md) to detect slow drives, mark them 
bad, and drop them from an array -- nope, md does not keep track of delay


and, not really relevant for Debian, but a direct offshoot of learning 
the above lessons:


- if you're running a Mac or Windows, your system may be reporting 
"smart status good" - but it's not really true - it's not looking at raw 
read errors


- there seems to be a bug in the smart utilities for Mac (as available 
through MacPorts and Fink) -- the smart daemon will fail periodically, 
with the only symptom being that every few minutes, your machine will 
slow to a crawl (spinning beachball everywhere) for 30 seconds or so, 
then recover --- a really good example of taking a pre-emptive measure 
that causes a new problem (I can't tell you how long it took to track 
this one down - what with downloading every performance tracking tool I 
could find.)



Miles Fidelman

--
In theory, there is no difference between theory and practice.
In  practice, there is.    Yogi Berra



--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Archive: http://lists.debian.org/4dee217c.9020...@meetinghouse.net



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 13:59 +0200, Ralf Mardorf wrote:
> On Tue, 2011-06-07 at 11:46 +, Camaleón wrote:
> > It can be a bad cable -or bad connection-
> 
> For me a hard disc never gets broken without click-click-click noise
> before it failed, but it's very common that cables and connections fail.
> 
> A tip: If there's a warranty seal, don't break it; try to loosen it with
> a hairdryer. Then disconnect the cables and reconnect them.

PS: Back in the old Atari days we kept the seals intact and tore out the
screw under the seal by force. Not every seal can be removed unscathed
with a hairdryer, but usually not all screws are needed.



--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/1307448211.4467.40.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 11:46 +, Camaleón wrote:
> It can be a bad cable -or bad connection-

For me a hard disc never gets broken without click-click-click noise
before it fails, but it's very common for cables and connections to fail.

A tip: If there's a warranty seal, don't break it; try to loosen it with
a hairdryer. Then disconnect the cables and reconnect them.


--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/1307447981.4467.36.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Camaleón
On Tue, 07 Jun 2011 13:17:29 +0530, surreal wrote:

> From today morning i am getting strange kind of system messages on
> starting the computer..
> 
> I typed dmesg and found these messages
> 
> [  304.694936] ata4.00: status: { DRDY ERR } 
> [  304.694939] ata4.00: error: { ICRC ABRT } 
> [  304.694954] ata4: soft resetting link 

(...)

What do you have attached to that port (ata 4)? 

> What do these messages mean? What is the solution to prevent these
> messages from appearing? Help!

It can be a bad cable -or a bad connection- or even a kernel issue. I mean, 
it does not have to be a hard disk failure "per se". Anyway, running a 
smartctl long test won't hurt either.
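A sketch of kicking off that long test and reading its result later (the
self-test log line below is a sample; the real test runs on the drive in
the background and can take hours):

```shell
# Start the extended self-test (returns immediately; the drive runs it):
#   smartctl -t long /dev/sda
# Later, read the self-test log:
#   smartctl -l selftest /dev/sda

# Sample result line from the self-test log:
selftest_line='# 1  Extended offline    Completed without error       00%      8927         -'

case $selftest_line in
    *"Completed without error"*) echo "long test passed" ;;
    *)                           echo "long test found problems" ;;
esac
```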

Greetings,

-- 
Camaleón


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/pan.2011.06.07.11.46...@gmail.com



Re: is this hard disk failure?

2011-06-07 Thread Ralf Mardorf
On Tue, 2011-06-07 at 16:21 +0800, Ong Chin Kiat wrote:

> If you can get another hard disk to test, that will narrow down the
> possibilities
... and before doing this turn off power and disconnect and connect all
cables for this HDD on the HDD (power too) and on the mobo.

-- Ralf







-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/1307443287.4467.2.camel@debian



Re: is this hard disk failure?

2011-06-07 Thread Ong Chin Kiat
A couple of possibilities:
1. Hard disk is failing
2. Insufficient power available for your hard disk, causing it to spin up
then spin down again
3. Controller error
4. Faulty connection or SATA port

The more likely possibilities are 1 and 3.

If you can get another hard disk to test, that will narrow down the
possibilities.

On Tue, Jun 7, 2011 at 3:47 PM, surreal  wrote:

> >From today morning i am getting strange kind of system messages on
> starting the computer..
>
> I typed dmesg and found these messages
>
> [  304.694936] ata4.00: status: { DRDY ERR }
> [  304.694939] ata4.00: error: { ICRC ABRT }
> [  304.694954] ata4: soft resetting link
> [  304.938280] ata4.00: configured for UDMA/33
> [  304.938293] ata4: EH complete
> [  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [  304.970873] ata4.00: BMDMA stat 0x26
> [  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma
> 28672 in
> [  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30
> (host bus error)
> [  304.970891] ata4.00: status: { DRDY ERR }
> [  304.970895] ata4.00: error: { ICRC ABRT }
> [  304.970909] ata4: soft resetting link
> [  305.218280] ata4.00: configured for UDMA/33
> [  305.218296] ata4: EH complete
> [  305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> [  305.880385] ata4.00: BMDMA stat 0x26
> [  305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma
> 196608 in
> [  305.880399]  res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30
> (host bus error)
> [  305.880404] ata4.00: status: { DRDY ERR }
> [  305.880408] ata4.00: error: { ICRC ABRT }
> [  305.880423] ata4: soft resetting link
> [  306.126281] ata4.00: configured for UDMA/33
> [  306.126297] ata4: EH complete
>
>
> What do these messages mean? What is the solution to prevent these messages
> from appearing? Help!
>
> --
> Harshad Joshi
>
>
>


is this hard disk failure?

2011-06-07 Thread surreal
From today morning I am getting strange system messages when starting
the computer.

I typed dmesg and found these messages

[  304.694936] ata4.00: status: { DRDY ERR }
[  304.694939] ata4.00: error: { ICRC ABRT }
[  304.694954] ata4: soft resetting link
[  304.938280] ata4.00: configured for UDMA/33
[  304.938293] ata4: EH complete
[  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  304.970873] ata4.00: BMDMA stat 0x26
[  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma
28672 in
[  304.970887]  res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30
(host bus error)
[  304.970891] ata4.00: status: { DRDY ERR }
[  304.970895] ata4.00: error: { ICRC ABRT }
[  304.970909] ata4: soft resetting link
[  305.218280] ata4.00: configured for UDMA/33
[  305.218296] ata4: EH complete
[  305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  305.880385] ata4.00: BMDMA stat 0x26
[  305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma
196608 in
[  305.880399]  res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30
(host bus error)
[  305.880404] ata4.00: status: { DRDY ERR }
[  305.880408] ata4.00: error: { ICRC ABRT }
[  305.880423] ata4: soft resetting link
[  306.126281] ata4.00: configured for UDMA/33
[  306.126297] ata4: EH complete


What do these messages mean? What is the solution to prevent these messages
from appearing? Help!

-- 
Harshad Joshi


Re: was getting disk failure errors, repaired the sectors, now what?

2010-07-02 Thread lee
On Wed, Jun 30, 2010 at 05:34:22PM -0400, H.S. wrote:
> 
> So now I know that my backups most probably are not trustworthy, the
> ones from the last four or so days. No problem. I do rolling backups
> using cron and rsync. But what do I do now?

Now you buy at least two new disks, preferably ones rated for 24/7 use,
set them up as a RAID-1 (or RAID-5) with mdadm and copy your data onto
the RAID as best you can.

NEVER keep data on a single disk only.
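
A minimal sketch of the mdadm setup suggested above (device names, mount
point, and filesystem choice are all assumptions, not from the thread;
the commands need root and the mdadm package):

```shell
# Sketch only -- /dev/sdc and /dev/sdd stand in for the two new disks.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0                                # filesystem choice is an assumption
mkdir -p /mnt/raid && mount /dev/md0 /mnt/raid    # mount the new array
rsync -a /home/ /mnt/raid/home/                   # salvage what you can from the old disk
mdadm --detail --scan >> /etc/mdadm/mdadm.conf    # remember the array across reboots
```

RAID-1 mirrors every write to both disks, so a single drive failure no
longer costs you the data; it is still not a substitute for backups.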


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20100702110941.gk8...@yun.yagibdah.de



Re: was getting disk failure errors, repaired the sectors, now what?

2010-07-02 Thread Andrei Popescu
On Jo, 01 iul 10, 18:42:26, H.S. wrote:
> 
> Okay, did all these, but that set of file not found errors upon console
> login is still there.

They are probably gone. If you want to try to repair the system (versus 
reinstalling from scratch) you can just reinstall each package 
containing the missing files (dpkg -S or apt-file can help).

You might need the --force-confmiss option to dpkg.
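
The recovery loop described above, map each missing file to its owning
package and reinstall it, can be sketched like this. The sample `dpkg -S`
output (and the package name in it) is illustrative, not taken from the
thread, so the pipeline is visible even off a Debian box:

```shell
# Map missing files to the packages that own them, then reinstall those
# packages. The package name below is an assumption for the demo.
dpkg_s_output='update-notifier-common: /usr/lib/update-notifier/update-motd-cpu-checker
update-notifier-common: /usr/lib/update-notifier/update-motd-updates-available'
pkgs=$(printf '%s\n' "$dpkg_s_output" | cut -d: -f1 | sort -u)
echo "$pkgs"
# On the real system (as root), reinstall the affected packages:
#   apt-get install --reinstall $pkgs
# or, for a downloaded .deb with missing conffiles:
#   dpkg -i --force-confmiss package.deb
```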

Regards,
Andrei
-- 
Offtopic discussions among Debian users and developers:
http://lists.alioth.debian.org/mailman/listinfo/d-community-offtopic




Re: was getting disk failure errors, repaired the sectors, now what?

2010-07-01 Thread H.S.
On 01/07/10 09:43 AM, H.S. wrote:
> On 01/07/10 03:34 AM, Andrei Popescu wrote:
>>
>> Don't you have some method of checking the integrity of your backups?
>> (http://www.taobackup.com/integrity.html)
>>
>> It is considered that a modern drive developing bad sectors visible to 
>> the system[1] is not to be trusted.
>>
>> [1] drives are remapping bad sectors internally, until they run out of 
>> spare sectors.
>>
>>
>> This looks like filesystem corruption, did you fsck the drive/partition?
>>
> 
> All good points. I haven't tried these ... which I will do now.
>

Okay, did all these, but that set of file not found errors upon console
login is still there.




-- 

Please reply to this list only. I read this list on its corresponding
newsgroup on gmane.org. Replies sent to my email address are just
filtered to a folder in my mailbox and get periodically deleted without
ever having been read.





Re: was getting disk failure errors, repaired the sectors, now what?

2010-07-01 Thread H.S.
On 01/07/10 03:34 AM, Andrei Popescu wrote:
> 
> Don't you have some method of checking the integrity of your backups?
> (http://www.taobackup.com/integrity.html)
> 
> It is considered that a modern drive developing bad sectors visible to 
> the system[1] is not to be trusted.
> 
> [1] drives are remapping bad sectors internally, until they run out of 
> spare sectors.
> 
> 
> This looks like filesystem corruption, did you fsck the drive/partition?
> 

All good points. I haven't tried these ... which I will do now.

Thanks!




-- 

Please reply to this list only. I read this list on its corresponding
newsgroup on gmane.org. Replies sent to my email address are just
filtered to a folder in my mailbox and get periodically deleted without
ever having been read.





Re: was getting disk failure errors, repaired the sectors, now what?

2010-07-01 Thread Andrei Popescu
On Mi, 30 iun 10, 17:34:22, H.S. wrote:
 
> So now I know that my backups most probably are not trustworthy, the
> ones from the last four or so days. No problem. I do rolling backups
> using cron and rsync. But what do I do now? Do I just delete the backups
> from the last four days and resume regular ones? 

Don't you have some method of checking the integrity of your backups?
(http://www.taobackup.com/integrity.html)
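
One simple way to get such an integrity check with standard tools: write a
checksum manifest at backup time and re-verify it later. A self-contained
sketch (the paths and file contents are made up for the demo):

```shell
# Sketch: checksum-based backup verification with sha256sum.
# A manifest is written when the backup is made and re-checked later;
# any file the failing disk corrupted shows up as FAILED.
tmp=$(mktemp -d)
mkdir -p "$tmp/backup"
echo "important data" > "$tmp/backup/file.txt"   # stand-in for real backup data
( cd "$tmp/backup" && find . -type f -exec sha256sum {} + > "$tmp/manifest" )
# ... later, after suspecting disk trouble, verify every file:
( cd "$tmp/backup" && sha256sum -c "$tmp/manifest" )
```

Keeping the manifest on a different disk than the backup itself is what
makes the check meaningful.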

>   How risky is the
> partition even though the manufacturer's diagnostic utility reports no
> errors now?
 
It is considered that a modern drive developing bad sectors visible to 
the system[1] is not to be trusted.

[1] drives are remapping bad sectors internally, until they run out of 
spare sectors.

> Another thing, I get this when I log in to a console, what is it all about?
> #-#
> Last login: Wed Jun 30 17:23:40 EDT 2010 from localhost on pts/9
> /etc/update-motd.d/20-cpu-checker: line 3:
> /usr/lib/update-notifier/update-motd-cpu-checker: No such file or directory
> /etc/update-motd.d/20-cpu-checker: line 3: exec:
> /usr/lib/update-notifier/update-motd-cpu-checker: cannot execute: No
> such file
> or directory
> run-parts: /etc/update-motd.d/20-cpu-checker exited with return code 126
> /etc/update-motd.d/90-updates-available: line 3:
> /usr/lib/update-notifier/update-motd-updates-available: No such file or
> directory
> /etc/update-motd.d/90-updates-available: line 3: exec:
> /usr/lib/update-notifier/update-motd-updates-available: cannot execute:
> No such file or directory
> run-parts: /etc/update-motd.d/90-updates-available exited with return
> code 126
> /etc/update-motd.d/98-reboot-required: line 3:
> /usr/lib/update-notifier/update-motd-reboot-required: No such file or
> directory
> /etc/update-motd.d/98-reboot-required: line 3: exec:
> /usr/lib/update-notifier/update-motd-reboot-required: cannot execute: No
> such file or directory
> run-parts: /etc/update-motd.d/98-reboot-required exited with return code 126

This looks like filesystem corruption, did you fsck the drive/partition?
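
Running fsck on the affected partition, per the suggestion above, can be
sketched as follows (the device name is the one from this thread; the
filesystem must be unmounted first, e.g. from a live/rescue system, and
the commands need root):

```shell
# Sketch only: check and repair an ext3 partition.
umount /dev/sdb3          # partition from the thread; adjust as needed
e2fsck -f -c /dev/sdb3    # -f: force a full check, -c: scan for bad blocks
```

The -c option has e2fsck run a read-only badblocks scan and mark any bad
blocks it finds so the filesystem stops using them.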

Regards,
Andrei
-- 
Offtopic discussions among Debian users and developers:
http://lists.alioth.debian.org/mailman/listinfo/d-community-offtopic




was getting disk failure errors, repaired the sectors, now what?

2010-06-30 Thread H.S.

I noticed that when I rebooted my machine earlier today, it would not
load the kernel and it was giving some "media error" messages.

I did various basic hardware debugging and ended up with my hard disk's
manufacturer's diagnostic utility telling me that there were bad sectors
on the drive. This was from a Windows 7 machine. But it would not repair
the disk. Searched some more and realized I should try it from a boot
disk (as opposed to from within Windows) created from the diagnostic
utility. So I did that, rebooted in DR DOS and ran the test again. This
time the test reported errors but also repaired them.


Now, I checked my logs from the last few days, and it seems like the problem
started only 3 or 4 days ago (the errors are given further below). The
problem appears to be in /dev/sdb3. The good news is that I do regular
backups of my /home (don't care about the system files, I can always
reinstall it), so I wasn't worried about losing any data (the OS and
/home partitions are on sda). The bad news is that my backups are on
/dev/sdb3 :(


So now I know that my backups most probably are not trustworthy, the
ones from the last four or so days. No problem. I do rolling backups
using cron and rsync. But what do I do now? Do I just delete the backups
from the last four days and resume regular ones? How risky is the
partition, even though the manufacturer's diagnostic utility reports no
errors now?

Another thing, I get this when I log in to a console, what is it all about?
#-#
Last login: Wed Jun 30 17:23:40 EDT 2010 from localhost on pts/9
/etc/update-motd.d/20-cpu-checker: line 3:
/usr/lib/update-notifier/update-motd-cpu-checker: No such file or directory
/etc/update-motd.d/20-cpu-checker: line 3: exec:
/usr/lib/update-notifier/update-motd-cpu-checker: cannot execute: No
such file
or directory
run-parts: /etc/update-motd.d/20-cpu-checker exited with return code 126
/etc/update-motd.d/90-updates-available: line 3:
/usr/lib/update-notifier/update-motd-updates-available: No such file or
directory
/etc/update-motd.d/90-updates-available: line 3: exec:
/usr/lib/update-notifier/update-motd-updates-available: cannot execute:
No such file or directory
run-parts: /etc/update-motd.d/90-updates-available exited with return
code 126
/etc/update-motd.d/98-reboot-required: line 3:
/usr/lib/update-notifier/update-motd-reboot-required: No such file or
directory
/etc/update-motd.d/98-reboot-required: line 3: exec:
/usr/lib/update-notifier/update-motd-reboot-required: cannot execute: No
such file or directory
run-parts: /etc/update-motd.d/98-reboot-required exited with return code 126
Linux red 2.6.32-100601-red-1394 #1 Tue Jun 1 00:13:15 EDT 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
#-#


The errors from the log are given below.

#-#
Jun 28 10:30:01 red /USR/SBIN/CRON[11462]: (CRON) error (grandchild
#11463 failed with exit status 5)
Jun 28 10:30:01 red kernel: [210131.012427] sd 0:0:1:0: [sdb] Unhandled
error code
Jun 28 10:30:01 red kernel: [210131.012436] sd 0:0:1:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 28 10:30:01 red kernel: [210131.012444] sd 0:0:1:0: [sdb] CDB:
Read(10): 28 00 09 6d 24 54 00 00 08 00
Jun 28 10:30:01 red kernel: [210131.012459] end_request: I/O error, dev
sdb, sector 158147668
Jun 28 10:30:01 red kernel: [210131.012489] EXT3-fs error (device sdb3):
ext3_get_inode_loc: unable to read inode block - inode=2894305,
block=5799941
Jun 28 10:51:48 red -- MARK --
Jun 28 11:08:26 red smartd[1577]: Device: /dev/sda [SAT], SMART Usage
Attribute: 194 Temperature_Celsius changed from 98 to 96
Jun 28 11:17:02 red /USR/SBIN/CRON[11544]: (root) CMD (   cd / &&
run-parts --report /etc/cron.hourly)
Jun 28 11:31:48 red -- MARK --
Jun 28 11:51:48 red -- MARK --
Jun 28 11:55:46 red kernel: [215276.064441] sd 0:0:1:0: [sdb] Unhandled
error code
Jun 28 11:55:46 red kernel: [215276.064450] sd 0:0:1:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 28 11:55:46 red kernel: [215276.064458] sd 0:0:1:0: [sdb] CDB:
Read(10): 28 00 00 06 12 6f 00 00 08 00
Jun 28 11:55:46 red kernel: [215276.064473] end_request: I/O error, dev
sdb, sector 397935
Jun 28 11:55:46 red kernel: [215276.064497] EXT3-fs error (device sdb5):
ext3_find_entry: reading directory #2 offset 0
Jun 28 11:55:46 red kernel: [215276.064553] sd 0:0:1:0: [sdb] Unhandled
error code
Jun 28 11:55:46 red kernel: [215276.064557] sd 0:0:1:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 28 11:55:46 red kernel: [215276.064563] sd 0:0:1:0: [sdb] CDB:
Write(10): 2a 00 00 05 e2 57 00 0

Re: Need advice analyzing S.M.A.R.T data of HDD in "Imminent disk failure".

2009-11-03 Thread Tony Nelson
On 09-11-03 21:29:19, Luis Maceira wrote:
> In Ubuntu 9.10 I have received warnings that a HDD is in
> pre-failure. The disk (Iomega Prestige mobile USB external) is 1
> month old. In Debian Testing and OpenSolaris (installed on the same HDD)
> I have no warnings. Using smartmontools (this disk is not in its
> database) I get below:
> m...@mycomputer:~$ sudo smartctl -T verypermissive -a -d sat --health
> /dev/sdb
 ...
> SMART overall-health self-assessment test result: PASSED
> See vendor-specific Attribute list for marginal Attributes.
 ...
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE 
> UPDATED  WHEN_FAILED RAW_VALUE
 ...
>  10 Spin_Retry_Count0x0013   100   092   097Pre-fail 
> Always   In_the_past 0
 ...

This is the complaint.  Perhaps the disk won't spin up some day, making 
the data inaccessible, or perhaps the disk just didn't get enough power 
one time (use an external power supply to prevent a recurrence).

I don't think there is any way to get Palimpsest to not complain other 
than to shut it off.

In the past I have rendered a drive amnesiac by updating its firmware, 
but I can't really recommend that, and it's a hassle.

-- 

TonyN.





Re: Need advice analyzing S.M.A.R.T data of HDD in "Imminent disk failure".

2009-11-03 Thread Greg Madden
On Tuesday 03 November 2009 17:29:19 Luis Maceira wrote:
> In Ubuntu 9.10 I have received warnings that a HDD is in pre-failure. The
> disk (Iomega Prestige mobile USB external) is 1 month old. In Debian
> Testing and OpenSolaris (installed on the same HDD) I have no warnings.
> Using smartmontools (this disk is not in its database) I get below:
> m...@mycomputer:~$ sudo smartctl -T verypermissive -a -d sat --health
> /dev/sdb
>
snip
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> See vendor-specific Attribute list for marginal Attributes.
>



SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always   -           116611138
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           175
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   062   060   030    Pre-fail  Always   -           1756052
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           6
 10 Spin_Retry_Count        0x0013   100   092   097    Pre-fail  Always   In_the_past 0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always   -           175
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   073   064   045    Old_age   Always   -           27 (Lifetime Min/Max 18/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           157
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always   -           17797
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always   -           27 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   047   045   000    Old_age   Always   -           116611138
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           1073741824


This is really an Ubuntu issue: Ubuntu has implemented new stuff, DeviceKit
(ata) among others. This has resulted in SMART error messages in user space,
which has caused many users to wonder wtf.

It is still the same info from the SMART-enabled drive; you probably didn't
test with smartctl very often, at least I didn't.

AFAICT, if 'Reallocated_Sector_Ct' and 'Offline_Uncorrectable' equal zero,
the drive firmware is still doing its job of remapping bad blocks.

Lots of info on Google, dating way back, on smartmontools.
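
That zero-check on the remap counters can be scripted. A sketch that
parses `smartctl -A`-style output with awk; the sample rows are taken from
the listing above, and on a live system you would pipe in
`smartctl -A /dev/sdX` instead of the here-doc:

```shell
# Flag any nonzero remap-related SMART counters in `smartctl -A` output.
# Exits 0 when all three counters are zero, 1 otherwise.
awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
       print $2, "raw =", $NF          # $NF is the RAW_VALUE column
       if ($NF + 0 != 0) bad = 1
     }
     END { exit bad }' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
EOF
```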



-- 
Peace

Greg Madden


Need advice analyzing S.M.A.R.T data of HDD in "Imminent disk failure".

2009-11-03 Thread Luis Maceira
In Ubuntu 9.10 I have received warnings that a HDD is in pre-failure. The
disk (Iomega Prestige mobile USB external) is 1 month old. In Debian Testing
and OpenSolaris (installed on the same HDD) I have no warnings. Using
smartmontools (this disk is not in its database) I get the output below:
m...@mycomputer:~$ sudo smartctl -T verypermissive -a -d sat --health /dev/sdb

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: ST9250315AS
Serial Number:5VC106FC
Firmware Version: 0001BSM1
User Capacity:250,059,350,016 bytes
Device is:Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:Wed Nov  4 01:47:20 2009 WET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Values from ATA Return Descriptor are:
 00 09 0c 00 00 00 00 00 00  00 4f 00 00 00 00  
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:  (   0) The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection: (   0) seconds.
Offline data collection
capabilities:(0x73) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:(   1) minutes.
Extended self-test routine
recommended polling time:(  69) minutes.
Conveyance self-test routine
recommended polling time:(   2) minutes.
SCT capabilities:  (0x103b) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always   -           116611138
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           175
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   062   060   030    Pre-fail  Always   -           1756052
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           6
 10 Spin_Retry_Count        0x0013   100   092   097    Pre-fail  Always   In_the_past 0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always   -           175
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   073   064   045    Old_age   Always   -           27 (Lifetime Min/Max 18/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           157
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always   -           17797
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always   -           27 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001

Re: need advice on scsi disk failure

2009-08-21 Thread owens
>
>
>
> Original Message 
>From: longwind2...@gmail.com
>To: debian-user@lists.debian.org
>Subject: Re: need advice on scsi disk failure
>Date: Fri, 21 Aug 2009 14:33:10 -0800
>
>>On Fri, Aug 21, 2009 at 7:34 AM,  wrote:
>>>>
>>> The conventional approach is first to clean the contacts on the
>>> connector and the card (some alcohol on a cotton swab for the
>>> connector and a pencil eraser for the card contacts) and try
>again.
>>> If that doesn't work go to "plan B" (run a complete disk
>diagnostic
>>> regularly-if the problem is the connector then the disk diagnostic
>>> will only fail when the connector isn't inserted fully; if the
>>> problem is NOT the connector then the disk diagnostic should pick
>up
>>> the problem).
>>> L
>>>>>
>>>>>--
>>
>>You are right!
>>I don't have alcohol
>>so I just use my hand to rub the connection between cable and
>connector
>>the new power supply (which always fails to power my scsi disk) can
>power it now
>>
>>The human's hand is the best tool!
>>
>>I still have to buy a new scsi cable (and connector?)
>>
Good stuff!  The eraser merely provides some good friction without
producing particles that might muck up your computer; obviously your
hand worked.  As for me, if your system is stable, don't buy new
cables or connectors ("if it ain't broke, don't fix it").
L
>>
>>
>>






Re: need advice on scsi disk failure

2009-08-21 Thread Long Wind
On Fri, Aug 21, 2009 at 7:34 AM,  wrote:
>>
> The conventional approach is first to clean the contacts on the
> connector and the card (some alcohol on a cotton swab for the
> connector and a pencil eraser for the card contacts) and try again.
> If that doesn't work go to "plan B" (run a complete disk diagnostic
> regularly-if the problem is the connector then the disk diagnostic
> will only fail when the connector isn't inserted fully; if the
> problem is NOT the connector then the disk diagnostic should pick up
> the problem).
> L
>>>
>>>--

You are right!
I don't have alcohol,
so I just used my hand to rub the connection between cable and connector.
The new power supply (which always failed to power my SCSI disk) can power it now.

The human hand is the best tool!

I still have to buy a new SCSI cable (and connector?)





Re: need advice on scsi disk failure

2009-08-21 Thread Long Wind
On Fri, Aug 21, 2009 at 11:34 AM,  wrote:
> The conventional approach is first to clean the contacts on the
> connector and the card (some alcohol on a cotton swab for the
> connector and a pencil eraser for the card contacts) and try again.
> If that doesn't work go to "plan B" (run a complete disk diagnostic
> regularly-if the problem is the connector then the disk diagnostic
> will only fail when the connector isn't inserted fully; if the
> problem is NOT the connector then the disk diagnostic should pick up
> the problem).
> L

Thanks!
But I don't have alcohol.
"If that doesn't work": the SCSI card doesn't find the disk, so I can't
run a disk diagnostic.
I have a new power supply and an old one.
Using the new one, the SCSI disk can never be found,
but with the old one, I have more luck.





RE: need advice on scsi disk failure

2009-08-21 Thread owens
>
>
>
> Original Message 
>From: longwind2...@gmail.com
>To: debian-user@lists.debian.org
>Subject: RE: need advice on scsi disk failure
>Date: Fri, 21 Aug 2009 08:50:21 -0400
>
>>I bought a scsi 50G disk a few years ago
>>The seller said it had been used on server for a long time
>>The scsi card used to warn that the disk will fail soon during boot
>>Then I change SCSI card firmware, the warning disappear
>>Starting 5 days ago, the card often can't find disk during boot
>>It's no surprise because the light on scsi disk isn't on
>>I reconnect the power cable to scsi disk again and again
>>and then with some luck the disk works normally.
>>It seems that the power connection becomes loose
>>but I am afraid the disk is failing
>>My question is "Is that symptoms of scsi disk failure?"
>>
The conventional approach is first to clean the contacts on the
connector and the card (some alcohol on a cotton swab for the
connector and a pencil eraser for the card contacts) and try again. 
If that doesn't work go to "plan B" (run a complete disk diagnostic
regularly-if the problem is the connector then the disk diagnostic
will only fail when the connector isn't inserted fully; if the
problem is NOT the connector then the disk diagnostic should pick up
the problem).
L
>>
>>
>>






Re: need advice on scsi disk failure

2009-08-21 Thread mitch
On Fri, 21 Aug 2009 09:35:25 -0400
Long Wind  wrote:

> On Fri, Aug 21, 2009 at 8:56 AM, mitch
> wrote:
> >
> > I had the same problem, scsi drives failing to start, shutting down
> > while running.
> >
> > Bad power connector. The pins were not making proper contact at all
> > times.
> >
> > Changed the connectors and the problem stopped.
> >
> 
> Really?
> I always think scsi cable and connector are durable component
> I rarely touch these components
> I will buy new ones
> Thanks!

That's what I found. The pins are metal, and heat causes metal to
expand; SCSI drives can get warm, and the heat gets transferred
to the pins.





Re: need advice on scsi disk failure

2009-08-21 Thread Long Wind
On Fri, Aug 21, 2009 at 8:56 AM, mitch wrote:
>
> I had the same problem, scsi drives failing to start, shutting down
> while running.
>
> Bad power connector. The pins were not making proper contact at all
> times.
>
> Changed the connectors and the problem stopped.
>

Really?
I always thought SCSI cables and connectors were durable components.
I rarely touch these components.
I will buy new ones.
Thanks!





Re: need advice on scsi disk failure

2009-08-21 Thread mitch
On Fri, 21 Aug 2009 08:50:21 -0400
Long Wind  wrote:

> It's no surprise because the light on scsi disk isn't on
> I reconnect the power cable to scsi disk again and again
> and then with some luck the disk works normally.
> It seems that the power connection becomes loose

I had the same problem, scsi drives failing to start, shutting down
while running.

Bad power connector. The pins were not making proper contact at all
times.

Changed the connectors and the problem stopped.





need advice on scsi disk failure

2009-08-21 Thread Long Wind
I bought a 50 GB SCSI disk a few years ago.
The seller said it had been used in a server for a long time.
The SCSI card used to warn during boot that the disk would fail soon.
Then I changed the SCSI card firmware, and the warning disappeared.
Starting 5 days ago, the card often can't find the disk during boot.
It's no surprise, because the light on the SCSI disk isn't on.
I reconnect the power cable to the SCSI disk again and again,
and then with some luck the disk works normally.
It seems that the power connection has become loose,
but I am afraid the disk is failing.
My question is: "Are these symptoms of SCSI disk failure?"





Re: disk failure [CLOSED]

2007-11-14 Thread Andrew Sackville-West
On Wed, Nov 14, 2007 at 04:46:42PM +, michael wrote:
> On Wed, 2007-11-14 at 11:41 +, michael wrote:
> > On Wed, 2007-11-14 at 12:22 +0100, Jochen Schulz wrote:
> > > michael:
> > > > 'tiger' just told me various home directories are unavailable and upon
> > > > further investigation I see disk errors. Here's the first reports I can
> > > > find regarding said hard drive:
> > > > 
> > > > Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
> > > > -P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
> > > > Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
> > > > Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
> > > > Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
> > > > { DriveReady SeekComplete DataRequest }
> > > 
> > > If I were you, I would assume it is dead. Try to copy everything you
> > > still can get off the disk, use your backup for the rest.
> > 
> > When I tried e2fsck it said there was no partition on the HD (sorry just
> > had to turn the machine off, see below)
> > 
> > > If you are curious, you may use smartctl from the package smartmontools
> > > to do further investigation. Your hard disk's manufacturer probably
> > > offers diagnostic tools as well.
> > 
> > SMART only gave very limited info - didn't seem to be able to read disk
> > at all.
> > 
> > > > Does anybody have ideas if this means the HD has actually died or not?
> > > 
> > > It's most probably dead.
> > 
> > 
> > Just discover the air con where the server is situated had conked out.
> > Estimated room temp was 30C (in UK where outside air temp is about 7C
> > today) so I suspect that may be a contributing factor. I'm crossing my
> > fingers that once the room etc has cooled down that the HD will work
> > again...
> > 
> 
> aha, close shave, now room cooler the HD in question works again. smartd
> now installed and running ;)

I'd still back up and replace that drive... The old trick of putting a
HD in the freezer to get another boot out of it only works for so long,
and it sounds like that drive is headed down that path.
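
For the smartd setup mentioned above, a single line in /etc/smartd.conf is
enough to get scheduled self-tests and failure mail. An illustrative entry,
not from the thread (/dev/hdd is the drive discussed here; the schedule and
mail address are assumptions):

```
# /etc/smartd.conf -- illustrative entry:
# monitor all SMART attributes (-a), keep automatic offline testing (-o)
# and attribute autosave (-S) on, run a short self-test every day at
# 02:00, and mail root when something degrades.
/dev/hdd -a -o on -S on -s (S/../.././02) -m root
```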

A




Re: disk failure [CLOSED]

2007-11-14 Thread michael
On Wed, 2007-11-14 at 11:41 +, michael wrote:
> On Wed, 2007-11-14 at 12:22 +0100, Jochen Schulz wrote:
> > michael:
> > > 'tiger' just told me various home directories are unavailable and upon
> > > further investigation I see disk errors. Here's the first reports I can
> > > find regarding said hard drive:
> > > 
> > > Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
> > > -P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
> > > Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
> > > Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
> > > Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
> > > { DriveReady SeekComplete DataRequest }
> > 
> > If I were you, I would assume it is dead. Try to copy everything you
> > still can get off the disk, use your backup for the rest.
> 
> When I tried e2fsck it said there was no partition on the HD (sorry just
> had to turn the machine off, see below)
> 
> > If you are curious, you may use smartctl from the package smartmontools
> > to do further investigation. Your hard disk's manufacturer probably
> > offers diagnostic tools as well.
> 
> SMART only gave very limited info - didn't seem to be able to read disk
> at all.
> 
> > > Does anybody have ideas if this means the HD has actually died or not?
> > 
> > It's most probably dead.
> 
> 
> Just discovered the air con where the server is situated had conked out.
> Estimated room temp was 30C (in the UK, where the outside air temp is about
> 7C today), so I suspect that may be a contributing factor. I'm crossing my
> fingers that once the room etc. has cooled down the HD will work
> again...
> 

aha, close shave - now the room is cooler, the HD in question works again.
smartd is now installed and running ;)
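For reference, a minimal /etc/smartd.conf sketch along the lines of what was set up here (a hypothetical example, not the poster's actual config; the device name and test schedule are illustrative and should be adjusted):

```
# /etc/smartd.conf -- sketch
# -a       monitor all SMART attributes and log changes
# -o on    enable automatic offline data collection
# -S on    enable attribute autosave
# -s ...   schedule a short self-test daily at 02:00, a long test Saturdays at 03:00
# -m root  mail warnings to root
/dev/hdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```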


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] 
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: disk failure

2007-11-14 Thread michael
On Wed, 2007-11-14 at 12:22 +0100, Jochen Schulz wrote:
> michael:
> > 'tiger' just told me various home directories are unavailable and upon
> > further investigation I see disk errors. Here's the first reports I can
> > find regarding said hard drive:
> > 
> > Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
> > -P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
> > Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
> > Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
> > Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
> > { DriveReady SeekComplete DataRequest }
> 
> If I were you, I would assume it is dead. Try to copy everything you
> still can get off the disk, use your backup for the rest.

When I tried e2fsck it said there was no partition on the HD (sorry just
had to turn the machine off, see below)

> If you are curious, you may use smartctl from the package smartmontools
> to do further investigation. Your hard disk's manufacturer probably
> offers diagnostic tools as well.

SMART only gave very limited info - didn't seem to be able to read disk
at all.

> > Does anybody have ideas if this means the HD has actually died or not?
> 
> It's most probably dead.


Just discovered the air con where the server is situated had conked out.
Estimated room temp was 30C (in the UK, where the outside air temp is about
7C today), so I suspect that may be a contributing factor. I'm crossing my
fingers that once the room etc. has cooled down the HD will work
again...







Re: disk failure

2007-11-14 Thread Jochen Schulz
michael:
> 'tiger' just told me various home directories are unavailable and upon
> further investigation I see disk errors. Here's the first reports I can
> find regarding said hard drive:
> 
> Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
> -P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
> Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
> Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
> Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
> { DriveReady SeekComplete DataRequest }

If I were you, I would assume it is dead. Try to copy everything you
still can get off the disk, use your backup for the rest.

If you are curious, you may use smartctl from the package smartmontools
to do further investigation. Your hard disk's manufacturer probably
offers diagnostic tools as well.

> Does anybody have ideas if this means the HD has actually died or not?

It's most probably dead.

J.
-- 
When standing at the top of beachy head I find the rocks below very
attractive.
[Agree]   [Disagree]
 




Re: disk failure

2007-11-14 Thread michael
On Wed, 2007-11-14 at 10:36 +, michael wrote:
> 'tiger' just told me various home directories are unavailable and upon
> further investigation I see disk errors. Here's the first reports I can
> find regarding said hard drive:
> 
> Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
> -P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
> Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
> Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
> Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
> { DriveReady SeekComplete DataRequest }
> Nov 13 02:27:47 ratty kernel: ide: failed opcode was: unknown
> Nov 13 02:27:47 ratty kernel: hdd: status timeout: status=0xd0 { Busy }
> Nov 13 02:27:47 ratty kernel: ide: failed opcode was: unknown
> Nov 13 02:27:47 ratty kernel: hdc: DMA disabled
> Nov 13 02:27:47 ratty kernel: hdd: drive not ready for command
> Nov 13 02:27:58 ratty kernel: ide1: reset: success
> Nov 13 02:33:12 ratty kernel: hdd: dma_timer_expiry: dma status == 0x40
> Nov 13 02:33:12 ratty kernel: hdd: DMA timeout retry
> Nov 13 02:33:12 ratty kernel: hdd: timeout waiting for DMA
> Nov 13 02:33:12 ratty kernel: hdd: status timeout: status=0xd0 { Busy }
> Nov 13 02:33:12 ratty kernel: ide: failed opcode was: unknown
> Nov 13 02:33:12 ratty kernel: hdd: drive not ready for command
> Nov 13 02:33:42 ratty kernel: ide1: reset timed-out, status=0x80
> Nov 13 02:33:42 ratty kernel: hdd: status error: status=0x7f
> { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index
> Error }
> Nov 13 02:33:42 ratty kernel: hdd: status error: error=0x7f
> { DriveStatusError UncorrectableError SectorIdNotFound TrackZeroNotFound
> AddrMarkNotFound }, LBAs
> ect=1103831727999, high=65793, low=8355711, sector=1440664839
> Nov 13 02:33:42 ratty kernel: ide: failed opcode was: unknown
> Nov 13 02:33:42 ratty kernel: hdd: no DRQ after issuing MULTWRITE_EXT
> Nov 13 02:34:12 ratty kernel: ide1: reset timed-out, status=0x80
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664839
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083097
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664847
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083098
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664855
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083099
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664863
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083100
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664871
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083101
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664879
> Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
> block 180083102
> Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
> Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
> 1440664887
> {and much more}
> 
> Does anybody have ideas if this means the HD has actually died or not?
> Other HDs seem okay. Cheers, Michael

clamscan also found:
//home/michael/.gnome2/epiphany/mozilla/epiphany/Cache/_CACHE_001_:
Trojan.Downloader.JS.Zapchast.B FOUND

so I've removed the .gnome2/epiphany/mozilla/epiphany/Cache directory





disk failure

2007-11-14 Thread michael
'tiger' just told me various home directories are unavailable and upon
further investigation I see disk errors. Here's the first reports I can
find regarding said hard drive:

Nov 13 02:23:01 ratty /USR/SBIN/CRON[19292]: (michael) CMD (rsync -r -v
-P --links --stats /data_hdb1/michael/ /data_hdd1/michael/)
Nov 13 02:27:32 ratty kernel: hdd: dma_timer_expiry: dma status == 0x61
Nov 13 02:27:47 ratty kernel: hdd: DMA timeout error
Nov 13 02:27:47 ratty kernel: hdd: dma timeout error: status=0x58
{ DriveReady SeekComplete DataRequest }
Nov 13 02:27:47 ratty kernel: ide: failed opcode was: unknown
Nov 13 02:27:47 ratty kernel: hdd: status timeout: status=0xd0 { Busy }
Nov 13 02:27:47 ratty kernel: ide: failed opcode was: unknown
Nov 13 02:27:47 ratty kernel: hdc: DMA disabled
Nov 13 02:27:47 ratty kernel: hdd: drive not ready for command
Nov 13 02:27:58 ratty kernel: ide1: reset: success
Nov 13 02:33:12 ratty kernel: hdd: dma_timer_expiry: dma status == 0x40
Nov 13 02:33:12 ratty kernel: hdd: DMA timeout retry
Nov 13 02:33:12 ratty kernel: hdd: timeout waiting for DMA
Nov 13 02:33:12 ratty kernel: hdd: status timeout: status=0xd0 { Busy }
Nov 13 02:33:12 ratty kernel: ide: failed opcode was: unknown
Nov 13 02:33:12 ratty kernel: hdd: drive not ready for command
Nov 13 02:33:42 ratty kernel: ide1: reset timed-out, status=0x80
Nov 13 02:33:42 ratty kernel: hdd: status error: status=0x7f
{ DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index
Error }
Nov 13 02:33:42 ratty kernel: hdd: status error: error=0x7f
{ DriveStatusError UncorrectableError SectorIdNotFound TrackZeroNotFound
AddrMarkNotFound }, LBAs
ect=1103831727999, high=65793, low=8355711, sector=1440664839
Nov 13 02:33:42 ratty kernel: ide: failed opcode was: unknown
Nov 13 02:33:42 ratty kernel: hdd: no DRQ after issuing MULTWRITE_EXT
Nov 13 02:34:12 ratty kernel: ide1: reset timed-out, status=0x80
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664839
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083097
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664847
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083098
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664855
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083099
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664863
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083100
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664871
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083101
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664879
Nov 13 02:34:12 ratty kernel: Buffer I/O error on device hdd1, logical
block 180083102
Nov 13 02:34:12 ratty kernel: lost page write due to I/O error on hdd1
Nov 13 02:34:12 ratty kernel: end_request: I/O error, dev hdd, sector
1440664887
{and much more}

Does anybody have ideas if this means the HD has actually died or not?
Other HDs seem okay. Cheers, Michael





Re: Hard disk failure?

2006-03-31 Thread Ramiro Aceves
> Well, I have been wrong a few times before, but jumping to conclusions
> has stood me well over the years on average.  I'm a semi-almost-nearly
> retired broadcast engineer and a Certified Electronics Technician with
> over 55 years of corralling electrons for a living.  They never seem to
> want you to fully retire when they *think* they might have to call me
> in an emergency.  So they sort of keep me on retainer or something.  It
> helps pay the medical insurance that's a bit hard to get when you're
> past 70. :)
>
> Let me know what the Seagate diagnostics thinks of it.  I'm assuming of
> course that the cabling is correct, as in this drive *is* on the end of
> the cable, and not the middle connector, and that the longer section of
> the cable is plugged into the motherboard, as either of those two
> conditions not being met can bite you in a similar manner.
>
> Good luck, and toss me a mail if that's it, or the drive itself is
> toasted.

Hello again,

Thanks for the responses.

Last night I ran the SeaTools tool on the entire disk (a full test) and
it did not find any errors. I do not understand why this tool says
nothing about SMART or the recorded errors I see with smartctl :-(

Yes, I have a 40 GB drive on the middle of the cable and the 160 GB
drive is on the end. Today I am going to buy a new cable just to
rule out a cabling issue (I do not know if it makes sense). Also I am
going to measure the disk input voltages to verify everything is OK
with the power supply.

I have another Debian Sarge on the 40GB drive that I use to backup the
first disk. I have booted from it and made a full system backup of my
Debian partition on the first disk.

This problem is very annoying because it is intermittent. For
example, I did a full backup on the 40GB disk from my Debian partition
on the first disk without problems.

But sometimes, when I mount

mount -t ext3 /dev/hda1 /mnt/harddisk1   (on the second disk)

it refuses to mount and I get unrecoverable disk errors. But
sometimes it just works fine. This is driving me nuts ;-)

I am lost on what to do.



Thank you very much in advance.
Ramiro.



>
> >Regards.
> >
> >Ramiro,.



Re: Hard disk failure?

2006-03-30 Thread spacetrial
hi,

I myself had similar troubles with a SATA Seagate drive and DMA (no
problems in WinXP) - see Jeff Garzik's libata page.

I solved it by buying a Western Digital drive.

Al_

On Thu 30. of March 2006 18:14, Ramiro Aceves wrote:
> Hello,
>
> Thank you very much for your help. I appreciate your time. The hard
> disk has faced the problem again today. The problem is that the disk
> exhibits the same errors (with different numbers) at boot.
>
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=43778543,
> high=2, low=10224111,> sector=43778543
> debian-remix kernel: end_request: I/O error, dev hda,
>  sector 43778543
>
> The system then remounts read-only and cannot write anything.
>
> I have just downloaded the Seagate diagnostics tool "SeaTools". It is
> a bootable CDROM iso image. I will run it and will post the results
> here.
>
> Regards.
>
> Ramiro





Re: Hard disk failure?

2006-03-30 Thread Ramiro Aceves
Hello,

Thank you very much for your help. I appreciate your time. The hard
disk has faced the problem again today. The problem is that the disk
exhibits the same errors (with different numbers) at boot.

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=43778543,
high=2, low=10224111,> sector=43778543
debian-remix kernel: end_request: I/O error, dev hda,
 sector 43778543

The system then remounts read-only and cannot write anything.

I have just downloaded the Seagate diagnostics tool "SeaTools". It is
a bootable CDROM iso image. I will run it and will post the results
here.

Regards.

Ramiro



Re: Hard disk failure?

2006-03-30 Thread Philippe De Ryck
On Thu, 2006-03-30 at 13:38 +0200, Ramiro Aceves wrote:
> Hello Debian friends,
> 
> On september 2005 I bought a new Seagate 160 GB hard disk type
> ST3160021A UDMA (not SATA) and after some time of good working I am
> getting some kind of errors, mainly on Debian Sarge startup.
> 
> Sometimes my system does not boot because it says something like
> "readonly filesystem".
> 
> The errors occur frequently now, and they often happen when the system
> is booting "cold", I mean, the first time I switch it on.
> 
> I cannot tell you the exact messages because I am not the normal user
> of this computer. My mother, who uses the computer, has written down
> the following message, so it may not be accurate:
> 
> "ext3 error device hda1 in start transation: readonly filesystem."
> 
> I also have some  /var/log/messages errors:
> 
> 
> Mar 26 10:49:23 debian-remix kernel: hda: dma_intr: status=0x51 {
> DriveReady SeekComplete Error }
> Mar 26 10:49:23 debian-remix kernel: hda: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=43778543, high=2, low=10224111,
> sector=43778543
> Mar 26 10:49:23 debian-remix kernel: end_request: I/O error, dev hda,
> sector 43778543
> 
> 
> I have also run smartctl tests with the following results:
> 
> 
> # smartctl -a /dev/hda
> 
> From which I have captured the last 5 errors:



> 
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate 0x000f   058   056   006Pre-fail
> Always   -   129227943
>   3 Spin_Up_Time0x0003   097   096   000Pre-fail
> Always   -   0
>   4 Start_Stop_Count0x0032   100   100   020Old_age
> Always   -   1
>   5 Reallocated_Sector_Ct   0x0033   098   098   036Pre-fail
> Always   -   80
>   7 Seek_Error_Rate 0x000f   073   060   030Pre-fail
> Always   -   22255207
>   9 Power_On_Hours  0x0032   100   100   000Old_age
> Always   -   795
>  10 Spin_Retry_Count0x0013   100   100   097Pre-fail
> Always   -   0
>  12 Power_Cycle_Count   0x0032   100   100   020Old_age
> Always   -   559
> 194 Temperature_Celsius 0x0022   033   040   000Old_age   Always
>   -   33
> 195 Hardware_ECC_Recovered  0x001a   058   056   000Old_age   Always
>   -   129227943
> 197 Current_Pending_Sector  0x0012   100   100   000Old_age   Always
>   -   0
> 198 Offline_Uncorrectable   0x0010   100   100   000Old_age
> Offline  -   0
> 199 UDMA_CRC_Error_Count0x003e   200   200   000Old_age   Always
>   -   0
> 200 Multi_Zone_Error_Rate   0x   100   253   000Old_age
> Offline  -   0
> 202 TA_Increase_Count   0x0032   100   253   000Old_age   Always
>   -   0



> 
> 
> What do you think I should do?
> 
> 1- Does it make sense to check the disk cable? Or is it an "internal"
> disk drive error?
> 2- Should I return the disk to my seller?
> 
> 
> Normally, restarting the computer solves the problem after an fsck.
> Sometimes I have also run a "manual" fsck with no apparent data loss. I
> am concerned about a more serious hard disk failure with real data
> loss. (I have done backups, no problem   ;-)  )
> 
> Many thanks in advance:
> 
> Ramiro
> 

Those errors are bad indeed. I've seen those kernel messages on one of
my machines due to a faulty cable (a few years ago). I've faced some hard
drive issues last week (still facing them, actually :)) and started looking
into SMART. I've found this article, which explains it quite well
(http://www.linuxjournal.com/article/6983 ). According to the article,
high values in the attribute table are good. You have some pretty low
values, even below the threshold, which is not good.
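As an aside (not part of the original exchange), the VALUE-versus-THRESH comparison described above can be checked mechanically. A minimal Python sketch, assuming the column layout smartctl -A prints (ID# NAME FLAG VALUE WORST THRESH ...); the sample rows are abbreviated from the table quoted earlier, and an attribute is conventionally considered failing once its normalized VALUE drops to or below THRESH:

```python
# Sketch: flag SMART attributes whose normalized VALUE has fallen to or
# below the vendor threshold (the condition smartctl marks FAILING_NOW).

def failing_attributes(lines):
    """Parse 'ID NAME FLAG VALUE WORST THRESH' rows; return names at/below threshold."""
    failing = []
    for line in lines:
        parts = line.split()
        name, value, thresh = parts[1], int(parts[3]), int(parts[5])
        if value <= thresh:
            failing.append(name)
    return failing

# Rows abbreviated from the smartctl -A output quoted in this thread.
sample = [
    "1 Raw_Read_Error_Rate 0x000f 058 056 006",
    "5 Reallocated_Sector_Ct 0x0033 098 098 036",
    "7 Seek_Error_Rate 0x000f 073 060 030",
    "10 Spin_Retry_Count 0x0013 100 100 097",
]
print(failing_attributes(sample))  # prints [] - none of these rows has reached its threshold
```

Note that by this standard test, none of the quoted attributes has actually crossed its threshold yet; low-but-above-threshold values are a warning sign rather than a reported failure.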

You can also boot knoppix or some live distro and run badblocks on the
drive. This will scan the entire drive for badblocks.

Maybe Seagate provides a tool (on the site for instance) to examine the
drive. That could give some specific information.

I don't know if you can return a drive by saying that it is dying. I
think they will send you home with the message: "come back when it's
dead".

For information, my attribute table from my Maxtor 6Y120M0 (SATA):
ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time0x0027   138   128   063Pre-fail  Always
-   24509
  4 Start_Stop_Count0x0032   253   253   000Old_age   Always
-

Re: Hard disk failure?

2006-03-30 Thread Gene Heskett
On Thursday 30 March 2006 06:38, Ramiro Aceves wrote:

>smartctl -t /dev/hda

and see what falls out, but it sounds like a drive problem that needs to 
be warrantied to me.

-- 
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules.  I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.





Hard disk failure?

2006-03-30 Thread Ramiro Aceves
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


What do you think I should do?

1- Does it make sense to check the disk cable? Or is it an "internal"
disk drive error?
2- Should I return the disk to my seller?


Normally, restarting the computer solves the problem after an fsck.
Sometimes I have also run a "manual" fsck with no apparent data loss. I
am concerned about a more serious hard disk failure with real data
loss. (I have done backups, no problem   ;-)  )

Many thanks in advance:

Ramiro



Re: Hard disk failure?

2006-03-10 Thread Andrew Cady
On Fri, Mar 10, 2006 at 10:52:33AM +0100, Jim MacBaine wrote:
> Hello,
> 
> one of my hard drives seems to be dying.  To me as a layman this looks
> as if the disk should be returned to the shop where I bought it.  Is
> this right?
>
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }

It is most likely that the drive is dying.

> Shouldn't the md driver mark this drive as faulty and kick it out
> of the array? After I noticed the error in the syslog, I had to mark
> it faulty manually with mdadm.

Well, it's not completely dead yet.  The error has to reach the
md driver before the disk is marked as failed.  Although the message says
"UncorrectableError", Linux will in fact retry for a while before
reporting the error to md.  If a retry eventually succeeds, the drive is
not marked failed.





Re: Hard disk failure?

2006-03-10 Thread Duncan Anderson
On Friday, 10 March 2006 11:52, Jim MacBaine wrote:
> Hello,
>
> one of my hard drives seems to be dying.  To me as a layman this looks
> as if the disk should be returned to the shop where I bought it.  Is
> this right?
>
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }
> ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata1: status=0x51 { DriveReady SeekComplete Error }
> ata1: error=0x40 { UncorrectableError }
> sd 0:0:0:0: SCSI error: return code = 0x802
> sda: Current: sense key=0x3
> ASC=0x11 ASCQ=0x4
> end_request: I/O error, dev sda, sector 145661514
>
> It is a Seagate Barracuda 7200.8 SATA  drive with 200 GB capacity and
> a member of a raid5 md device.
>
> Shouldn't the md driver mark this drive as faulty and kick it out
> of the array? After I noticed the error in the syslog, I had to mark
> it faulty manually with mdadm.
>
> Regards,
> Jim


Since it's a Seagate you're in luck, because they have a carry-in warranty 
replacement policy. I have seen these kinds of errors before, shortly before 
my Barracuda died a tragic death. 

It is most likely to be the drive. Get it swapped out.

cheers
Duncan







Hard disk failure?

2006-03-10 Thread Jim MacBaine
Hello,

one of my hard drives seems to be dying.  To me as a layman this looks
as if the disk should be returned to the shop where I bought it.  Is
this right?

ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
sd 0:0:0:0: SCSI error: return code = 0x802
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 145661514

It is a Seagate Barracuda 7200.8 SATA  drive with 200 GB capacity and
a member of a raid5 md device.

Shouldn't the md driver mark this drive as faulty and kick it out
of the array? After I noticed the error in the syslog, I had to mark
it faulty manually with mdadm.

Regards,
Jim



Re: LVM and disk failure

2006-01-09 Thread Erik Karlin
On Sat, Jan 07, 2006 at 11:02:25PM -0800, Mike Bird wrote:
> On Sat, 2006-01-07 at 22:15, Daniel Webb wrote:
> > On Sat, Jan 07, 2006 at 09:02:20PM -0800, Mike Bird wrote:
> > Well, yes, but supposing you *do* have a failure?  Then what?  Half the
> > filesystem is still there on the second disk, is it recoverable, and if not,
> > why not?
> 
> You may get some of the data.  You probably won't get all of it.
> I've had the great good fortune to have a group of bad blocks
> develop at a place on a drive where no data was currently
> stored, and I've lost data to bad blocks too.
> 
> > I'm getting the impression that spanning volume groups with a logical volume
> > is a *very* bad idea unless the physical volumes are RAID.
> 
> A VG built on two single drive PV's is twice as large but roughly
> half as reliable as a single drive.  Depending upon the kind
> of data, that may or may not be a bad idea.
> 
> Almost all of our VG's are built on RAID-1 PV's.  

Yeah, that seems to be the most common. I built a 6-drive RAID-10 array,
originally with 3 missing disks. That gave me the reliability once I
eventually filled in the missing disks, and until then I was no worse off
than non-RAID.

As far as recovering LVM data goes, you can use the vgscan --partial option.
It's supposed to do exactly what you describe.





Re: LVM and disk failure

2006-01-07 Thread Mike Bird
On Sat, 2006-01-07 at 22:15, Daniel Webb wrote:
> On Sat, Jan 07, 2006 at 09:02:20PM -0800, Mike Bird wrote:
> Well, yes, but supposing you *do* have a failure?  Then what?  Half the
> filesystem is still there on the second disk, is it recoverable, and if not,
> why not?

You may get some of the data.  You probably won't get all of it.
I've had the great good fortune to have a group of bad blocks
develop at a place on a drive where no data was currently
stored, and I've lost data to bad blocks too.

> I'm getting the impression that spanning volume groups with a logical volume
> is a *very* bad idea unless the physical volumes are RAID.

A VG built on two single drive PV's is twice as large but roughly
half as reliable as a single drive.  Depending upon the kind
of data, that may or may not be a bad idea.
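The "roughly half as reliable" point can be made concrete with a back-of-the-envelope calculation (a sketch added for illustration, not part of the original exchange), treating drive failures over some period as independent events:

```python
# A VG spanning N independent single-drive PVs survives only if *every*
# drive survives, so its survival probability is p**N. For N = 2 and a
# small per-drive failure probability, the failure probability roughly
# doubles - "half as reliable" in failure-rate terms.

def vg_survival(p_drive, n_drives):
    """Probability that a VG spanning n independent single-drive PVs loses no data."""
    return p_drive ** n_drives

def raid1_pv_survival(p_drive):
    """A RAID-1 PV survives unless both mirrored drives fail."""
    return 1 - (1 - p_drive) ** 2

p = 0.95  # e.g. a 95% chance that one drive survives the period
print(vg_survival(p, 2))      # ~0.9025: about twice the failure probability of one drive
print(raid1_pv_survival(p))   # ~0.9975: mirroring makes each PV safer than a bare drive
```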

Almost all of our VG's are built on RAID-1 PV's.  

However, we have a VG which contains the LV which contains our
partial Debian mirror, together with a bunch of similar stuff.
It's huge but it can be rebuilt if a drive dies.  It's not
worth it to us to double up the 1.8TB to get RAID-1 protection
that isn't needed for that particular kind of data.

--Mike Bird





Re: LVM and disk failure

2006-01-07 Thread Daniel Webb
On Sat, Jan 07, 2006 at 09:02:20PM -0800, Mike Bird wrote:

> If you've got enough spindles, each physical volume is typically a RAID1
> or RAID5.  Then you can add and remove physical volumes from
> your volume group as needed.  A single disk failure is harmless.
> 
> Other than adding and removing physical volumes you don't resize
> the volume group.  The logical volumes are readily resizable, as
> are some of the filesystems (e.g. ext2/3) which they can contain.

Well, yes, but supposing you *do* have a failure?  Then what?  Half the
filesystem is still there on the second disk, is it recoverable, and if not,
why not?

I'm getting the impression that spanning volume groups with a logical volume
is a *very* bad idea unless the physical volumes are RAID.





Re: LVM and disk failure

2006-01-07 Thread Mike Bird
On Sat, 2006-01-07 at 20:20, Daniel Webb wrote:
> What happens when you have a 2-disk LVM volume group and disk 1 fails?
> Obviously this will depend on the filesystem you put on top of the volume,
> right?  So which filesystems will recover gracefully if you chop them in half
> like that?
> 
> It's a little disturbing that in all the documentation I've read on LVM this
> is never mentioned, and yet it seems to destroy the main purpose of LVM: to be
> able to add and remove disks to a volume easily.  Each physical volume you add
> makes it that much more likely that you'll lose the whole thing.  Sure, you
> can put it on top of RAID, but now you lose your size flexibility because RAID
> isn't so easy to resize (or is it?).  The snapshots feature is nice; that's
> all I'll use it for until I find a satisfactory answer to this question.

If you've got enough spindles, each physical volume is typically a RAID1
or RAID5.  Then you can add and remove physical volumes from
your volume group as needed.  A single disk failure is harmless.

Other than adding and removing physical volumes you don't resize
the volume group.  The logical volumes are readily resizable, as
are some of the filesystems (e.g. ext2/3) which they can contain.

--Mike Bird





LVM and disk failure

2006-01-07 Thread Daniel Webb
I've been Googling for the answer to this and failing, so:

What happens when you have a 2-disk LVM volume group and disk 1 fails?
Obviously this will depend on the filesystem you put on top of the volume,
right?  So which filesystems will recover gracefully if you chop them in half
like that?

It's a little disturbing that in all the documentation I've read on LVM this
is never mentioned, and yet it seems to destroy the main purpose of LVM: to be
able to add and remove disks to a volume easily.  Each physical volume you add
makes it that much more likely that you'll lose the whole thing.  Sure, you
can put it on top of RAID, but now you lose your size flexibility because RAID
isn't so easy to resize (or is it?).  The snapshots feature is nice; that's
all I'll use it for until I find a satisfactory answer to this question.

I was also checking out EVMS, and it looks very interesting.  Any impressions
from those who have used it?  Is it stable/reliable?  I didn't see anything in
their docs either about recovering when one disk in a volume fails.





Re: MD software raid & multiple disk failure recovery help script

2004-11-01 Thread Mike Fedyk
Mike Fedyk wrote:
Then it will create a new array (make sure you have one missing drive 
so that it doesn't try syncing the disks) with the old disks.  What 
you're trying to do is find the original disk order, and if you fail 
multiple disks, that ordering info is lost AFAIK.

Here[1] are the combinations that would be tried for a four drive raid 
array.  That's 24 combinations for 4 drives, 120 for 5 drives and a 
whopping 720 for 6 drives.  I have four drives, but even running the 
commands and keeping track of the combinations on paper 24 times is 
enough. 
I have an update that enforces the "missing" entry to avoid array 
reconstruction (which would destroy the data on the array if 
reconstructed improperly).

This version includes code to create the array and to use tune2fs, mount, 
and e2fsck to verify each candidate ordering.  Oh, and there is a 
hard-coded chunk size of 256K, which is very important to get right in recovery.
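
One aside worth trying first: if any member disk still has a readable md
superblock, the original chunk size (and RAID level) can be read back from it
rather than guessed; /dev/sda3 here is just an example device:

```shell
# The md superblock on a surviving member records the array geometry.
mdadm --examine /dev/sda3 | grep -Ei 'chunk|level'
```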

With all of these combinations, I still haven't found one that passes 
the tests.  Anyone have any ideas?

Thanks,
Mike
#!/bin/ash
set -e
#set -x

# Move the first argument to the end of the list.
rotate() {
    local last_var=$1
    shift
    echo $@ $last_var
}

# Drop the first element of the list.
cut_one() {
    shift
    echo $@
}

# Rotate only the tail of the list, leaving the first $1 elements fixed.
rotate_part() {
    local no_rotate=""
    local r_to_shift=$1
    shift
    while [ $r_to_shift -gt 0 ]; do
        no_rotate="${no_rotate# }$1 "
        shift
        r_to_shift=$(( $r_to_shift - 1 ))
    done
    echo "$no_rotate$(rotate $@)"
}

# Recursively generate every ordering and test-assemble the array with it.
do_it() {
    local shift_factor="$1"
    shift
    local my_partitions="$@"
    local d_shift=$(( $num_drives - $shift_factor ))
    if [ $shift_factor -lt $(( $num_drives - 1 )) ]; then
        while [ 0 -lt $d_shift ]; do
            do_it $(( $shift_factor + 1 )) "$my_partitions"
            my_partitions=$(rotate_part $shift_factor $my_partitions)
            d_shift=$(( $d_shift - 1 ))
        done
    else
        echo -n "$my_partitions:"
        mdadm -S "$array_dev" > /dev/null 2>&1
        mdadm -C "$array_dev" -c 256 -l 5 -n $num_drives $my_partitions --force --run > /dev/null 2>&1 \
            && tune2fs -l "$array_dev" > /dev/null \
            && echo "tune2fs: $my_partitions" >> /tmp/abc.log \
            && mount -t ext3 "$array_dev" /mnt/test -o ro \
            && echo "mount: $my_partitions" >> /tmp/abc.log \
            && umount "$array_dev" \
            && e2fsck -fn "$array_dev" \
            && echo "e2fsck: $my_partitions" >> /tmp/abc.log \
            && sleep 10
        echo
    fi
}

#partitions="missing disc0/part2 disc2/part2 disc3/part2"
array_dev=$1
shift

#Add "missing" drive to keep the MD RAID driver from 
#starting a reconstruction thread.
partitions=$@

#num_drives counts the partitions given; "missing" replaces one of them
#(via cut_one below), so the total stays the same.
num_drives=0
for i in $partitions; do
num_drives=$(( $num_drives + 1 ))
done

[ ! -e /mnt/test ] && mkdir /mnt/test
[ -e /tmp/abc.log ] && mv --backup=numbered /tmp/abc.log /tmp/abc.log.bak

for i in $partitions; do
do_it 0 "missing $(cut_one $partitions)"
partitions=$(rotate $partitions)
done


MD software raid & multiple disk failure recovery help script

2004-11-01 Thread Mike Fedyk
Hi all,
I just wrote a script that runs a brute-force attack against a RAID5 array 
that has had multiple drives removed while active.

Yep, that's what I did, and the last resort (from everywhere I could find 
with Google) was to use the old mkraid tool, if I had a raidtab.  I have 
been using mdadm for a while now and was not looking forward to working 
with the old tools and modifying the array manually.

This script will take two arguments, the md device, and then a space 
separated list of devices that are within the array.

Then it will create a new array (make sure you have one missing drive so 
that it doesn't try syncing the disks) with the old disks.  What you're 
trying to do is find the original disk order, and if you fail multiple 
disks, that ordering info is lost AFAIK.

Here[1] are the combinations that would be tried for a four drive raid 
array.  That's 24 combinations for 4 drives, 120 for 5 drives and a 
whopping 720 for 6 drives.  I have four drives, but even running the 
commands and keeping track of the combinations on paper 24 times is enough.
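
The counts above are just n! orderings of n members; a minimal sh sketch of
the arithmetic, to show where 24/120/720 come from:

```shell
# Number of drive orderings the brute force must try: n factorial.
fact() {
    n=$1 r=1
    while [ "$n" -gt 1 ]; do
        r=$(( r * n ))
        n=$(( n - 1 ))
    done
    echo "$r"
}
fact 4   # prints 24  (4 drives)
fact 5   # prints 120 (5 drives)
fact 6   # prints 720 (6 drives)
```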

I developed against ash since I need to be able to run this under 
busybox.  It just outputs combinations like [1], and doesn't call any 
commands in this version.  I just need some review of the code for logic 
errors and bashisms (bash being what I usually write shell scripts against).

I have attached, and pasted[2] the code.
Thanks,
Mike
[1]
sda3 sdb3 sdc3 sdd3
sda3 sdb3 sdd3 sdc3
sda3 sdc3 sdd3 sdb3
sda3 sdc3 sdb3 sdd3
sda3 sdd3 sdb3 sdc3
sda3 sdd3 sdc3 sdb3
sdb3 sdc3 sdd3 sda3
sdb3 sdc3 sda3 sdd3
sdb3 sdd3 sda3 sdc3
sdb3 sdd3 sdc3 sda3
sdb3 sda3 sdc3 sdd3
sdb3 sda3 sdd3 sdc3
sdc3 sdd3 sda3 sdb3
sdc3 sdd3 sdb3 sda3
sdc3 sda3 sdb3 sdd3
sdc3 sda3 sdd3 sdb3
sdc3 sdb3 sdd3 sda3
sdc3 sdb3 sda3 sdd3
sdd3 sda3 sdb3 sdc3
sdd3 sda3 sdc3 sdb3
sdd3 sdb3 sdc3 sda3
sdd3 sdb3 sda3 sdc3
sdd3 sdc3 sda3 sdb3
sdd3 sdc3 sdb3 sda3
[2]
#!/bin/ash
set -e
#set -x
rotate() {
    local last_var=$1
    shift
    echo $@ $last_var
}
rotate_part() {
    local no_rotate=""
    local r_to_shift=$1
    shift
    while [ $r_to_shift -gt 0 ]; do
        no_rotate="${no_rotate# }$1 "
        shift
        r_to_shift=$(( $r_to_shift - 1 ))
    done
    echo "$no_rotate$(rotate $@)"
}
do_it() {
    local shift_factor="$1"
    shift
    local my_partitions="$@"
    local d_shift=$(( $num_drives - $shift_factor ))
    if [ $shift_factor -lt $(( $num_drives - 1 )) ]; then
        while [ 0 -lt $d_shift ]; do
            do_it $(( $shift_factor + 1 )) "$my_partitions"
            my_partitions=$(rotate_part $shift_factor $my_partitions)
            d_shift=$(( $d_shift - 1 ))
        done
    else
        echo "$my_partitions"
    fi
}
#partitions="missing disc0/part2 disc2/part2 disc3/part2"
array_dev=$1
shift
partitions=$@
num_drives=0
for i in $partitions; do
    num_drives=$(( $num_drives + 1 ))
done
do_it 0 "$partitions"



Re: SCSI disk failure

2000-03-07 Thread kmself
On Mon, Mar 06, 2000 at 11:07:47AM +0100, Anton Emmerfors wrote:
> Hi,
> 
> This is not strictly Debian related, but lots of competent people
> dwell on this list so...
> 
> Yesterday I came home to find that my Quantum Fireball SCSI-disk had
> produced an "Unrecoverable read error". From kern.log:
> 
> --8<--
> kernel: scsi0: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 
> 25 73 90 00 00 02 00
> kernel: Info fld=0x257390, Current sd08:05: sense key Medium Error
> kernel: Additional sense indicates Unrecovered read error
> kernel: scsidisk I/O error: dev 08:05, sector 221318
> kernel: (scsi0:0:0:0) Performing Domain validation.
> kernel: (scsi0:0:0:0) Successfully completed Domain validation.
> 
> and so on for a few more sectors...
> --8<--
> 
> All sectors were within sda5 so I thought this would be fixable.  I
> tried fscking it and when that didn't work, scsiformat (it only served
> as temporary backup space -- no important data) but not even a
> lowlevel format can be performed. scsiformat complains that:
> 
> --8<--
> SCSI device sda: hdwr sector= 512 bytes. Sectors= 6328861 [3090 MB]
> [3.1 GB]
>  sda:scsi0: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (6) 00
>  00 00 02 00
> Current error sd08:00: sense key Medium Error
> Additional sense indicates Unrecovered read error
> scsidisk I/O error: dev 08:00, sector 0
>  unable to read partition table
> --8<--
> 
> Is there any other method, however ugly, to get this disk into shape
> or should I consider it to be in doorstop mode?

The only time I've run across SCSI errors is on a Jaz drive (and then
frequently).  My experience is that once I reach the stage of sense
errors, the drive's days are decidedly numbered.  Move what you can
elsewhere and scrap the disk, IMO.  If it's still under warranty, you might
want to check into getting a replacement.  Most media these days carries an
extended guarantee period.

-- 
Karsten M. Self (kmself@ix.netcom.com)
What part of "Gestalt" don't you understand?

Scope out Scoop:  http://scoop.kuro5hin.org/
Nothin' rusty about Kuro5hin:  http://www.kuro5hin.org/


SCSI disk failure

2000-03-07 Thread Anton Emmerfors
Hi,

This is not strictly Debian related, but lots of competent people
dwell on this list so...

Yesterday I came home to find that my Quantum Fireball SCSI-disk had
produced an "Unrecoverable read error". From kern.log:

--8<--
kernel: scsi0: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 25 
73 90 00 00 02 00
kernel: Info fld=0x257390, Current sd08:05: sense key Medium Error
kernel: Additional sense indicates Unrecovered read error
kernel: scsidisk I/O error: dev 08:05, sector 221318
kernel: (scsi0:0:0:0) Performing Domain validation.
kernel: (scsi0:0:0:0) Successfully completed Domain validation.

and so on for a few more sectors...
--8<--

All sectors were within sda5 so I thought this would be fixable.  I
tried fscking it and when that didn't work, scsiformat (it only served
as temporary backup space -- no important data) but not even a
lowlevel format can be performed. scsiformat complains that:

--8<--
SCSI device sda: hdwr sector= 512 bytes. Sectors= 6328861 [3090 MB]
[3.1 GB]
 sda:scsi0: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (6) 00
 00 00 02 00
Current error sd08:00: sense key Medium Error
Additional sense indicates Unrecovered read error
scsidisk I/O error: dev 08:00, sector 0
 unable to read partition table
--8<--

Is there any other method, however ugly, to get this disk into shape
or should I consider it to be in doorstop mode?

TIA!

/regards Anton
-- 
Bare feet magnetize sharp metal objects so they point upward from the
floor -- especially in the dark.


Adaptec 2940U - boot disk failure?

1996-12-13 Thread Benedikt Eric Heinen

Hi there,

  today I tried to install the latest debian on a PPro180 system with an
Adaptec 2940U. Unfortunately, no boot-disk seems to work for that system.
The strangest thing I get is 'Controller at 0x378 doesn't react' (the
controller is actually located at 0xe000-0xefff, IRQ 11).

  Any ideas as to what could be done here?

  Benedikt

signoff

---
 Benedikt Eric Heinen  -  Muehlemattstrasse 53  -  CH3007 Bern  -   SWITZERLAND
 email: [EMAIL PROTECTED]phone: ++41.79.3547891


RIOT, n.  A popular entertainment given to the military by innocent bystanders.

 Ambrose Bierce  ``The Devil's Dictionary''


