Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread Benny Lofgren
On 2011-01-27 06.02, Ted Unangst wrote:
> On Wed, Jan 26, 2011 at 10:00 PM, Amit Kulkarni amitk...@gmail.com wrote:
>> pardon my ignorance but if you restored your data already, why bother
>> investigating disk failure?
>
> Unless they are all the same person, there seems to be a sudden rash
> of people who want to bring a disk back from the dead because they are
> unwilling or unable to do the math on how much disks cost, how much
> time costs, and what the future integrity of their data is worth.  I
> don't know why this is, but I do know "disks die, buy new ones" is the
> correct answer to give them.

I fully understand the OP's need to investigate this problem further,
regardless of whether there was any significant data loss or not.

It's a matter of uptime.

The indicated behaviour, that the system more or less freezes when
encountering a simple sector read error, is indeed disturbing. For
example, my own reasons for using mirroring are exclusively so that a
system can remain online and operational in case of a disk failure.

If a disk in a mirror or redundant stripe set fails in a hotpluggable
hardware environment there really should be no need for service
interruption. The disk should be able to be replaced on the fly, or at
the very least during a controlled service window. In this case, that
obviously wouldn't work.

(The reason I'm butting into this thread is that I'm currently
investigating a similar but probably totally unrelated problem, where a
system under high load (disk activity) claims there are sector read
errors, and then stops responding in a similar fashion to the OP's
system. Saturate one, two or three disks with reads - no problem. Add a
fourth disk and after a while the problem appears. If I can determine
beyond reasonable doubt that this isn't a hardware problem, I'll submit
a bug report.)


Regards,
/Benny

-- 
internetlabbet.se     /   work:   +46 8 551 124 80      / Words must
Benny Lofgren        /    mobile: +46 70 718 11 90     /   be weighed,
                    /     fax:    +46 8 551 124 89    /    not counted.
                   /      email:  benny -at- internetlabbet.se



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread Benny Lofgren
On 2011-01-27 14.11, Ted Unangst wrote:
> On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren bl-li...@lofgren.biz wrote:
>> It's a matter of uptime.
>>
>> The indicated behaviour, that the system more or less freezes when
>> encountering a simple sector read error is indeed disturbing. For
>> example, my own reasons for using mirroring are exclusively so that a
>> system can remain online and operational in case of a disk failure.
>
> If that's why you're investigating, I'll save you some time.  The disk
> retry code will basically lock the system up while it's retrying.  If
> you don't like it, send a patch.

Well, fwiw I wasn't the one investigating this particular problem, but I
have no problem submitting patches in cases where I'm able to do
meaningful work. (The problem I mentioned investigating is in all
likelihood either driver-related or a hardware problem.)

I absolutely didn't mean to imply that "hey, this is broken, 'someone'
needs to spend time to fix it" - I fully realize that that someone may
very well be me. I apologize if I came across that way.

I was merely pointing out that the standard response of "disks break,
live with it", while ever true, is sometimes irrelevant to the problem.

Yes, disks break (I currently have approximately two dozen broken ones
in a box at the office waiting for an appointment with a sledgehammer),
and yes, we diligently keep backups (or are sorry we didn't) but that
doesn't solve the situation where you have a critical system that causes
pain if it goes offline.

I have never in almost thirty years in this business lost a single byte
of customer data to disk failure. I have however had cases of unplanned
downtime, and every time that happens is also a failure.

Designing redundancy into our systems helps only as far as the
nearest single point of failure, and if that point is the OS then I'd
say that is a problem (since it's not always feasible to build
redundancy using multiple servers).

I know I'm preaching to the choir here, and my only interest here is to
improve the robustness of an already incredibly robust system. I'll
certainly contribute to the best of my ability whenever I find fixable
problems.


Best regards,

/Benny




Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread Ted Unangst
On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren bl-li...@lofgren.biz wrote:
> It's a matter of uptime.
>
> The indicated behaviour, that the system more or less freezes when
> encountering a simple sector read error is indeed disturbing. For
> example, my own reasons for using mirroring are exclusively so that a
> system can remain online and operational in case of a disk failure.

If that's why you're investigating, I'll save you some time.  The disk
retry code will basically lock the system up while it's retrying.  If
you don't like it, send a patch.



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread L. V. Lammert
On Thu, 27 Jan 2011, Gordon Ferris wrote:

> We waited too long to replace the failed drive, so there were errors on
> both drives in the mirror, so the data was not completely restored.
> Backups were not as recent as we would have liked.  Since the drive
> didn't completely fail, it seemed worth trying to retrieve some data
> where possible from it.

dd_rescue will give you the best chance of recovering bad sectors.
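
The skip-and-continue idea behind rescue tools of this kind can be sketched in Python. This is a minimal illustration, not dd_rescue's actual algorithm: rescue_copy, reader_for_fd and flaky_reader are hypothetical names, and the 512-byte block size is an assumption. It also does nothing about the kernel-level retries discussed elsewhere in this thread, so each failed read may still stall the machine before the error is returned.

```python
import io
import os


def rescue_copy(read_block, total_size, out, block_size=512):
    """Copy total_size bytes block by block, zero-filling unreadable blocks.

    read_block(offset, length) must return bytes or raise OSError.
    Returns the byte offsets of the blocks that could not be read.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        want = min(block_size, total_size - offset)
        try:
            data = read_block(offset, want)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads so later blocks keep their positions.
        out.write(data[:want].ljust(want, b"\0"))
        offset += want
    return bad_offsets


def reader_for_fd(fd):
    """Build a read_block callable on top of os.pread for an open descriptor."""
    return lambda offset, length: os.pread(fd, length, offset)


def flaky_reader(offset, length):
    """Stand-in for a damaged disk: the block at byte offset 1024 always fails."""
    if offset == 1024:
        raise OSError(5, "Input/output error")
    return b"A" * length


if __name__ == "__main__":
    image = io.BytesIO()
    # Prints the byte offsets of the blocks that had to be zero-filled.
    print(rescue_copy(flaky_reader, 2048, image))
```

Against a real device one would pass reader_for_fd(os.open(path, os.O_RDONLY)) and the device size; how to obtain that size is left out because it differs between systems.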

Lee



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread Ted Unangst
On Thu, Jan 27, 2011 at 2:16 AM, Gordon Ferris
gordon.fer...@wfengineering.com wrote:
> 1.  Is it normal for the operating system to freeze when accessing damaged
> sectors - even if the only access is via a raw, unmounted partition?  This
> seems like a hardware problem to me, except that errors are logged to
> /var/log/messages as I described in the original post.

Yes.  It may not be desirable, but the retry code basically puts
everything else on hold while it's running.  It is a hardware problem
the operating system is trying to overcome.

> 2.  What utilities will show which sectors are occupied by specific files?
> Ideally I could specify a range of sectors and a list of files using those
> sectors would be provided.  It would also be nice to specify files and be
> shown which sectors they occupy.  I've heard of the Coroner's Toolkit; are
> there any other recommendations?

I don't know of any.  If I needed to do something like this, I'd
probably start with fsck_ffs and modify as needed.  Actually, that's
what fsdb does already.  You probably just need it to print a little
more info and to walk the tree automatically.



[gordon.fer...@wfengineering.com: Re: Computer stops responding (freezes up) during uncorrectable data error]

2011-01-27 Thread Gordon Ferris
Thank you for the interest so far in my post.  

I never meant to imply "someone fix this now".  If that's how it came across, 
then I do apologize - that's not what I intended.

I am looking for more than the standard "disks break, live with it" answer.

I am surprised that the disk retry code doesn't time out after 5 minutes or 100 
retries or something like that.  Also, it seems odd that the system is still 
responsive when the first few error messages are written to console but then 
stops responding a few messages later.
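
The kind of bound described here (a cap on attempts plus a wall-clock limit) can be sketched in user space as follows. This is only an illustration of the policy in Python; it is not the OpenBSD kernel's retry code, and read_with_limits, max_attempts, max_seconds and clock are hypothetical names. The defaults of 100 attempts and 300 seconds simply mirror the figures mentioned above.

```python
import time


def read_with_limits(read_once, max_attempts=100, max_seconds=300.0,
                     clock=time.monotonic):
    """Call read_once() until it succeeds or a limit is reached.

    Raises TimeoutError once max_attempts reads have failed or max_seconds
    have elapsed, chaining the last OSError seen (if any).
    """
    start = clock()
    attempts = 0
    last_error = None
    while attempts < max_attempts and clock() - start < max_seconds:
        attempts += 1
        try:
            return read_once()
        except OSError as exc:
            last_error = exc
    raise TimeoutError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    def always_fails():
        # Stand-in for a read that hits a damaged sector every time.
        raise OSError(5, "Input/output error")

    try:
        read_with_limits(always_fails, max_attempts=3)
    except TimeoutError as exc:
        print(exc)
```

Note that the time limit is only checked between attempts: a single read that blocks inside the kernel cannot be cut short this way, which is the situation this thread is about.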

Also, I expected the unresponsiveness when the failed disk was mounted as part 
of the root filesystem - not when it is mounted as an auxiliary filesystem or 
not even mounted at all but simply accessed as a raw device.

I have trouble believing that I'm the first one to run into this, or at least 
the need to go back and forth between filesystem blocks and filenames.  But 
maybe I am.

Thanks Lee, for the dd_rescue suggestion.  I'll take a look at it.

Sincerely,
Gordon


-- 
Gordon Ferris
W.F. Engineering
Phone: +1 801-455-6108



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread Gordon Ferris
Thank you for the interest so far in my post.  

I never meant to imply "someone fix this now".  If that's how it came across, 
then I do apologize - that's not what I intended.

I am looking for more than the standard "disks break, live with it" answer.

I am surprised that the disk retry code doesn't time out after 5 minutes or 100 
retries or something like that.  Also, it seems odd that the system is still 
responsive when the first few error messages are written to console but then 
stops responding a few messages later.

Also, I expected the unresponsiveness when the failed disk was mounted as part 
of the root filesystem - not when it is mounted as an auxiliary filesystem or 
not even mounted at all but simply accessed as a raw device.

I have trouble believing that I'm the first one to run into this, or at least 
the need to go back and forth between filesystem blocks and filenames.  But 
maybe I am.

Thanks Lee, for the dd_rescue suggestion.

Thanks David, for the sleuthkit suggestion.

Sincerely,
Gordon

-- 
Gordon Ferris
W.F. Engineering
Phone: +1 801-455-6108



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-27 Thread David Vasek

On Thu, 27 Jan 2011, Gordon Ferris wrote:

2.  What utilities will show which sectors are occupied by specific 
files?  Ideally I could specify a range of sectors and a list of files 
using those sectors would be provided.  It would also be nice to specify 
files and be shown which sectors they occupy.  I've heard of the 
Coroner's Toolkit; are there any other recommendations?


Try looking at sleuthkit. Handy set of tools. Hopefully it could be used 
in your case too.


Sleuthkit seems to have some limitations on OpenBSD. I needed to use it 
recently, but it did not recognise the FFS filesystem size when run on 
OpenBSD. I had to compile and run it on some linux live CD and it worked 
well (on OpenBSD FFS) from there. I was able to get a listing of blocks 
occupied by individual files.
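
Given a per-file block listing like the one described, the reverse lookup (which files contain a known bad block) can be sketched as below. This is a hedged Python illustration, not a Sleuthkit feature: files_touching and EXAMPLE_EXTENTS are hypothetical, the file names are made up, and the input format (a path mapped to inclusive block ranges) is an assumption. The bad block numbers reuse the fsbn values from the logged errors, and treating the logged transfer range as one file's extent is also an assumption. Both inputs must be in the same unit and relative to the same origin; converting between filesystem blocks and 512-byte sectors is left out.

```python
def files_touching(bad_blocks, file_extents):
    """Return {path: [bad blocks inside that file]} for files hit by bad blocks.

    file_extents maps a path to a list of inclusive (first, last) block ranges.
    """
    bad = sorted(set(bad_blocks))
    hits = {}
    for path, extents in file_extents.items():
        matched = [b for b in bad
                   if any(first <= b <= last for first, last in extents)]
        if matched:
            hits[path] = matched
    return hits


# Hypothetical listing: the names and the second range are invented; the first
# range is the transfer range that appears in the logged error messages.
EXAMPLE_EXTENTS = {
    "file_a": [(40104952, 40104983)],
    "file_b": [(50000000, 50000031)],
}

if __name__ == "__main__":
    print(files_touching([40104952, 40104976], EXAMPLE_EXTENTS))
```

A linear scan like this is fine for a handful of bad blocks; for long lists, sorting the extents and using bisection would be the obvious refinement.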


Regards,
David



Computer stops responding (freezes up) during uncorrectable data error

2011-01-26 Thread Gordon Ferris
I have a disk that has failed; there seem to be damaged areas that 
cause errors when specific files are accessed.  This disk was one of a two-disk 
mirror running raidframe.  The disk has been replaced and the original machine 
is back up and running again.

However, as I use a second computer to investigate the failed disk, I 
have been puzzled that this second computer locks up and stops responding when 
I try copying files that include various damaged areas of the disk.  

This second computer has an installation of OpenBSD 4.6, with the 
kernel recompiled to support raidframe (so I can access the data on the 
partition); I have also adjusted the drive numbering so that the failed drive 
believes it is the only disk present in its mirror.  On this second computer, 
the operating system is on a completely different physical disk; the failed 
disk is not necessary for a completely functional system.

However, even though this computer doesn't use the failed disk for its 
root filesystem, the computer still freezes up and stops responding when the 
bad sectors are accessed.

I even tried using the dump and dd utilities to access the disk 
through a raw, unmounted partition - but the host computer still freezes up and 
stops responding after adding a few lines to /var/log/messages.

I was expecting the error messages, but not expecting the host system 
to freeze up - even the mouse stops responding.  It's irritating to have to 
reboot the computer each time I access one of the damaged sectors.
I thought this problem might be caused if the drive controller hardware 
never returns control back to the operating system once the disk error occurs 
too many times.  But the error messages do end up in /var/log/messages, so 
control does return to the operating system for at least a little while.

And yes, repeatedly accessing the same file generates the error 
messages referring to the same sectors.

1.  How can I attempt to access the damaged sectors without causing the entire 
computer to freeze up and stop responding?

2.  I have used stat, ncheck, and fsdb to find and examine the inodes for 
various files.  Is there a utility to show which sectors of the filesystem 
and/or the drive are actually used by various files?

3.  How can I identify all the files that contain bad sectors without freezing 
up the computer on each file that contains one?

# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
# fsck -f /dev/rraid2d
** /dev/rraid2d
** File system is already clean
** Last Mounted on /home-big
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
452600 files, 69774853 used, 43730370 free (26658 frags, 5462964 blocks, 0.0% fragmentation)

# mount -r /dev/raid2d /mnt2
# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
/dev/raid2d on /mnt2 type ffs (local, read-only)

# dd conv=noerror,notrunc,sync \
 if=/mnt2/.../20198332.txt of=/dev/null count=1

The computer stopped responding but these messages were on the console 
and in /var/log/messages on rebooting:
/var/log/messages
Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:18 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:20 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:22 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying

And the error messages are repeatable (especially the failed block 
numbers) if I repeat the command:
Jan 26 10:40:19 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:21 one /bsd: wd0f: uncorrectable data 
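
The numbers in these messages can be cross-checked with a little arithmetic. The Python sketch below parses an error line (with its wrapped halves rejoined) and compares the partition-relative block (fsbn) with the whole-disk block (bn) and the cylinder/track/sector triple. The regular expression, the function names, and the geometry of 255 heads and 63 sectors per track are assumptions made for illustration. That geometry happens to reproduce the logged bn values, and the constant difference bn - fsbn would be consistent with wd0f starting at sector 27069525, but the disklabel is the authoritative source for the partition offset.

```python
import re

# Matches the "uncorrectable data error" lines once wrapped halves are rejoined.
LINE_RE = re.compile(
    r"(?P<part>wd\d+[a-p]): uncorrectable data error reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+) "
    r"\((?P<disk>wd\d+) bn (?P<bn>\d+); "
    r"cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)


def parse_error(line):
    """Return the fields of one error line as a dict, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {k: (v if k in ("part", "disk") else int(v))
            for k, v in m.groupdict().items()}


def chs_to_block(cn, tn, sn, heads=255, sectors=63):
    """Convert cylinder/track/sector to a linear block number (sn counted from 0)."""
    return (cn * heads + tn) * sectors + sn


SAMPLES = [
    "Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn "
    "40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), "
    "retrying",
    "Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn "
    "40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), "
    "retrying",
]

if __name__ == "__main__":
    for line in SAMPLES:
        e = parse_error(line)
        # Columns: fsbn, bn, bn - fsbn, block number recomputed from cn/tn/sn.
        print(e["fsbn"], e["bn"], e["bn"] - e["fsbn"],
              chs_to_block(e["cn"], e["tn"], e["sn"]))
```

If the offset differed between lines, or the recomputed block did not match bn, that would suggest the assumed units or geometry are wrong rather than that the log is inconsistent.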

Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-26 Thread Amit Kulkarni
pardon my ignorance but if you restored your data already, why bother
investigating disk failure?


Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-26 Thread Ted Unangst
On Wed, Jan 26, 2011 at 10:00 PM, Amit Kulkarni amitk...@gmail.com wrote:
> pardon my ignorance but if you restored your data already, why bother
> investigating disk failure?

Unless they are all the same person, there seems to be a sudden rash
of people who want to bring a disk back from the dead because they are
unwilling or unable to do the math on how much disks cost, how much
time costs, and what the future integrity of their data is worth.  I
don't know why this is, but I do know "disks die, buy new ones" is the
correct answer to give them.



Re: Computer stops responding (freezes up) during uncorrectable data error

2011-01-26 Thread Gordon Ferris
We waited too long to replace the failed drive, so there were errors on both 
drives in the mirror, so the data was not completely restored.  Backups were 
not as recent as we would have liked.  Since the drive didn't completely fail, 
it seemed worth trying to retrieve some data where possible from it.

1.  Is it normal for the operating system to freeze when accessing damaged 
sectors - even if the only access is via a raw, unmounted partition?  This 
seems like a hardware problem to me, except that errors are logged to 
/var/log/messages as I described in the original post.

2.  What utilities will show which sectors are occupied by specific files?  
Ideally I could specify a range of sectors and a list of files using those 
sectors would be provided.  It would also be nice to specify files and be shown 
which sectors they occupy.  I've heard of the Coroner's Toolkit; are there any 
other recommendations?

