Raffaele BELARDI <[EMAIL PROTECTED]> posted [EMAIL PROTECTED],
excerpted below, on  Tue, 20 Nov 2007 08:47:32 +0100:

> So my hypothesis is that the bad blocks or sectors at the beginning of
> the partition were not copied, or only partly copied, by dd, and due to
> this the superblocks are all shifted down. Although I don't like to
> access again the hw, maybe I should try: # dd conv=noerror,sync bs=4096
> if=/dev/hdb of=/mnt/disk_500/sdb.img
> 
> to get an aligned image. Problem is I don't know what bs= should be.
> Block size, so 4k?
> 
> Any other option I might have?

This sounds reasonable.  I run reiserfs here and don't know a whole lot 
about ext2/3/4, so won't even attempt an opinion at that level of 
detail.  (That's why I was so vague about the actual recovery procedure 
once you have a copy to work with -- I wasn't going to try to go there.)

However, I can say this.  Based on my experience with recovery on 
reiserfs (and on the reiserfs and dd-rescue recovery notes, so it's not 
just me), the block size doesn't necessarily have to match the 
filesystem's, since dd copies "raw" -- the data it gets, it gets, and 
the data it doesn't, well...  It also keeps everything in the same 
serial order, so that's not an issue either.  What the block size DOES 
affect is how much data is operated on at once: when dd hits bad 
sectors, the block size is the unit that determines how much data goes 
missing.
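To illustrate the alignment point with a quick sketch (on an ordinary 
file, since you obviously can't simulate bad sectors this way): with 
conv=sync, dd pads any short read out to the full block size with NULs, 
so everything after a bad or short read stays at its aligned offset in 
the image:

```shell
# Hypothetical demo file standing in for the disk: 10 KiB is two full
# 4 KiB blocks plus a 2 KiB partial block.
dd if=/dev/zero of=demo.img bs=1024 count=10 2>/dev/null

# noerror: keep going past read errors; sync: pad every input block to
# bs with NULs, so later data stays at its aligned offset.
dd conv=noerror,sync bs=4096 if=demo.img of=demo.out 2>/dev/null

# The partial third block gets padded, so the output is 3 * 4096 bytes.
stat -c %s demo.out    # prints 12288 (GNU stat)
```

That padding is exactly why conv=noerror alone isn't enough if you want 
the superblocks aligned in the image -- without sync, a failed read 
just shortens the output and shifts everything after it.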

Working on a good disk, a relatively large block size (as long as it 
can be buffered in memory) is often more efficient, that is, faster, 
because the big blocks mean lower processing overhead.  On a partially 
bad disk, larger blocks will still let it cover the good areas faster 
(though that's trivial time anyway, compared to the time spent trying 
to access the bad blocks), AND, because each block is larger, it SHOULD 
mean fewer blocks to retry and retry before giving up in the bad areas, 
so it's faster there as well.

The flip side of faster access over the bad areas is that, as I said, 
the block is the chunk that gets declared bad, so the larger the block 
size you choose, the more potentially recoverable data gets thrown out 
whenever an entire block is declared bad.
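One way to split the difference (and I believe this is roughly what the 
dedicated dd-rescue tools automate for you) is two passes: a big block 
size first, then re-reading just the bad stretches at a small block 
size, seeking to the same offset in the image so nothing shifts.  A 
sketch using ordinary files as stand-ins -- the names, sizes, and the 
"lost chunk" offset are all hypothetical:

```shell
# Stand-in for the failing disk (you'd use /dev/hdb): 256 KiB of data.
dd if=/dev/urandom of=disk.bin bs=1024 count=256 2>/dev/null

# Pass 1: big blocks -- fast over the good areas, but a failed read
# would cost a whole 64 KiB chunk.
dd conv=noerror,sync bs=65536 if=disk.bin of=sdb.img 2>/dev/null

# Pass 2: say pass 1 lost the chunk at byte offset 131072.  Re-read
# just that stretch at 512-byte granularity into the SAME spot.
# skip/seek count in units of bs; notrunc keeps the rest of the image.
off=131072
dd conv=noerror,sync,notrunc bs=512 \
   skip=$((off / 512)) seek=$((off / 512)) count=$((65536 / 512)) \
   if=disk.bin of=sdb.img 2>/dev/null

cmp disk.bin sdb.img && echo "image intact"
```

On a healthy file both passes succeed, of course; the point is the 
skip/seek arithmetic that lets the fine-grained pass land in the right 
place without disturbing the rest of the image.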

As for working off the bad disk vs working off an image of it: as long 
as you can continue to recover data off the bad disk, you can keep 
trying to use it.  The problem, of course, is that every access might 
be your last, and each pass through may lose a few more blocks of data 
at the margins.

So it's up to you.  The aligned image will certainly be easier to work 
with, but you might not be able to get the same amount of valid data off.
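For what it's worth, once you do have an image, you can poke at it with 
zero risk to the original by loop-mounting it read-only.  The offset 
and mount point below are hypothetical -- check your actual partition 
table (fdisk -l works on an image file, too) for where the filesystem 
really starts:

```shell
# Classic first-partition start: 63 sectors * 512 bytes per sector.
offset=$((63 * 512))
echo "$offset"    # prints 32256

# Hypothetical mount point; ro + loop means the image is never written.
# mount -o ro,loop,offset=$offset sdb.img /mnt/rescue
```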

... You never mentioned exactly what happened to the disk.  Mine was 
overheating.  I live in Phoenix, AZ, and my AC went out in the middle of 
the summer, with me gone and the computer left running.  With outside 
temps often reaching close to 50 C (122 F), the temps inside with the AC 
off could have easily reached 60 C (140 F).  Ambient case air temps could 
therefore have reached 70 C, and with the drive spinning in that... one 
can only guess what temps it reached!

Well, rather obviously, the platters expanded and the heads crashed, 
grooving out a circle in the platter wherever they happened to be at 
the time, plus wherever the still-running system told them to seek.  
However, once I came home and realized what had happened, I shut 
down and let everything cool down.  After replacing the AC, with 
everything running normal temps again, I was able to boot back up.

I ended up with two separate heavily damaged areas in which I could 
recover little if anything, but fortunately, the partition table and 
superblocks were intact.  I also had been running backup partition copies 
of most of my valuable stuff, by partition, and was able to recover most 
of it from that (barring the new stuff since my last backup, which was 
longer ago than it should have been), since they had been unmounted at 
the time and therefore didn't have the heads seeking into them, only 
across them a few times.

Actually, perhaps surprisingly, I was able to run those disks for some 
time without any known additional damage.  I did switch disks as soon as 
possible, because I was leery of continuing to depend on the partially 
bad ones, but in the mean time, I just checked off the affected 
partitions as dead, and continued to use the others without issue.  In 
fact, I still have the disk, and might still be using it for extra 
storage, except that was the second disk I had lost in two years (looking 
back, the one I'd lost the previous year was probably heat related as 
well, as it had the same failure pattern, and the AC wasn't doing so well 
even then), and I decided to switch to RAID and go slower speed but 
longer-warranty (5 yr) Seagate drives.  Those are now going into their 
third year, without issue (and with a new AC with cooling capacity to 
spare, so hopefully it'll be several years before I need to worry about 
/that/ issue again), but at least now I have the RAID backing me up, with 
most of the system on kernel/md RAID-6, so I can lose up to two of the 
four drives and maintain data integrity.  I am, however, already thinking 
about how I'll do it better next time, now that I've a bit of RAID 
experience under my belt. =8^)

So anyway, if it was heat related, chances are pretty decent it'll 
remain relatively stable, with no additional data loss, as long as you 
keep a pretty strict watch on the temps and don't let it overheat 
again.  That was my experience this last time, when I know it was heat 
related, and the time before, which had the same failure pattern.  Of 
course, you never can tell, but that has been my experience with 
heat-related disk failures, anyway.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

-- 
[EMAIL PROTECTED] mailing list
