Re: Advice for handling softraid reporting i/o error

Erling Westenvik Sun, 03 Feb 2013 08:15:48 -0800

On Mon, Feb 04, 2013 at 01:03:07AM +1100, Joel Sing wrote:
> On Mon, 4 Feb 2013, Erling Westenvik wrote:
> > On Sun, Feb 03, 2013 at 11:11:17AM +0530, Girish Venkatachalam wrote:
> > > I hate to say it but I am sure your hard disk is dying. Replace it
> > > ASAP
> >
> > No no, that's all right. Death is an inevitable part of life. I know
> > the disk is dying and I'm going to replace it (or just throw away
> > the machine which is a piece of junk anyway) but I'd love to get out
> > of it the amendments to its last will before it passes out
> > completely.
> >
> > When a NON-ENCRYPTED disk has damaged areas one may still be able to
> > access the undamaged areas upon a reboot - possibly by mounting it
> > as a secondary disk on a working system and using various recovery
> > tools, etc.
> >
> > However: the last time I had an ENCRYPTED disk with damaged areas,
> > the whole disk got rendered useless. It wouldn't respond to
> > keydisk/passphrase and hence there was no way to access "undamaged"
> > data.
> >
> > The machine is still powered on. It still answers ping but not ssh.
> > When typing on the keyboard, characters get echoed on the screen.
> > Do I have any options besides rebooting and praying?
> 
> None. Well, aside from a custom kernel.
> 
> One of the current "features" with softraid (regardless of discipline)
> is that if a drive reports an I/O error, we mark the given chunk as
> being offline. In the case of disciplines that have redundant data,
> this is exactly what we want, since it should force failover to an
> online chunk. However, in the case of disciplines that do not have
> redundancy, the single chunk failure results in the entire volume
> going offline.
> 
> I suspect this is what has happened. You have not mentioned how the
> crypto volume is used, however I'm going to guess that you either have
> your entire system on it, or at least some critical parts of your
> system. Since it has gone offline things have stopped working and
> there is no way to recover from this without rebooting.
> 
> I plan on changing softraid so that disciplines without redundant data
> simply pass the failure from the underlying chunk up to userland, but
> leave the volume state alone - after all, you can attempt to recover
> data from an online volume, which is much more useful than losing the
> lot in one hit.
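
To make the two behaviours described above concrete, here is a toy model in
Python. It is only a sketch of the policy as stated in the quoted text, not
softraid's actual code: the names (Volume, State, on_chunk_io_error,
keep_volume_online) are hypothetical, and the "degraded" state for a
redundant volume that has lost a chunk is an assumption of the model.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ONLINE = "online"
    DEGRADED = "degraded"   # assumed state: redundant volume missing a chunk
    OFFLINE = "offline"


@dataclass
class Volume:
    redundant: bool                              # True for a mirror, False for crypto
    chunks: list = field(default_factory=list)   # one State per chunk
    state: State = State.ONLINE


def on_chunk_io_error(vol, idx, keep_volume_online=False):
    """Handle an I/O error reported by chunk `idx`.

    Returns True if the I/O can fail over to another online chunk,
    False if the error has to be passed up to the caller.
    """
    if not vol.redundant and keep_volume_online:
        # Planned behaviour: report the failure, leave chunk and volume alone.
        return False

    # Current behaviour: the failing chunk is marked offline.
    vol.chunks[idx] = State.OFFLINE
    survivors = [c for c in vol.chunks if c is State.ONLINE]

    if vol.redundant and survivors:
        vol.state = State.DEGRADED
        return True

    # No redundancy (or no chunk left): the whole volume goes offline.
    vol.state = State.OFFLINE
    return False


crypto = Volume(redundant=False, chunks=[State.ONLINE])
on_chunk_io_error(crypto, 0)                            # volume ends up offline
crypto = Volume(redundant=False, chunks=[State.ONLINE])
on_chunk_io_error(crypto, 0, keep_volume_online=True)   # volume stays online
```

In both crypto cases the failed I/O is still an error for the caller; the
difference is only whether the rest of the volume remains reachable
afterwards.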


Ok, I'm getting it. Thanks. I always seem to forget to mention something
important. Sorry for that. The setup is based on an article on
undeadly.org by Stefan Sperling:

http://undeadly.org/cgi?action=article&sid=20110530221728

That's an fdisk partition spanning the whole of one physical disk (wd0)
and three disklabel partitions a, b and d on that, with partition d being
the crypto volume and the keying material stored on a USB key disk.

On a couple of other encrypted machines I have, I've started to use the
new boot code (which works great, but which I so far haven't been able
to make work with a key disk).

Hopefully some of your comments above - especially the last paragraph
about volumes going offline - will make it into the relevant
documentation. I suspect problems like mine are likely to arise more
frequently as more and more people start to use softraid.
