On Tue, May 22, 2012 at 12:42:02PM +0400, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".

I think this is at least in part an issue with older code.  There have
been various fixes for hangs/restarts/incomplete replaces and sparings
over the time since.  

> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
>
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?
>    Did the system complain (much) about the renaming of the
>    device compared to paths embedded in pool/vdev headers?
>    Did you do anything manually to remedy that (forcing
>    import, DDing some handcrafted uberblocks, anything?)

I've done it a couple of times at least:

 * a failed disk in a raidz1, where i didn't trust that the other
   disks didn't also have errors.  Basically did a ddrescue from one
   disk to the new. I think these days, a 'replace' where the
   original disk is still online will use that content, like a
   hotspare replace, rather than assume it has gone away and must be
   recreated, but that wasn't the case at the time.

 * Where I had an iscsi mirror of a laptop hard disk, but it was out
   of date and had been detached when the laptop iscsi initiator
   refused to start.  Later, the disk developed a few bad sectors.  I
   made a new submirror, let it sync (with the error still), then
   blatted bits of the old image over the new in the areas where the
   bad sectors where being reported.  Scrub again, and they were fixed
   (as well as some blocks on the new submirror repaired coming back
   up to date again). 

> 2) How did you "treat errors as expected" during scrub?

Pretty much as you did: decline to panic and restart scrubs.

--
Dan.

Attachment: pgpis9PrONjka.pgp
Description: PGP signature

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Reply via email to