This seems to have been a false alarm, sorry about that. As soon as I started
paying attention (logging zpool status, peeking around with zdb & mdb), the
resilver didn't restart unless provoked. A cleartext log entry would still
have been nice ("resilver restarted because c11t7 came back online").

One problem I can see is that the resilver always restarts when a device is
added back to the array. In my case the devices were only absent for a short
period (some SATA failure that corrected itself after running cfgadm -c
disconnect & connect), and it would have been beneficial to let the running
resilver finish first and only then start a new one for the data missing on
the re-added device. ZFS does have some intelligence here: not everything is
resilvered, only the blocks born after the outage.
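
For what it's worth, recovering a dropped disk went roughly like this (the
attachment point name is only an example; check cfgadm -al for the real one):

  cfgadm -al                    # list attachment points and their state
  cfgadm -c disconnect sata1/7  # power down the port the disk dropped from
  cfgadm -c connect sata1/7     # reconnect it
  # depending on the controller, a 'cfgadm -c configure sata1/7' may also be
  # needed before the disk shows up again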

Also, since I had a spare in the array, it kicked in, which was probably not
what I wanted, as that triggered a full resilver rather than a partial one.
After the fact I could not kick the spare out, nor make the resilvering
process forget about doing a full resilver. And now I have to replace it back
out and return it to being a cold spare.
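
If I read the docs right, a plain detach should release the spare once
c11t3d0p0 has finished resilvering (untested sketch, device names as in the
status output quoted below):

  # detach the hot spare from the spare-3 vdev; it should return to the
  # AVAILABLE state under 'spares'
  zpool detach tank c9d1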

But all's well that ends well... mostly. Devices still seem to drop off the
SATA bus at random. Maybe I'll cobble together a report and post it to
storage-discuss.

On Wed, Sep 29, 2010 at 8:13 PM, Tuomas Leikola <tuomas.leik...@gmail.com> wrote:

> The endless resilver problem still persists on OI b147. It restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated, as that takes days unless I can stop
> or pause the resilvering.
>
>
> On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola <tuomas.leik...@gmail.com> wrote:
>
>> Hi!
>>
>> My home server had some disk outages due to flaky cabling and whatnot, and
>> started resilvering to a spare disk. During this, another disk or two
>> dropped and were reinserted into the array. So no devices were actually
>> lost; they were just intermittently away for a while each.
>>
>> The situation is currently as follows:
>>   pool: tank
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
>> config:
>>
>>         NAME                       STATE     READ WRITE CKSUM
>>         tank                       ONLINE       0     0     0
>>           raidz1-0                 ONLINE       0     0     0
>>             c11t1d0p0              ONLINE       0     0     0
>>             c11t2d0                ONLINE       0     0     5
>>             c11t6d0p0              ONLINE       0     0     0
>>             spare-3                ONLINE       0     0     0
>>               c11t3d0p0            ONLINE       0     0     0  106M resilvered
>>               c9d1                 ONLINE       0     0     0  104G resilvered
>>             c11t4d0p0              ONLINE       0     0     0
>>             c11t0d0p0              ONLINE       0     0     0
>>             c11t5d0p0              ONLINE       0     0     0
>>             c11t7d0p0              ONLINE       0     0     0  93.6G resilvered
>>           raidz1-2                 ONLINE       0     0     0
>>             c6t2d0                 ONLINE       0     0     0
>>             c6t3d0                 ONLINE       0     0     0
>>             c6t4d0                 ONLINE       0     0     0  2.50K resilvered
>>             c6t5d0                 ONLINE       0     0     0
>>             c6t6d0                 ONLINE       0     0     0
>>             c6t7d0                 ONLINE       0     0     0
>>             c6t1d0                 ONLINE       0     0     1
>>         logs
>>           /dev/zvol/dsk/rpool/log  ONLINE       0     0     0
>>         cache
>>           c6t0d0p0                 ONLINE       0     0     0
>>         spares
>>           c9d1                     INUSE     currently in use
>>
>> errors: No known data errors
>>
>> And this has been going on for a week now, always restarting when it
>> should complete.
>>
>> The questions in my mind atm:
>>
>> 1. How can I determine the cause for each resilver? Is there a log?
>>
>> 2. Why does it resilver the same data over and over, and not just the
>> changed bits?
>>
>> 3. Can I force-remove c9d1, since it is no longer needed and c11t3 can be
>> resilvered instead?
>>
>> I'm running OpenSolaris b134, but the event originally happened on 111b. I
>> upgraded and tried quiescing snapshots and I/O, neither of which helped.
>>
>> I've already ordered some new hardware to recreate this entire array as
>> raidz2 among other things, but there's about a week of time when I can run
>> debuggers and traces if instructed to.
>>
>> - Tuomas
>>
>>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
