Re: [zfs-discuss] replacing a drive in a raidz vdev
> Yes. But it's going to be a few months.

I'll presume that we will get background disk scrubbing for free once you guys get bookmarking done. :)

--
Regards,
Jeremy
Re: [zfs-discuss] replacing a drive in a raidz vdev
On 12/5/06, Bill Sommerfeld [EMAIL PROTECTED] wrote:
> On Mon, 2006-12-04 at 13:56 -0500, Krzys wrote:
> > mypool2/[EMAIL PROTECTED]  34.4M      -   151G   -
> > mypool2/[EMAIL PROTECTED]   141K      -   189G   -
> > mypool2/d3                 492G   254G  11.5G   legacy
> >
> > I am so confused by all of this... Why is it taking so long to replace
> > that one bad disk?
>
> To work around a bug where a pool traversal gets lost when the snapshot
> configuration of a pool changes, both scrubs and resilvers start over
> again any time you create or delete a snapshot. Unfortunately, this
> workaround has problems of its own: if your inter-snapshot interval is
> less than the time required to complete a scrub, the resilver will never
> complete. The open bug is:
>
>   6343667 scrub/resilver has to start over when a snapshot is taken
>
> If it's not going to be fixed any time soon, perhaps we need a better
> workaround.

Is anyone internal working on this?

--
Regards,
Jeremy
Re: [zfs-discuss] replacing a drive in a raidz vdev
Jeremy Teo wrote:
> On 12/5/06, Bill Sommerfeld [EMAIL PROTECTED] wrote:
> > On Mon, 2006-12-04 at 13:56 -0500, Krzys wrote:
> > > mypool2/[EMAIL PROTECTED]  34.4M      -   151G   -
> > > mypool2/[EMAIL PROTECTED]   141K      -   189G   -
> > > mypool2/d3                 492G   254G  11.5G   legacy
> > >
> > > I am so confused by all of this... Why is it taking so long to
> > > replace that one bad disk?
> >
> > To work around a bug where a pool traversal gets lost when the snapshot
> > configuration of a pool changes, both scrubs and resilvers start over
> > again any time you create or delete a snapshot. Unfortunately, this
> > workaround has problems of its own: if your inter-snapshot interval is
> > less than the time required to complete a scrub, the resilver will
> > never complete. The open bug is:
> >
> >   6343667 scrub/resilver has to start over when a snapshot is taken
> >
> > If it's not going to be fixed any time soon, perhaps we need a better
> > workaround.
>
> Is anyone internal working on this?

Yes. But it's going to be a few months.

-Mark
Re: [zfs-discuss] replacing a drive in a raidz vdev
I am having no luck replacing my drive as well. A few days ago I replaced my drive and it's completely messed up now.

  pool: mypool2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 8.70% done, 8h19m to go
config:

        NAME              STATE     READ WRITE CKSUM
        mypool2           DEGRADED     0     0     0
          raidz           DEGRADED     0     0     0
            c3t0d0        ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
            replacing     DEGRADED     0     0     0
              c3t6d0s0/o  UNAVAIL      0     0     0  cannot open
              c3t6d0      ONLINE       0     0     0

errors: No known data errors

This is what I get; I am running Solaris 10 U2. Two days ago I saw it in the 2.00% range with about 10h remaining; now it is still going, and it has already been at least a few days since it started.

When I do zpool list:

NAME      SIZE   USED  AVAIL   CAP  HEALTH    ALTROOT
mypool2   952G   684G   268G   71%  DEGRADED  -

I have almost 1TB of space, but when I do df -k it shows me only 277GB; that is better than it only displaying 12GB as I saw yesterday.

mypool2/d3  277900047  12022884  265877163  5%  /d/d3

When I do zfs list I get:

NAME                        USED  AVAIL  REFER  MOUNTPOINT
mypool2                     684G   254G    52K  /mypool2
mypool2/d                   191G   254G   189G  /mypool2/d
mypool2/[EMAIL PROTECTED]    653M      -   145G  -
mypool2/[EMAIL PROTECTED]   31.2M      -   145G  -
mypool2/[EMAIL PROTECTED]   36.8M      -   144G  -
mypool2/[EMAIL PROTECTED]   37.9M      -   144G  -
mypool2/[EMAIL PROTECTED]   31.7M      -   145G  -
mypool2/[EMAIL PROTECTED]   27.7M      -   145G  -
mypool2/[EMAIL PROTECTED]   34.0M      -   146G  -
mypool2/[EMAIL PROTECTED]   26.8M      -   149G  -
mypool2/[EMAIL PROTECTED]   34.4M      -   151G  -
mypool2/[EMAIL PROTECTED]    141K      -   189G  -
mypool2/d3                  492G   254G  11.5G  legacy

I am so confused by all of this... Why is it taking so long to replace that one bad disk? Why such different results? What is going on? Is there a problem with my zpool/zfs combination? Did I do anything wrong? Did I actually lose data on my drive? If I had known it would be this bad, I would have just destroyed my whole zpool and zfs and started from the beginning, but I wanted to see how a replacement would go and what the process is like... I am so happy I have not used zfs in my production environment yet, to be honest with you...

Chris

On Sat, 2 Dec 2006, Theo Schlossnagle wrote:

> I had a disk malfunction in a raidz pool today. I had an extra one in the
> enclosure and performed a:
>
>   zpool replace pool old new
>
> and several unexpected behaviors have transpired: the zpool replace
> command hung for 52 minutes, during which no zpool commands could be
> executed (like status, iostat, or list). When it finally returned, the
> drive was marked as "replacing", as I expected from reading the man page.
> However, its progress counter has not been monotonically increasing. It
> started at 1%, then went to 5%, then back to 2%, etc. I just logged in to
> see if it was done and ran zpool status and received:
>
>   pool: xsr_slow_2
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress, 100.00% done, 0h0m to go
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         xsr_slow_2                 ONLINE       0     0     0
>           raidz                    ONLINE       0     0     0
>             c4t600039316A1Fd0s2    ONLINE       0     0     0
>             c4t600039316A1Fd1s2    ONLINE       0     0     0
>             c4t600039316A1Fd2s2    ONLINE       0     0     0
>             c4t600039316A1Fd3s2    ONLINE       0     0     0
>             replacing              ONLINE       0     0     0
>               c4t600039316A1Fd4s2  ONLINE    2.87K   251     0
>               c4t600039316A1Fd6    ONLINE       0     0     0
>             c4t600039316A1Fd5s2    ONLINE       0     0     0
>
> I thought to myself, if it is 100% done, why is it still replacing? I
> waited about 15 seconds and ran the command again to find something
> rather disconcerting:
>
>   pool: xsr_slow_2
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress, 0.45% done, 27h27m to go
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         xsr_slow_2                 ONLINE       0     0     0
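For what it's worth, the df and zfs list numbers typically diverge like this because df only counts the space referenced by the mounted filesystem, while the dataset- and pool-level numbers also include what the snapshots are holding. A quick way to see where the space sits, assuming the -t and -o options shown here are available on this release:

    # Show only snapshots with their space columns: USED is space held
    # exclusively by each snapshot, REFER is the data a snapshot pins.
    zfs list -t snapshot -o name,used,referenced

In the listing above, the ten snapshots of mypool2/d together pin most of the difference between that filesystem's 189G referenced and the pool's 684G used.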
Re: [zfs-discuss] replacing a drive in a raidz vdev
On Mon, 2006-12-04 at 13:56 -0500, Krzys wrote:
> mypool2/[EMAIL PROTECTED]  34.4M      -   151G   -
> mypool2/[EMAIL PROTECTED]   141K      -   189G   -
> mypool2/d3                 492G   254G  11.5G   legacy
>
> I am so confused by all of this... Why is it taking so long to replace
> that one bad disk?

To work around a bug where a pool traversal gets lost when the snapshot configuration of a pool changes, both scrubs and resilvers start over again any time you create or delete a snapshot. Unfortunately, this workaround has problems of its own: if your inter-snapshot interval is less than the time required to complete a scrub, the resilver will never complete. The open bug is:

  6343667 scrub/resilver has to start over when a snapshot is taken

If it's not going to be fixed any time soon, perhaps we need a better workaround. Ideas:

  - Perhaps snapshots should be made to fail while a resilver (not a
    scrub!) is in progress.
  - Or maybe snapshots should fail only when a *restarted* resilver is in
    progress -- that way, if you can complete the resilver between two
    snapshot times, you don't miss any snapshots, but if it takes longer
    than that, snapshots are sacrificed in the name of pool integrity.

- Bill
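The first idea above can be approximated today on the snapshot side rather than inside zfs itself: have the periodic-snapshot cron job check for a running scrub or resilver and skip that interval. A minimal sketch, with a placeholder pool and snapshot naming scheme, assuming the "in progress" wording that zpool status prints while a scrub or resilver is running:

    #!/bin/sh
    # Skip the periodic snapshot while a scrub/resilver is running, so
    # the snapshot does not reset its progress (bug 6343667).
    POOL=mypool2
    SNAP="$POOL/d@auto-`date +%Y%m%d%H%M`"

    # zpool status prints a scrub: line such as
    # "resilver in progress, 8.70% done, 8h19m to go" while resilvering.
    if zpool status "$POOL" | grep 'in progress' >/dev/null 2>&1; then
        echo "scrub/resilver running on $POOL; skipping $SNAP" >&2
        exit 0
    fi

    zfs snapshot "$SNAP"

The cost is one missed snapshot per interval while the resilver runs, which is the same trade-off as above: snapshots are sacrificed so the resilver can actually finish.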
Re: [zfs-discuss] replacing a drive in a raidz vdev
On Sat, 2006-12-02 at 00:08 -0500, Theo Schlossnagle wrote:
> I had a disk malfunction in a raidz pool today. I had an extra one in the
> enclosure and performed a:
>
>   zpool replace pool old new
>
> and several unexpected behaviors have transpired: the zpool replace
> command hung for 52 minutes, during which no zpool commands could be
> executed (like status, iostat, or list).

So, I've observed that zfs will continue to attempt to do I/O to the outgoing drive while a replacement is in progress. (This seems counterintuitive; I'd expect that you'd want to touch the outgoing drive as little as possible, perhaps only attempting to read from it in the event that a block wasn't recoverable from the healthy drives.)

> When it finally returned, the drive was marked as "replacing", as I
> expected from reading the man page. However, its progress counter has not
> been monotonically increasing. It started at 1%, then went to 5%, then
> back to 2%, etc.

Do you have any cron jobs set up to do periodic snapshots? If so, I think you're seeing:

  6343667 scrub/resilver has to start over when a snapshot is taken

I ran into this myself this week: I replaced a drive, and the resilver made it to 95% before a snapshot cron job fired and set things back to 0%.

- Bill
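A rough way to confirm that this is what's happening, assuming only that zpool status keeps printing the "in progress" line while resilvering: log the progress once a minute and watch for the percentage dropping back right after a snapshot job fires.

    # Log resilver progress over time; a restart shows up as the
    # "% done" value falling back toward zero.
    while true; do
        date
        zpool status xsr_slow_2 | grep 'in progress'
        sleep 60
    done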
Re: [zfs-discuss] replacing a drive in a raidz vdev
On Dec 2, 2006, at 1:32 PM, Bill Sommerfeld wrote:
> On Sat, 2006-12-02 at 00:08 -0500, Theo Schlossnagle wrote:
> > I had a disk malfunction in a raidz pool today. I had an extra one in
> > the enclosure and performed a:
> >
> >   zpool replace pool old new
> >
> > and several unexpected behaviors have transpired: the zpool replace
> > command hung for 52 minutes, during which no zpool commands could be
> > executed (like status, iostat, or list).
>
> So, I've observed that zfs will continue to attempt to do I/O to the
> outgoing drive while a replacement is in progress. (This seems
> counterintuitive; I'd expect that you'd want to touch the outgoing drive
> as little as possible, perhaps only attempting to read from it in the
> event that a block wasn't recoverable from the healthy drives.)
>
> > When it finally returned, the drive was marked as "replacing", as I
> > expected from reading the man page. However, its progress counter has
> > not been monotonically increasing. It started at 1%, then went to 5%,
> > then back to 2%, etc.
>
> Do you have any cron jobs set up to do periodic snapshots? If so, I
> think you're seeing:
>
>   6343667 scrub/resilver has to start over when a snapshot is taken
>
> I ran into this myself this week: I replaced a drive, and the resilver
> made it to 95% before a snapshot cron job fired and set things back
> to 0%.

Yesterday, a snapshot was taken to assist in backups -- that could be it.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
[zfs-discuss] replacing a drive in a raidz vdev
I had a disk malfunction in a raidz pool today. I had an extra one in the enclosure and performed a:

  zpool replace pool old new

and several unexpected behaviors have transpired: the zpool replace command hung for 52 minutes, during which no zpool commands could be executed (like status, iostat, or list). When it finally returned, the drive was marked as "replacing", as I expected from reading the man page. However, its progress counter has not been monotonically increasing. It started at 1%, then went to 5%, then back to 2%, etc. I just logged in to see if it was done and ran zpool status and received:

  pool: xsr_slow_2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 100.00% done, 0h0m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        xsr_slow_2                 ONLINE       0     0     0
          raidz                    ONLINE       0     0     0
            c4t600039316A1Fd0s2    ONLINE       0     0     0
            c4t600039316A1Fd1s2    ONLINE       0     0     0
            c4t600039316A1Fd2s2    ONLINE       0     0     0
            c4t600039316A1Fd3s2    ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              c4t600039316A1Fd4s2  ONLINE    2.87K   251     0
              c4t600039316A1Fd6    ONLINE       0     0     0
            c4t600039316A1Fd5s2    ONLINE       0     0     0

I thought to myself, if it is 100% done, why is it still replacing? I waited about 15 seconds and ran the command again to find something rather disconcerting:

  pool: xsr_slow_2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.45% done, 27h27m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        xsr_slow_2                 ONLINE       0     0     0
          raidz                    ONLINE       0     0     0
            c4t600039316A1Fd0s2    ONLINE       0     0     0
            c4t600039316A1Fd1s2    ONLINE       0     0     0
            c4t600039316A1Fd2s2    ONLINE       0     0     0
            c4t600039316A1Fd3s2    ONLINE       0     0     0
            replacing              ONLINE       0     0     0
              c4t600039316A1Fd4s2  ONLINE    2.87K   251     0
              c4t600039316A1Fd6    ONLINE       0     0     0
            c4t600039316A1Fd5s2    ONLINE       0     0     0

WTF?!

Best regards,

Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
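Given where the thread ends up (bug 6343667), one quick cross-check on a pool behaving like this is to compare snapshot creation times against the moments the counter reset; the creation property and the -t/-o options are assumed to be available on this zfs version:

    # List snapshots with their creation times; a progress reset at or
    # just after one of these timestamps points at bug 6343667.
    zfs list -t snapshot -o name,creation

If the resets line up with the timestamps, the fix for now is to suspend periodic snapshots until the resilver completes.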