Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-27 Thread Cindy Swearingen

Hi Laurent,

I was able to reproduce it on a Solaris 10 5/09 system.
The problem is fixed in the current Nevada bits and also in
the upcoming Solaris 10 release.

The bug fix that integrated this change might be this one:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6328632
zpool offline is a bit too conservative

I can understand that you would want to offline a faulty
disk. In the meantime, you might use fmdump to help isolate
the transient errors.
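
For example (a rough sketch; the exact options and output depend on the
release and the FMA agents installed):

# fmdump
    (one line per fault diagnosed by fmd, with a UUID and message ID)
# fmdump -e
    (the raw error reports coming in from the drivers)
# fmdump -eV
    (verbose ereport detail, including the device path of the disk involved)

Matching the timestamps of those ereports against the monthly hiccups
should show whether it is always the same disk or path.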

Thanks,

Cindy

On 07/20/09 08:36, Laurent Blume wrote:
> Thanks a lot, Cindy!
>
> Let me know how it goes or if I can provide more info.
> Part of the bad luck I've had with that set is that it reports such errors
> about once a month, then everything goes back to normal again. So I'm pretty
> sure that I'll be able to try to offline the disk someday.
>
> Laurent



Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-20 Thread Laurent Blume
Thanks a lot, Cindy! 

Let me know how it goes or if I can provide more info.
Part of the bad luck I've had with that set is that it reports such errors
about once a month, then everything goes back to normal again. So I'm pretty
sure that I'll be able to try to offline the disk someday.

Laurent


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-20 Thread Laurent Blume
> You're right, from the documentation it definitely
> should work. Still, it doesn't, at least not in
> Solaris 10. But I am not a ZFS developer, so this
> should probably be answered by them. I will give it a
> try with a recent OpenSolaris VM and check whether
> this works in newer implementations of ZFS.

Thanks for confirming that it does work in b117. I'm not able to test it easily 
at the moment.

> Again, you are right that this is a very annoying
> behaviour. The same thing happens with DiskSuite
> pools and UFS when a disk is failing, though.
> For me it is not a ZFS problem but a Solaris
> problem: the kernel should stop trying to access
> failing disks a LOT earlier instead of blocking
> I/O for the whole system.

I think it's both. At least, it used to be very much on the ZFS side. My
understanding is that it has been improved to better handle issues reported by
the driver.
But right now, those thousands of retries in the logs are pretty much useless.
The system should indeed provide a way to automatically isolate such a disk;
that could fault a zpool and trigger a ZFS panic, but ZFS already handles such
cases.
I understand it's not an easy task ;-)

> I always understood ZFS as a concept for hot-pluggable
> disks. That is the way I use it, and that is why I
> never really had this problem. Whenever I run into
> this behaviour, I simply pull the disk in question
> and replace it. The times those "hiccups" have
> affected the performance of our production
> environment have never been longer than a couple of
> minutes.

Ah, that's basically what I'm doing remotely with cfgadm. I'm a few thousand
kilometers away from those disks, and worse, I was cheap at the time and didn't
buy an enclosure with removable drives. Well, I didn't expect so many issues;
I've had some bad luck with it from the beginning.

Laurent


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-17 Thread Cindy Swearingen

Hi Laurent,

Yes, you should be able to offline a faulty device in a redundant
configuration as long as enough devices remain for the pool to
continue functioning.

On my Solaris Nevada system (latest bits), injecting a fault
into a disk in a RAID-Z configuration and then offlining a disk
works as expected.
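
Something along these lines, for example (a minimal sketch, assuming the
zinject test tool; its options vary between builds and the pool and device
names below are only placeholders):

# zinject -d c1t1d0 -e io tank
    (start injecting I/O errors against c1t1d0)
# zpool scrub tank
    (generate I/O so the errors are seen and the disk is faulted)
# zpool offline tank c1t1d0
# zinject -c all
    (cancel the injection afterwards)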

On my Solaris 10 system, I'm unable to offline a faulted disk in
a RAID-Z configuration, so I will get back to you with a bug ID
or some other plausible explanation.

Thanks for reporting this problem.

Cindy




Laurent Blume wrote:

> > You could offline the disk if *this* disk (not
> > the pool) had a replica. Nothing wrong with the
> > documentation. Hmm, maybe it is a little misleading
> > here. I walked into the same "trap".
>
> I apologize for being daft here, but I don't find any ambiguity in the
> documentation.
> This is explicitly stated as being possible.
>
> "This scenario is possible assuming that the systems in question see the
> storage once it is attached to the new switches, possibly through different
> controllers than before, and your pools are set up as RAID-Z or mirrored
> configurations."
>
> And further down, it even says that it's not possible to offline two devices
> in a RAID-Z, with that exact error as an example:
>
> "You cannot take a pool offline to the point where it becomes faulted. For
> example, you cannot take offline two devices out of a RAID-Z configuration,
> nor can you take offline a top-level virtual device.
>
> # zpool offline tank c1t0d0
> cannot offline c1t0d0: no valid replicas
> "
>
> http://docs.sun.com/app/docs/doc/819-5461/gazgm?l=en&a=view
>
> I don't understand what you mean by this disk not having a replica. It's
> RAID-Z2: by definition, all the data it contains is replicated on two other
> disks in the pool. That's why the pool is still working fine.
>
> > The pool is not using the disk anymore anyway, so
> > (from the zfs point of view) there is no need to
> > offline the disk. If you want to stop the I/O system
> > from trying to access the disk, pull it out or wait
> > until it gives up...
>
> Yes, there is. I don't want the disk to become online if the system reboots,
> because what actually happens is that it *never* gives up (well, at least not
> in more than 24 hours), and all I/O to the zpool stops as long as there are
> those errors. Yes, I know it should continue working. In practice, it does not
> (though it used to be much worse in previous versions of S10, with all I/O
> stopping on all disks and volumes, both ZFS and UFS, and usually ending in a
> panic).
> And the zpool command hangs and never finishes. The only way to get out of it
> is to use cfgadm to send multiple hardware resets to the SATA device, then
> disconnect it. At this point, zpool completes and shows the disk as having
> faulted.
>
> Laurent



Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-16 Thread Ross
Great news, thanks Tom!


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-16 Thread Thomas Liesner
FYI:

In b117 it works as expected and stated in the documentation.

Tom


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-16 Thread Thomas Liesner
You're right, from the documentation it definitely should work. Still, it
doesn't, at least not in Solaris 10. But I am not a ZFS developer, so this
should probably be answered by them. I will give it a try with a recent
OpenSolaris VM and check whether this works in newer implementations of ZFS.

> > The pool is not using the disk anymore anyway, so
> > (from the zfs point of view) there is no need to
> > offline the disk. If you want to stop the I/O system
> > from trying to access the disk, pull it out or wait
> > until it gives up...
>
> Yes, there is. I don't want the disk to become online
> if the system reboots, because what actually happens
> is that it *never* gives up (well, at least not in
> more than 24 hours), and all I/O to the zpool stops as
> long as there are those errors. Yes, I know it should
> continue working. In practice, it does not (though it
> used to be much worse in previous versions of S10,
> with all I/O stopping on all disks and volumes, both
> ZFS and UFS, and usually ending in a panic).
> And the zpool command hangs and never finishes. The
> only way to get out of it is to use cfgadm to send
> multiple hardware resets to the SATA device, then
> disconnect it. At this point, zpool completes and
> shows the disk as having faulted.

Again, you are right that this is a very annoying behaviour. The same thing
happens with DiskSuite pools and UFS when a disk is failing, though.
For me it is not a ZFS problem but a Solaris problem: the kernel should stop
trying to access failing disks a LOT earlier instead of blocking I/O for the
whole system.
I always understood ZFS as a concept for hot-pluggable disks. That is the way I
use it, and that is why I never really had this problem. Whenever I run into
this behaviour, I simply pull the disk in question and replace it. The times
those "hiccups" have affected the performance of our production environment
have never been longer than a couple of minutes.

Tom


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-16 Thread Laurent Blume
> You could offline the disk if *this* disk (not
> the pool) had a replica. Nothing wrong with the
> documentation. Hmm, maybe it is a little misleading
> here. I walked into the same "trap".

I apologize for being daft here, but I don't find any ambiguity in the 
documentation.
This is explicitly stated as being possible.

"This scenario is possible assuming that the systems in question see the 
storage once it is attached to the new switches, possibly through different 
controllers than before, and your pools are set up as RAID-Z or mirrored 
configurations."

And further down, it even says that it's not possible to offline two devices in a
RAID-Z, with that exact error as an example:

"You cannot take a pool offline to the point where it becomes faulted. For 
example, you cannot take offline two devices out of a RAID-Z configuration, nor 
can you take offline a top-level virtual device.

# zpool offline tank c1t0d0
cannot offline c1t0d0: no valid replicas
"

http://docs.sun.com/app/docs/doc/819-5461/gazgm?l=en&a=view

I don't understand what you mean by this disk not having a replica. It's 
RAID-Z2: by definition, all the data it contains is replicated on two other 
disks in the pool. That's why the pool is still working fine.

> The pool is not using the disk anymore anyway, so
> (from the zfs point of view) there is no need to
> offline the disk. If you want to stop the I/O system
> from trying to access the disk, pull it out or wait
> until it gives up...

Yes, there is. I don't want the disk to become online if the system reboots,
because what actually happens is that it *never* gives up (well, at least not
in more than 24 hours), and all I/O to the zpool stops as long as there are
those errors. Yes, I know it should continue working. In practice, it does not
(though it used to be much worse in previous versions of S10, with all I/O
stopping on all disks and volumes, both ZFS and UFS, and usually ending in a
panic).
And the zpool command hangs and never finishes. The only way to get out of it
is to use cfgadm to send multiple hardware resets to the SATA device, then
disconnect it. At this point, zpool completes and shows the disk as having
faulted.
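
Concretely, that sequence is roughly the following (a sketch from memory;
sata1/1 is only an example attachment point, the real name comes from
cfgadm -al, and the -x subcommand names should be checked against
cfgadm_sata(1M)):

# cfgadm -al
    (find the attachment point of the faulted disk, e.g. sata1/1::dsk/c2t1d0)
# cfgadm -x sata_reset_device sata1/1
    (hardware reset of the device; may need to be repeated)
# cfgadm -c disconnect sata1/1
    (take the port offline so the kernel stops retrying)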


Laurent


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-15 Thread Thomas Liesner
You could offline the disk if *this* disk (not the pool) had a replica.
Nothing wrong with the documentation. Hmm, maybe it is a little misleading here.
I walked into the same "trap".

The pool is not using the disk anymore anyway, so (from the zfs point of view)
there is no need to offline the disk. If you want to stop the I/O system from
trying to access the disk, pull it out or wait until it gives up...


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-15 Thread Laurent Blume
I don't have a replacement, but I don't want the disk to be used right now by
the pool: how do I do that?
This is exactly the point of the offline command as explained in the
documentation: disabling unreliable hardware, or removing it temporarily.
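
To be precise, the documented usage I'm referring to is simply this, per
zpool(1M), using the pool and disk names from my own setup:

# zpool offline data c2t1d0
    (keep ZFS off the disk until it is explicitly brought back)
# zpool offline -t data c2t1d0
    (temporary offline; the device reverts to its previous state on reboot)
# zpool online data c2t1d0
    (return the disk to service once it behaves again)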
So is this a huge bug in the documentation?

What's the point of the command if its stated purpose doesn't work? I'm really
puzzled now.

Laurent


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-15 Thread Thomas Liesner
You can't replace it because this disk is still a valid member of the pool,
although it is marked faulty.
Put in a replacement disk, add it to the pool, and replace the faulty one with
the new disk.
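
For example (a sketch; c2t4d0 below only stands for whatever name the new
disk shows up under):

# zpool replace data c2t1d0 c2t4d0
    (resilvers the new disk in place of the faulted one)
# zpool status data
    (shows the resilver progress and, once it completes, a healthy pool)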

Regards,
Tom


Re: [zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-13 Thread Ross
Yup, just hit exactly the same myself. I have a feeling this faulted disk is
affecting performance, so I tried to remove or offline it:

$ zpool iostat -v 30

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rc-pool     1.27T  1015G    682     71  84.0M  1.88M
  mirror     199G   265G      0      5      0  21.1K
    c4t1d0      -      -      0      2      0  21.1K
    c4t2d0      -      -      0      0      0      0
    c5t1d0      -      -      0      2      0  21.1K
  mirror     277G   187G    170      7  21.1M   322K
    c4t3d0      -      -     58      4  7.31M   322K
    c5t2d0      -      -     54      4  6.83M   322K
    c5t0d0      -      -     56      4  6.99M   322K
  mirror     276G   188G    171      6  21.1M   336K
    c5t3d0      -      -     56      4  7.03M   336K
    c4t5d0      -      -     56      3  7.03M   336K
    c4t4d0      -      -     56      3  7.04M   336K
  mirror     276G   188G    169      6  20.9M   353K
    c5t4d0      -      -     57      3  7.17M   353K
    c5t5d0      -      -     54      4  6.79M   353K
    c4t6d0      -      -     55      3  6.99M   353K
  mirror     277G   187G    171     10  20.9M   271K
    c4t7d0      -      -     56      4  7.11M   271K
    c5t6d0      -      -     55      5  6.93M   271K
    c5t7d0      -      -     55      5  6.88M   271K
  c6d1p0       32K   504M      0     34      0   620K
----------  -----  -----  -----  -----  -----  -----

20 MB in 30 seconds across 3 disks is about 220 KB/s per disk. Not healthy at all.

$ zpool status
  pool: rc-pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: scrub completed after 2h55m with 0 errors on Tue Jun 23 11:11:42 2009
config:

NAME          STATE     READ WRITE CKSUM
rc-pool       DEGRADED     0     0     0
  mirror      DEGRADED     0     0     0
    c4t1d0    ONLINE       0     0     0
    c4t2d0    FAULTED  1.71M 23.3M     0  too many errors
    c5t1d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c4t3d0    ONLINE       0     0     0
    c5t2d0    ONLINE       0     0     0
    c5t0d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c5t3d0    ONLINE       0     0     0
    c4t5d0    ONLINE       0     0     0
    c4t4d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c5t4d0    ONLINE       0     0     0
    c5t5d0    ONLINE       0     0     0
    c4t6d0    ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c4t7d0    ONLINE       0     0     0
    c5t6d0    ONLINE       0     0     0
    c5t7d0    ONLINE       0     0     0
logs          DEGRADED     0     0     0
  c6d1p0      ONLINE       0     0     0

errors: No known data errors


# zpool offline rc-pool c4t2d0
cannot offline c4t2d0: no valid replicas

# zpool remove rc-pool c4t2d0
cannot remove c4t2d0: only inactive hot spares or cache devices can be removed


[zfs-discuss] Can't offline a RAID-Z2 device: "no valid replica"

2009-07-12 Thread Laurent Blume
(As I'm not subscribed to this list, you can keep me in CC:, but I'll check out 
the Jive thread)

Hi all,

I've seen this question asked several times, but no solution was ever provided.
I'm trying to offline a faulted device in a RAID-Z2 vdev on Solaris 10,
following the documentation:

http://docs.sun.com/app/docs/doc/819-5461/gazfy?l=en&a=view

However, I always get the same message:
cannot offline c2t1d0: no valid replicas

How come? It's a RAID-Z2 pool; it should (and does) work fine without one
device.
What am I doing wrong?

TIA!


# zpool status data
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
data        DEGRADED     0     0     0
  raidz2    DEGRADED     0     0     0
    c2t0d0  ONLINE       0     0     0
    c2t1d0  FAULTED      3   636     1  too many errors
    c2t2d0  ONLINE       0     0     0
    c2t3d0  ONLINE       0     0     0

errors: No known data errors

# zpool offline -t data c2t1d0
cannot offline c2t1d0: no valid replicas


Laurent