Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross
Ok, I've done some more testing today and I almost don't know where to start. I'll begin with the good news for Miles :) - Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 1. below) - Resilvering appears to work fine; once complete I never saw any checksum errors when

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Maurice Volaski
2. With iscsi, you can't reboot with sendtargets enabled; static discovery still seems to be the order of the day. I'm seeing this problem with static discovery: http://bugs.opensolaris.org/view_bug.do?bug_id=6775008. 4. iSCSI still has a 3-minute timeout, during which time your pool will

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon. I guess you can't reboot with iscsi full stop... o_0 And I've seen the iscsi bug before (I was just too lazy to look it up lol), I've been complaining about that since February. In fact it's been a bad week for iscsi here, I've managed

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Miles Nordin
r == Ross [EMAIL PROTECTED] writes: rs I don't think it likes it if the iscsi targets aren't rs available during boot. from my cheatsheet: -8- ok boot -m milestone=none [boots. enter root password for maintenance.] bash-3.00# /sbin/mount -o remount,rw / [-- otherwise iscsiadm

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks, I've just followed up on this, testing iSCSI with a raided pool, and it still appears to be struggling when a device goes offline. I don't see how this could work except for mirrored pools. Would that carry enough market to be worthwhile? -- richard I have to admit, I've not

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Incidentally, while I've reported this again as an RFE, I still haven't seen a CR number for this. Could somebody from Sun check if it's been filed, please? Thanks, Ross

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard, Thanks, I'll give that a try. I think I just had a kernel dump while trying to boot this system back up though, I don't think it likes it if the iscsi targets aren't available during boot. Again, that rings a bell, so I'll go see if that's another known bug. Changing that setting

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
rs == Ross Smith [EMAIL PROTECTED] writes: rs 4. zpool status still reports out of date information. I know people are going to skim this message and not hear this. They'll say ``well of course zpool status says ONLINE while the pool is hung. ZFS is patiently waiting. It doesn't know

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Hi Miles, It's probably a bad sign that although that post came through as anonymous in my e-mail, I recognised your style before I got halfway through your post :) I agree, the zpool status being out of date is weird; I'll dig out the bug number for that at some point as I'm sure I've

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
r == Ross [EMAIL PROTECTED] writes: r style before I got half way through your post :) [...status r problems...] could be a case of oversimplifying things. yeah I was a bit inappropriate, but my frustration comes from the (partly paranoid) imagining of how the idea ``we need to make

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Toby Thain
On 2-Dec-08, at 3:35 PM, Miles Nordin wrote: r == Ross [EMAIL PROTECTED] writes: r style before I got half way through your post :) [...status r problems...] could be a case of oversimplifying things. ... And yes, this is a religious argument. Just because it spans decades of

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-28 Thread Richard Elling
Ross Smith wrote: On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote: Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread James C. McPherson
On Thu, 27 Nov 2008 04:33:54 -0800 (PST) Ross [EMAIL PROTECTED] wrote: Hmm... I logged this CR ages ago, but now I've come to find it in the bug tracker I can't see it anywhere. I actually logged three CR's back to back, the first appears to have been created ok, but two have just

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Thanks James, I've e-mailed Alan and submitted this one again.

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Hmm... I logged this CR ages ago, but now I've come to find it in the bug tracker I can't see it anywhere. I actually logged three CR's back to back, the first appears to have been created ok, but two have just disappeared. The one I created ok is:

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
Hello, Thank you for this very interesting thread! I want to confirm that Synchronous Distributed Storage is a main goal when using ZFS! The target architecture is 1 local drive and 2 (or more) remote iSCSI targets, with ZFS being the iSCSI initiator. The system is designed/cut so that local

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two stage RFE, since just the first part would have benefits. The key is to improve ZFS availability,

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: Thank you ! Yes, this was also to tell you that you are not alone :-) I agree completely with you on your technical points

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Richard Elling
Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two stage RFE, since just the first part would have benefits. The key is to improve

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote: Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two stage RFE,

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-06 Thread Ross
Hey folks, Well, there haven't been any more comments knocking holes in this idea, so I'm wondering now if I should log this as an RFE? Is this something others would find useful? Ross

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-06 Thread Richard Elling
Ross wrote: Hey folks, Well, there haven't been any more comments knocking holes in this idea, so I'm wondering now if I should log this as an RFE? go for it! Is this something others would find useful? Yes. But remember that this has a very limited scope. Basically it will

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith
Thinking about it, we could make use of this too. The ability to add a remote iSCSI mirror to any pool without sacrificing local performance could be a huge benefit.

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Johan Hartzenberg
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins [EMAIL PROTECTED] wrote: Miles Nordin writes: suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation. So on a server with a read

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Richard Elling
Ross Smith wrote: Triple mirroring you say? That'd be me then :D The reason I really want to get ZFS timeouts sorted is that our long term goal is to mirror that over two servers too, giving us a pool mirrored across two servers, each of which is actually a zfs iscsi volume hosted on

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross
Wow, some great comments on here now, even a few people agreeing with me which is nice :D I'll happily admit I don't have the in-depth understanding of storage many of you guys have, but since the idea doesn't seem pie-in-the-sky crazy, I'm going to try to write up all my current thoughts on

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Bob Friesenhahn
On Sat, 30 Aug 2008, Ross wrote: while the problem is diagnosed. - With that said, could the write timeout default to on when you have a slog device? After all, the data is safely committed to the slog, and should remain there until it's written to all devices. Bob, you seemed the most
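The policy proposed here can be sketched in a few lines: once a write is committed to the slog it is durable, so the pool could acknowledge it and retry the unresponsive device in the background instead of blocking. This is a hypothetical illustration of the idea, not ZFS source; all names are invented.

```python
# Sketch of the "write timeout defaults on with a slog" idea:
# the slog commit is the durability point, so slow/failed device
# writes are queued for background retry rather than blocking.

class Device:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def write(self, addr, data):
        if not self.healthy:
            raise IOError(f"{self.name}: device not responding")
        self.blocks[addr] = data

class Pool:
    def __init__(self, devices, slog):
        self.devices = devices
        self.slog = slog
        self.pending = []  # committed to slog, not yet on all devices

    def write(self, addr, data):
        # Step 1: commit to the slog first; data is now safe.
        self.slog.write(addr, data)
        # Step 2: try each device, queueing failures for retry
        # instead of making the caller wait out a long timeout.
        for dev in self.devices:
            try:
                dev.write(addr, data)
            except IOError:
                self.pending.append((dev, addr, data))
        return True  # acknowledged: the slog holds the data

    def retry_pending(self):
        still_pending = []
        for dev, addr, data in self.pending:
            try:
                dev.write(addr, data)
            except IOError:
                still_pending.append((dev, addr, data))
        self.pending = still_pending
```

The trade-off Ross raises is visible in the sketch: between the acknowledgement and a successful retry, the pool is accepting writes while effectively non-redundant.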

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross Smith
the pool to continue accepting writes while the pool is in a non-redundant state. Ross

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Eric Schrock writes: A better option would be to not use this to perform FMA diagnosis, but instead work into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over
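The mirror child selection idea quoted above, tracking latency over time and preferring the faster drive, can be sketched with an exponentially weighted moving average per child. This is illustrative only (not the actual ZFS code); the class names and the 0.3 smoothing factor are assumptions.

```python
# Sketch: keep an EWMA of per-child read latency and route reads
# to the mirror child with the lowest current estimate.

class MirrorChild:
    def __init__(self, name):
        self.name = name
        self.ewma_ms = None  # no samples yet

    def record(self, latency_ms, alpha=0.3):
        # Standard EWMA update; recent samples dominate.
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = alpha * latency_ms + (1 - alpha) * self.ewma_ms

def pick_read_child(children):
    # Prefer an unsampled child so every device gets measured,
    # then pick the lowest latency estimate.
    unsampled = [c for c in children if c.ewma_ms is None]
    if unsampled:
        return unsampled[0]
    return min(children, key=lambda c: c.ewma_ms)
```

A slow remote iSCSI child would quickly accumulate a high estimate and stop receiving reads, which is exactly the availability benefit being discussed, without declaring the device faulted.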

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Miles Nordin writes: bf == Bob Friesenhahn [EMAIL PROTECTED] writes: bf You are saying that I can't split my mirrors between a local bf disk in Dallas and a remote disk in New York accessed via bf iSCSI? nope, you've misread. I'm saying reads should go to the local disk

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote: As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable,
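The tunable Eric raises, return when the first half of the mirror completes versus when all halves do, can be made concrete with a small sketch. This is a hypothetical model of the policy choice, not a real ZFS tunable; the names are invented.

```python
# Sketch: a mirrored write that either acknowledges on the first
# successful half ("return-on-first") or waits for all halves
# ("return-on-all"). Each half is a callable returning True on success.

FIRST = "return-on-first"
ALL = "return-on-all"

def mirror_write(halves, data, policy=ALL):
    """Returns (acknowledged, laggards): laggards are halves that
    had not yet completed when the write was acknowledged."""
    done = []
    for i, half in enumerate(halves):
        ok = half(data)
        done.append(ok)
        if policy == FIRST and ok:
            # Acknowledge immediately; the remaining halves would
            # complete asynchronously, leaving a redundancy window.
            return True, halves[i + 1:]
    return all(done), []
```

The laggards list makes the cost of the fast policy explicit: until those halves catch up, the mirror is acknowledging data that exists on only one side.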

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Nicolas Williams wrote: On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost.

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
es == Eric Schrock [EMAIL PROTECTED] writes: es The main problem with exposing tunables like this is that they es have a direct correlation to service actions, and es mis-diagnosing failures costs everybody (admin, companies, es Sun, etc) lots of time and money. Once you expose

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
re == Richard Elling [EMAIL PROTECTED] writes: re if you use Ethernet switches in the interconnect, you need to re disable STP on the ports used for interconnects or risk re unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as ``edge'' is

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Miles Nordin wrote: re == Richard Elling [EMAIL PROTECTED] writes: re if you use Ethernet switches in the interconnect, you need to re disable STP on the ports used for interconnects or risk re unnecessary cluster reconfigurations. RSTP/802.1w plus setting the

[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread: ZFS can guarantee data integrity; what it can't do is guarantee data availability. The problem is,

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Ross wrote: I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere, so why shouldn't a timeout do the same thing? A problem is that for some devices, a five-minute

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
Ross, thanks for the feedback. A couple points here - A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older nevada builds that didn't have these fixes. Anything from

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
es == Eric Schrock [EMAIL PROTECTED] writes: es Finally, imposing additional timeouts in ZFS is a bad idea. es [...] As such, it doesn't have the necessary context to know es what constitutes a reasonable timeout. you're right in terms of fixed timeouts, but there's no reason it

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote: you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy,

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote: you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith
willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out. Ross

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
es == Eric Schrock [EMAIL PROTECTED] writes: es I don't think you understand how this works. Imagine two es I/Os, just with different sd timeouts and retry logic - that's es B_FAILFAST. It's quite simple, and independent of any es hardware implementation. AIUI the main timeout
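Eric's description of B_FAILFAST, the same I/O issued with different timeout and retry behaviour, can be sketched as a two-phase read: try every redundant source with a quick-to-fail attempt first, and only then fall back to the patient path with full driver timeouts. This is an illustrative model of the pattern, not the sd/ZFS implementation.

```python
# Sketch of the B_FAILFAST pattern: failfast attempts let ZFS try
# redundancy quickly; the patient pass keeps the long retry logic
# as a last resort.

def read_with_failfast(devices, addr, attempt):
    """attempt(dev, addr, failfast) -> data, or None on failure."""
    # Phase 1: failfast - short timeout, no retries.
    for dev in devices:
        data = attempt(dev, addr, failfast=True)
        if data is not None:
            return data
    # Phase 2: patient - full timeouts and retries, since no
    # redundant copy answered quickly.
    for dev in devices:
        data = attempt(dev, addr, failfast=False)
        if data is not None:
            return data
    raise IOError("all devices failed")
```

The point of the ordering is that a hung device costs almost nothing while a healthy mirror half exists; the long timeouts are only paid when every quick attempt has failed.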

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
bf == Bob Friesenhahn [EMAIL PROTECTED] writes: bf If the system or device is simply overwhelmed with work, then bf you would not want the system to go haywire and make the bf problems much worse. None of the decisions I described it making based on performance statistics are

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote: Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected; as far as I'm concerned that disk is faulty. Unless you have power management enabled, or there's a bad

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote: None of the decisions I described it making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that? From what I

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bill Sommerfeld
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: A better option would be to not use this to perform FMA diagnosis, but instead work into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to