Re: [zfs-discuss] zfs not yet suitable for HA applications?

Ross Smith Fri, 05 Dec 2008 14:00:42 -0800

Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson <[EMAIL PROTECTED]> wrote:
> Trying to keep this in the spotlight. Apologies for the lengthy post.


Heh, don't apologise, you should see some of my posts... o_0

> I'd really like to see features as described by Ross in his summary of the
> "Availability: ZFS needs to handle disk removal / driver failure better"
>  (http://www.opensolaris.org/jive/thread.jspa?messageID=274031&#274031 ).
> I'd like to have these/similar features as well. Has there already been
> internal discussions regarding adding this type of functionality to ZFS
> itself, and was there approval, disapproval or no decision?
>
> Unfortunately my situation has put me in urgent need to find workarounds in
> the meantime.
>
> My setup: I have two iSCSI target nodes, each with six drives exported via
> iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
> both Storage Nodes and creates a mirrored Zpool with one drive from each
> Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).
>
> My problem: If a Storage Node crashes completely, is disconnected from the
> network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
> accessing data (read retries), then my ZFS Node hangs while ZFS waits
> patiently for the layers below to report a problem and timeout the devices.
> This can lead to a roughly 3 minute or longer halt when reading OR writing
> to the Zpool on the ZFS node. While this is acceptable in certain
> situations, I have a case where my availability demand is more severe.
>
> My goal: figure out how to have the zpool pause for NO LONGER than 30
> seconds (roughly within a typical HTTP request timeout) and then issue
> reads/writes to the good devices in the zpool/mirrors while the other side
> comes back online or is fixed.
>
> My ideas:
>  1. In the case of the iscsi targets disappearing (iscsitgt core dump,
> Storage Node crash, Storage Node disconnected from network), I need to lower
> the iSCSI login retry/timeout values. Am I correct in assuming the only way
> to accomplish this is to recompile the iscsi initiator? If so, can someone
> help point me in the right direction (I have never compiled ONNV sources -
> do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install
the new driver.  I have some *very* rough notes that were sent to me
about a year ago, but I've no experience compiling anything in
Solaris, so don't know how useful they will be.  I'll try to dig them
out in case they're useful.

>
>   1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
> applicable - only for the initial login, or also in the case of reconnect?
> Ross, if you still have your test boxes available, can you please try
> setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot and try
> failing your iscsi vdevs again? I can't find a case where this was tested
> quick failover.

Will gladly have a go at this on Monday.

>    1.b. I would much prefer to have bug 6497777 addressed and fixed rather
> than having to resort to recompiling the iscsi initiator (if
> iscsi_sess_max_delay) doesn't work. This seems like a trivial feature to
> implement. How can I sponsor development?
>
>  2. In the case of the iscsi target being reachable, but the physical disk
> is having problems reading/writing data (retryable events that take roughly
> 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
> mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
> in the thread referenced above (with value 15), which resulted in a 60
> second hang. How did you offline the iscsi vol to test this failure? Unless
> iscsi uses a multiple of the value for retries, then maybe the way you
> failed the disk caused the iscsi system to follow a different failure path?
> Unfortunately I don't know of a way to introduce read/write retries to a
> disk while the disk is still reachable and presented via iscsitgt, so I'm
> not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI
target.  Next step will involve pulling some virtual cables.
Unfortunately I don't think I've got a physical box handy to test
drive failures right now, but my previous testing (of simply pulling
drives) showed that it can be hit and miss as to how well ZFS detects
these types of 'failure'.

Like you I don't know yet how to simulate failures, so I'm doing
simple tests right now, offlining entire drives or computers.
Unfortunately I've found more than enough problems with just those
tests to keep me busy.


>    2.a With the fix of
> http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
> sd_retry_count along with sd_io_time to cause I/O failure when a command
> takes longer than sd_retry_count * sd_io_time. Can (or should) these
> tunables be set on the imported iscsi disks in the ZFS Node, or can/should
> they be applied only to the local disk on the Storage Nodes? If there is a
> way to apply them to ONLY the imported iscsi disks (and not the local disks)
> of the ZFS Node, and without rebooting every time a new iscsi disk is
> imported, then I'm thinking this is the way to go.

Very interesting.  I'll certainly be looking to see if this helps too.

>
> In a year of having this setup in customer beta I have never had Storage
> nodes (or both sides of a mirror) down at the same time. I'd like ZFS to
> take advantage of this. If (and only if) both sides fail then ZFS can enter
> failmode=wait.
>
> Currently using Nevada b96. Planning to move to >100 shortly to avoid zpool
> commands hanging while the zpool is waiting to reach a device.

They still hang if my experience last week is anything to go by.  But
at least they recover now and don't lock up the pool for good.  Filing
the bug for that is on my to do list :)


>
> David Anderson
> Aktiom Networks, LLC
>
> Ross wrote:
>> I discussed this exact issue on the forums in February, and filed a bug at
>> the time.  I've also e-mailed and chatted with the iSCSI developers, and the
>> iSER developers a few times.  There was also been another thread about the
>> iSCSI timeouts being made configurable a few months back, and finally, I
>> started another discussion on ZFS availability, and filed an RFE for pretty
>> much exactly what you're asking for.
>>
>> So the question is being asked, but as for how long it will be before Sun
>> improve ZFS availability, I really wouldn't like to say.  One potential
>> problem is that Sun almost certainly have a pretty good HA system with
>> Fishworks running on their own hardware, and I don't know how much they are
>> going to want to create an open source alternative to that.
>>
>> My original discussion in Feb:
>> http://opensolaris.org/jive/thread.jspa?messageID=213482
>>
>> The iSCSI timeout bugs.  The first one was raised in November 2006!!
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
>>
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=a1c19a874eb8bffffffffac94084acffabc5?bug_id=6670866
>>
>> The ZFS availability thread:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=274031&#274031
>>
>> I can't find the RFE I filed on the back of that just yet, I'll have a
>> look through my e-mails on Monday to find it for you.
>>
>> The one bright point is that it does look like it would be possible to
>> edit iscsi.h manually and recompile the driver, but that's a bit outside of
>> my experience right now so I'm leaving that until I have no other choice.
>>
>> Ross
>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] zfs not yet suitable for HA applications?

Reply via email to