Trying to keep this in the spotlight. Apologies for the lengthy post.
I'd really like to see the features (or similar ones) that Ross
describes in his summary of the "Availability: ZFS needs to handle
disk removal / driver failure better" thread
(http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031).
Have there already been internal discussions about adding this type
of functionality to ZFS itself, and if so, was the outcome approval,
disapproval, or no decision?
Unfortunately, my situation has put me in urgent need of workarounds
in the meantime.
My setup: I have two iSCSI target nodes ("Storage Nodes"), each
exporting six drives via iSCSI. A separate ZFS Node logs into the
targets on both Storage Nodes and builds a mirrored zpool, with one
drive from each Storage Node forming each half of every mirrored
vdev (6 x 2-way mirrors).
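(For reference, the pool was created along these lines - pool and
device names below are placeholders, with c3t*d0 standing in for the
LUNs imported from one Storage Node and c4t*d0 for the other:

  # one disk from each Storage Node in every mirror vdev
  zpool create tank \
      mirror c3t0d0 c4t0d0 \
      mirror c3t1d0 c4t1d0 \
      mirror c3t2d0 c4t2d0 \
      mirror c3t3d0 c4t3d0 \
      mirror c3t4d0 c4t4d0 \
      mirror c3t5d0 c4t5d0

so any single Storage Node can go away and every vdev still has one
healthy side.)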
My problem: if a Storage Node crashes completely or is disconnected
from the network, iscsitgt core dumps, a drive is pulled, or a drive
has trouble accessing data (read retries), then my ZFS Node hangs
while ZFS waits patiently for the layers below to report the problem
and time out the devices. This can mean a halt of roughly three
minutes or longer on reads OR writes to the zpool on the ZFS Node.
That is acceptable in some situations, but I have a case where the
availability requirements are stricter.
My goal: figure out how to have the zpool pause for NO LONGER than 30
seconds (roughly within a typical HTTP request timeout) and then issue
reads/writes to the good devices in the zpool/mirrors while the other
side comes back online or is fixed.
My ideas:
1. In the case of the iSCSI targets disappearing (iscsitgt core
dump, Storage Node crash, Storage Node disconnected from the
network), I need to lower the iSCSI login retry/timeout values. Am I
correct in assuming the only way to accomplish this is to recompile
the iSCSI initiator? If so, can someone point me in the right
direction? I have never compiled the ONNV sources - do I need to
build the whole tree, or can I recompile just the iSCSI initiator?
1.a. I'm not sure in which initiator session states
iscsi_sess_max_delay is applicable - only for the initial login, or
also on reconnect? Ross, if you still have your test boxes
available, can you please try setting iscsi:iscsi_sess_max_delay = 5
in /etc/system (see the sketch after this list), reboot, and try
failing your iSCSI vdevs again? I can't find a case where this has
been tested for quick failover.
1.b. I would much prefer to have bug 649 addressed and fixed
rather than having to resort to recompiling the iSCSI initiator (if
iscsi_sess_max_delay doesn't work). This seems like a trivial
feature to implement. How can I sponsor development?
2. In the case of the iSCSI target being reachable but the
physical disk having problems reading/writing data (retryable events
that take roughly 60 seconds to time out), should I change the
iscsi_rx_max_window tunable with mdb (see the sketch after this
list)? Is there a tunable for iscsi_tx? Ross, I know you tried this
recently in the thread referenced above (with value 15), which still
resulted in a 60 second hang. How did you offline the iSCSI volume
to test this failure? Unless iSCSI uses a multiple of that value for
retries, maybe the way you failed the disk sent the iSCSI stack down
a different failure path? Unfortunately I don't know of a way to
introduce read/write retries on a disk while it is still reachable
and presented via iscsitgt, so I'm not sure how to test this.
2.a With the fix for http://bugs.opensolaris.org/view_bug.do?bug_id=6518995
, we can set sd_retry_count along with sd_io_time to force an I/O
failure when a command takes longer than sd_retry_count * sd_io_time
(see the sketch after this list). Can (or should) these tunables be
set on the imported iSCSI disks on the ZFS Node, or can/should they
be applied only to the local disks on the Storage Nodes? If there is
a way to apply them to ONLY the imported iSCSI disks (and not the
local disks) of the ZFS Node, and without rebooting every time a new
iSCSI disk is imported, then I'm thinking this is the way to go.
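Sketch for 1.a, in case it helps: the /etc/system entry I have in
mind on the ZFS Node is just the line below (whether it is honored
on reconnects, and not only at initial login, is exactly the part
I'm unsure about):

  * cap iSCSI session (re)login retries at 5 seconds
  set iscsi:iscsi_sess_max_delay = 5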
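Sketch for 2: the mdb change I'm thinking of would look roughly like
this. iscsi_rx_max_window is the tunable from the thread above, 15
is just the value Ross tried, and I haven't verified whether a live
change affects commands already in flight:

  # print the current value, then set it to 15 in the running kernel
  echo "iscsi_rx_max_window/D" | mdb -k
  echo "iscsi_rx_max_window/W 0t15" | mdb -kw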
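Sketch for 2.a: what I would try in /etc/system on the ZFS Node to
hit my 30 second target, assuming sd honors these for the
iSCSI-backed disks at all (which is the open question above; the
values are only an example):

  * fail a command after roughly sd_retry_count * sd_io_time
  * 3 retries x 10 seconds = ~30 seconds before the error reaches ZFS
  set sd:sd_retry_count = 3
  set sd:sd_io_time = 10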
In a year of having this setup in customer beta I have never had
both Storage Nodes (or both sides of a mirror) down at the same
time. I'd like ZFS to take advantage of this. If (and only if) both
sides fail, then ZFS can enter failmode=wait behaviour.
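(Setting the pool-wide behaviour is straightforward - the pool name
below is a placeholder:

  zpool set failmode=wait tank
  zpool get failmode tank

but what I'm after is only falling back to that wait once BOTH sides
of a mirror are gone.)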
Currently using Nevada b96. Planning to move to build 100 shortly to
avoid zpool commands hanging while the pool is waiting on an
unreachable device.
David Anderson
Aktiom Networks, LLC
Ross wrote:
I discussed this exact issue on the forums in February, and filed a
bug at the time. I've also e-mailed and chatted with the iSCSI
developers, and the iSER developers a few times. There was also
another thread about the iSCSI timeouts being made configurable a few
months back, and finally, I started another discussion on ZFS
availability