Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread David Anderson
Trying to keep this in the spotlight. Apologies for the lengthy post.

I'd really like to see the features described by Ross in his summary of
the "Availability: ZFS needs to handle disk removal / driver failure
better" thread
(http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031).
I'd like to have these or similar features as well. Have there already
been internal discussions about adding this type of functionality to
ZFS itself, and if so, was there approval, disapproval or no decision?

Unfortunately my situation has put me in urgent need to find  
workarounds in the meantime.

My setup: I have two iSCSI target nodes (Storage Nodes), each exporting
six drives via iSCSI. A ZFS Node logs into every target on both Storage
Nodes and creates a mirrored zpool, with one drive from each Storage
Node making up each half of the mirrored vdevs (6 x 2-way mirrors).
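
For reference, the pool layout is roughly the following (a sketch only;
the pool name and c*t*d* device names are placeholders, since the real
names are derived from the iSCSI target IQNs and will differ):

   # one mirror vdev per drive pair: first disk from Storage Node A,
   # second disk from Storage Node B
   zpool create tank \
       mirror c2t1d0 c3t1d0 \
       mirror c2t2d0 c3t2d0 \
       mirror c2t3d0 c3t3d0 \
       mirror c2t4d0 c3t4d0 \
       mirror c2t5d0 c3t5d0 \
       mirror c2t6d0 c3t6d0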

My problem: if a Storage Node crashes completely, is disconnected from
the network, iscsitgt core dumps, a drive is pulled, or a drive has a
problem accessing data (read retries), then my ZFS Node hangs while ZFS
waits patiently for the layers below to report a problem and time out
the devices. This can lead to a halt of roughly three minutes or longer
when reading OR writing to the zpool on the ZFS Node. While that is
acceptable in certain situations, I have a case where the availability
demand is more severe.

My goal: figure out how to have the zpool pause for NO LONGER than 30  
seconds (roughly within a typical HTTP request timeout) and then issue  
reads/writes to the good devices in the zpool/mirrors while the other  
side comes back online or is fixed.

My ideas:
  1. In the case of the iSCSI targets disappearing (iscsitgt core
dump, Storage Node crash, Storage Node disconnected from the network),
I need to lower the iSCSI login retry/timeout values. Am I correct in
assuming the only way to accomplish this is to recompile the iSCSI
initiator? If so, can someone point me in the right direction (I have
never compiled the ONNV sources - do I need to do that, or can I just
recompile the iSCSI initiator)?

  1.a. I'm not sure in which initiator session states
iscsi_sess_max_delay applies - only for the initial login, or also in
the case of a reconnect? Ross, if you still have your test boxes
available, can you please try setting "set iscsi:iscsi_sess_max_delay
= 5" in /etc/system (see the sketch after this list), reboot, and try
failing your iSCSI vdevs again? I can't find a case where this was
tested for quick failover.

  1.b. I would much prefer to have bug 649 addressed and fixed rather
than having to resort to recompiling the iSCSI initiator (if
iscsi_sess_max_delay doesn't work). This seems like a trivial feature
to implement. How can I sponsor development?

  2. In the case of the iSCSI target being reachable but the physical
disk having problems reading/writing data (retryable events that take
roughly 60 seconds to time out), should I change the
iscsi_rx_max_window tunable with mdb? Is there a tunable for iscsi_tx?
Ross, I know you tried this recently in the thread referenced above
(with a value of 15), which resulted in a 60 second hang. How did you
offline the iSCSI volume to test this failure? Unless iSCSI uses a
multiple of that value for its retries, maybe the way you failed the
disk caused the iSCSI stack to follow a different failure path?
Unfortunately I don't know of a way to introduce read/write retries on
a disk while the disk is still reachable and presented via iscsitgt,
so I'm not sure how to test this.

  2.a. With the fix of
http://bugs.opensolaris.org/view_bug.do?bug_id=6518995, we can set
sd_retry_count along with sd_io_time to cause I/O failure when a
command takes longer than sd_retry_count * sd_io_time (see the sketch
after this list). Can (or should) these tunables be set on the imported
iSCSI disks on the ZFS Node, or can/should they be applied only to the
local disks on the Storage Nodes? If there is a way to apply them to
ONLY the imported iSCSI disks (and not the local disks) of the ZFS
Node, and without rebooting every time a new iSCSI disk is imported,
then I'm thinking this is the way to go.
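
To make the above concrete, here is a rough sketch of the settings I
have in mind. These are assumptions only: that iscsi_sess_max_delay is
honoured on reconnect as well as on the initial login, that
sd_retry_count/sd_io_time are the global names referred to in the bug
above, and that the values shown are reasonable - they are examples,
not tested recommendations:

   * /etc/system on the ZFS Node (needs a reboot to take effect)
   * give up on an unreachable target after ~5 seconds
   set iscsi:iscsi_sess_max_delay = 5
   * fail a command after roughly sd_retry_count * sd_io_time seconds
   set sd:sd_retry_count = 3
   set sd:sd_io_time = 10

   # Runtime equivalents via mdb (no reboot needed, but note the sd
   # values are global and hit local disks as well as iSCSI ones):
   echo 'sd_retry_count/W 0t3' | mdb -kw
   echo 'sd_io_time/W 0t10' | mdb -kw
   # the tunable Ross adjusted in the thread referenced above:
   echo 'iscsi_rx_max_window/W 0t15' | mdb -kw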

In a year of having this setup in customer beta I have never had both
Storage Nodes (or both sides of a mirror) down at the same time. I'd
like ZFS to take advantage of this. If (and only if) both sides fail,
then ZFS can enter failmode=wait.
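
For reference, the whole-pool behaviour is already controllable as a
property on recent builds (pool name is a placeholder):

   zpool get failmode tank
   zpool set failmode=wait tank    # block and wait for devices to return
   # failmode=continue instead returns EIO for new writes while reads
   # from the remaining healthy devices keep working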

Currently using Nevada b96. Planning to move to b100 shortly to avoid
zpool commands hanging while the pool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
  I discussed this exact issue on the forums in February, and filed a  
bug at the time.  I've also e-mailed and chatted with the iSCSI  
developers, and the iSER developers a few times.  There has also been
another thread about the iSCSI timeouts being made configurable a few  
months back, and finally, I started another discussion on ZFS  
availability

Re: [zfs-discuss] ZFS + ISCSI + LINUX QUESTIONS

2007-05-31 Thread David Anderson

 -Rule of thumb: 1GHz of CPU is needed to saturate a 1GbE link.
 -Run 64-bit Solaris and give ZFS nodes as much RAM as possible
 -Read the documentation
 -...
 -Profit  ;)

David Anderson





[zfs-discuss] nv59 + HA-ZFS

2007-03-13 Thread David Anderson

Hi,

I'm attempting to build a ZFS SAN with iSCSI+IPMP transport. I have two 
ZFS nodes that access iSCSI disks on the storage network; the ZFS nodes 
then share ZVOLs via iSCSI to my front-end Linux boxes. My throughput 
from one Linux box is about 170+ MB/s with nv59 (earlier builds were 
about 60 MB/s), so I am pleased with the performance so far.
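
The export side is plain ZVOL-over-iSCSI, roughly like this (volume
name and size are placeholders):

   # carve a 100 GB volume out of the pool and export it via the
   # built-in iSCSI target daemon
   zfs create -V 100g tank/lun0
   zfs set shareiscsi=on tank/lun0
   iscsitadm list target -v    # verify the new target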


My next step is to configure HA-ZFS for failover between the two ZFS 
nodes. Does Sun Cluster 3.2 work with SXCE? If so, are there any caveats 
for my situation?


Has anyone been successful in building an HA-ZFS environment without 
SC? I'd rather not go this route if possible, but if the s10u4 release 
is too far off I may have to give it a go.


Related question: anyone know when s10u4 is slated for release? June?

Thanks in advance,

--Dave