Greetings.
I'm working on a high-availability iSCSI target (tgt) cluster using
CentOS 6.4, to support a blade VM infrastructure. I've encountered a
number of problems i haven't found documented elsewhere (not for lack of
looking), and i want to run some solutions past the list to see what
sticks. I have one outstanding problem i haven't been able to work out,
which i will present in another thread to follow.
I'm going to forgo posting detailed configs initially here, as i think
the problems are abstract enough that it won't be necessary. If it turns
out to be necessary, we can do that. I hope also that this list is an
acceptable place for this; i found a lot of pointers here so it seemed
appropriate.
The cluster is two Dell boxes with a bunch of directly attached Dell SAS
storage. On each box, disks are organized into two RAID10 volumes and
four RAID6 volumes, to serve different profiles of IOPS and storage
volume needed by various applications. Each of these volumes is synced
to the other box using DRBD 8.4 over an LACP bond of two crossover 10
GbE links. The pacemaker/CMAN/corosync stack also talks over the
crossover. Each box is connected to the network using another 10 GbE
link, with a 1 GbE link bonded in active/backup mode and attached to a
separate switch so i can reboot the primary switch if necessary without
losing connectivity. (The 10 GbE transceivers for my switch
infrastructure are expensive, and multi-switch LACP is way too bleeding
edge for me.) Write cache is disabled on all tgt LUNs (using mode_page),
and all the RAIDs are battery backed.
One of my requirements here is to run multiple tgt targets. Each DRBD
volume is the sole physical volume of a distinct LVM volume group, and
each VG is assigned to a unique target. pacemaker naturally balances the
six targets into three on each box. This configuration has the advantage
that when most initiators are readers, outgoing bandwidth from each box
can theoretically hit 10 Gb/s, resulting in 20 Gb/s read bandwidth. (In
practice i'm maxing out around 12 Gb/s, but i think that's because of
limitations in my switch infrastructure.) Another advantage of multiple
targets is that when something screwy happens and a target goes down,
this only takes down the LUNs on that target, rather than the whole kit
and caboodle.
Another requirement i have is password authentication to the targets.
So, first the problems, then the workarounds:
The existing iSCSITarget RA is not designed to support multiple targets
with a single user account in the tgt implementation. But getting the
initiators (libvirt nodes, also using CentOS 6.4) to support a different
account for each target is non-trivial, since there's only one slot in
/etc/iscsi/iscsid.conf for authentication info. There are workarounds for
initiators but they're pretty ugly, especially if you want to manage
your iSCSI targets using libvirt pools.
In practice i ran into failover problems under load almost immediately.
Under load, when i would initiate a failover, there was a race
condition: the iSCSILogicalUnit RA will take down the LUNs one at a
time, waiting for each connection to terminate, and if the initiators
reconnect quickly enough, they get pissed off at finding that the target
still exists but the LUN they were using no longer does, which is often
the case during this transient takedown process. On the initiator, it
looks something like this, and it's fatal (here LUN 4 has gone away but
the target is still alive, maybe working on disconnecting LUN 3):
Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
One solution to this is using the portblock RA to block all initiator
traffic during failover, but this creates another problem: tgtd doesn't
allow established connections to expire as long as there's outstanding
data in the Send-Q for the TCP connection; if tgtd has already queued a
bunch of traffic to an initiator when a failover starts, and portblock
starts blocking ACK packets from the initiator, the Send-Q never drains,
and the tgtd connection hangs permanently. This stops failover from
completing, and eventually everyone is unhappy, especially me.
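A stuck connection of this kind can be spotted by looking for established
iSCSI connections whose Send-Q stays non-zero. The following is only a
sketch: the function name stuck_sendq is hypothetical, port 3260 is assumed
as the iSCSI port, and the column layout assumed is that of `netstat -tn` on
Linux (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat -tn` style output on stdin and prints
# the peer address and Send-Q of every ESTABLISHED TCP connection whose
# local port matches (default 3260) and whose Send-Q is non-zero.
stuck_sendq() {
    awk -v port="${1:-3260}" '
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" && $3 > 0 {
            # The local port is the last colon-separated piece of column 4.
            n = split($4, a, ":")
            if (a[n] == port) print $5, "Send-Q=" $3
        }'
}

# Intended use on the target node (not executed by this sketch):
#   netstat -tn | stuck_sendq 3260
```

If the same initiator keeps showing up with an unchanged Send-Q while
portblock is active, that matches the hang described above.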
I was having another problem with portblock: start and stop operations
would frequently but unpredictably fail when multiple targets were
simultaneously failing over. The error reported by pacemaker was
"'insufficient privileges' (rc=4)". This was pretty mysterious since
everything was running as root, and root has no problem executing
iptables. This would blow failover sequences up, and the target resource
would go down.
So, here's how i've worked around these problems. Comments about how
stupid i was not to have done X are, of course, welcome. I'd rather not
hear "You must have a pacemaker config problem," from people who haven't
at least tried to do multiple targets under load on 10 GbE media or
something similar; new things happen under these circumstances. I've got
stonith configured and all the necessary order and colocation
constraints in place, and with the following workarounds things are
stable, so i think i have the config correct.
1. Lack of support for multiple targets using the same tgt account. This
is a problem because the iSCSITarget RA defines the user and the target
at the same time. If it allowed multiple targets to use the same user,
it wouldn't know when it is safe to delete the user in a stop operation,
because some other target might still be using it.
To solve this i did two things: first i wrote a new RA that manages a
tgt user; this is instantiated as a clone so it runs along with the tgtd
clone. Second i tweaked the iSCSITarget RA so that on start, if
incoming_username is defined but incoming_password is not, the RA skips
the account creation step and simply binds the new target to
incoming_username. On stop, it similarly no longer deletes the account
if incoming_password is unset. I also had to relax the uniqueness
constraint on incoming_username in the RA metadata.
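The start/stop logic described in item 1 can be sketched roughly as follows.
This is a dry-run illustration, not the modified RA itself: it prints the
tgtadm commands instead of running them. The OCF_RESKEY_* names mirror the
RA parameters mentioned above, while the helper names (run,
target_account_start, target_account_stop) are made up for this example.

```shell
#!/bin/sh
# Dry-run sketch: "run" only echoes the command it is given.
run() { echo "$@"; }

# On start: create the account only when a password is supplied; otherwise
# assume the account already exists (managed by the separate user RA) and
# just bind the target to it.
target_account_start() {
    [ -n "${OCF_RESKEY_incoming_username:-}" ] || return 0
    if [ -n "${OCF_RESKEY_incoming_password:-}" ]; then
        run tgtadm --lld iscsi --op new --mode account \
            --user "$OCF_RESKEY_incoming_username" \
            --password "$OCF_RESKEY_incoming_password" || return 1
    fi
    run tgtadm --lld iscsi --op bind --mode account \
        --tid "$OCF_RESKEY_tid" --user "$OCF_RESKEY_incoming_username"
}

# On stop: delete the account only if this resource created it, that is,
# only when a password was supplied.
target_account_stop() {
    [ -n "${OCF_RESKEY_incoming_username:-}" ] || return 0
    [ -n "${OCF_RESKEY_incoming_password:-}" ] || return 0
    run tgtadm --lld iscsi --op delete --mode account \
        --user "$OCF_RESKEY_incoming_username"
}
```

With incoming_username set and incoming_password empty, start prints only
the bind command and stop prints nothing, so the shared account survives
until the user RA removes it.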
2. Disappearing LUNs during failover cause initiators to blow chunks.
For this i used portblock, but had to modify it because the TCP Send-Q
would never drain.
3. portblock preventing TCP Send-Q from draining, causing tgtd
connections to hang. I modified portblock to reverse the sense of the
iptables rules it was adding: instead of blocking traffic from the
initiator on the INPUT chain, it now blocks traffic from the target on
the OUTPUT chain with a tcp-reset response. With this setup, as soon as
portblock goes active, the next packet tgtd attempts to send to a given
initiator will get a TCP RST response, causing tgtd to hang up the
connection immediately. This configuration allows the connections to
terminate promptly under load.
I'm not totally satisfied with this workaround. It means
acknowledgements of operations tgtd has actually completed never make it
back to the initiator. I suspect this could cause problems in some
scenarios. I don't think it causes a problem the way i'm using it, with
each LUN as backing store for a distinct VM--when the LUN is back up on
the other node, the outstanding operations are re-sent by the initiator.
Maybe with a clustered filesystem this would cause problems; it
certainly would cause problems if the target device were, for example, a
tape drive.
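To make the difference in item 3 concrete, here is a sketch of the two rule
shapes. It is a dry run that only prints the iptables commands, and it is
not the actual portblock code: the function names are hypothetical, and the
address 192.0.2.10 and port 3260 in the usage lines are placeholders.

```shell
#!/bin/sh
# Dry-run sketch: each function echoes an iptables command line.
# $1 = target (service) IP address, $2 = TCP port.

# Conventional direction: silently drop traffic from the initiator.
block_input_rule() {
    echo iptables -I INPUT -p tcp -d "$1" --dport "$2" -j DROP
}

# Reversed direction: reject traffic leaving the target with a TCP reset,
# so tgtd sees the connection fail on its next send.
block_output_rule() {
    echo iptables -I OUTPUT -p tcp -s "$1" --sport "$2" \
        -j REJECT --reject-with tcp-reset
}

# Matching removal of the reversed rule when unblocking.
unblock_output_rule() {
    echo iptables -D OUTPUT -p tcp -s "$1" --sport "$2" \
        -j REJECT --reject-with tcp-reset
}

block_input_rule 192.0.2.10 3260
block_output_rule 192.0.2.10 3260
unblock_output_rule 192.0.2.10 3260
```

The REJECT target with --reject-with tcp-reset is only valid together with
-p tcp, which is why the protocol match is kept in both directions.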
4. "Insufficient privileges" faults in the portblock RA. This was
another race condition that occurred because i was using multiple
targets, meaning that without a mutex, multiple portblock invocations
would be running in parallel during a failover. If you try to run
iptables while another iptables is running, you get "Resource not
available" and this was coming back to pacemaker as "insufficient
privileges". This is simply a bug in the portblock RA; it should have a
mutex to prevent parallel iptables invocations. I fixed this by adding
an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for
start, stop, monitor, and status operations.
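The serialization idea in item 4 can be illustrated with a standalone
sketch. Inside the RA the locking is done with ocf_take_lock and
ocf_release_lock_on_exit from ocf-shellfuncs, as described above; the
version below swaps in a plain mkdir-based mutex so that it runs without the
OCF environment. The lock path and the function names take_lock,
release_lock, and with_lock are assumptions made for this example.

```shell
#!/bin/sh
# Stand-in mutex: mkdir either creates the directory or fails, atomically,
# so only one caller at a time can hold the lock.
LOCKDIR="${TMPDIR:-/tmp}/portblock-sketch.lock"   # hypothetical path

take_lock() {
    # Retry once per second until the directory can be created.
    until mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 1
    done
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Run a command while holding the lock and pass its exit status through.
with_lock() {
    take_lock
    "$@"
    rc=$?
    release_lock
    return "$rc"
}

# Intended use (not executed by this sketch):
#   with_lock iptables -I OUTPUT ...
```

Unlike ocf_release_lock_on_exit, this sketch does not install an exit trap,
so a process killed while holding the lock would leave the directory behind.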
I'm not sure why more people haven't run into these problems before. I
hope it's not that i'm doing things wrong, but rather that few others
have earnestly tried to build anything quite like this setup. If
anyone out there has set up a similar cluster and *not* had these
problems, i'd like to know about it. Meanwhile, if others *have* had
these problems, i'd also like to know, especially if they've found
alternate solutions.
Thanks in advance.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org