[Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)

Jefferson Ogata Tue, 12 Nov 2013 17:47:21 -0800

Greetings.

I'm working on a high-availability iSCSI target (tgt) cluster using CentOS 6.4, to support a blade VM infrastructure. I've encountered a number of problems i haven't found documented elsewhere (not for lack of looking), and i want to run some solutions past the list to see what sticks. I have one outstanding problem i haven't been able to work out, which i will present in another thread to follow.

I'm going to forgo posting detailed configs initially here, as i think the problems are abstract enough that it won't be necessary. If it turns out to be necessary, we can do that. I hope also that this list is an acceptable place for this; i found a lot of pointers here so it seemed appropriate.

The cluster is two Dell boxes with a bunch of directly attached Dell SAS storage. On each box, disks are organized into two RAID10 volumes and four RAID6 volumes, to serve different profiles of IOPS and storage volume needed by various applications. Each of these volumes is synced to the other box using DRBD 8.4 over an LACP bond of two crossover 10 GbE links. The pacemaker/CMAN/corosync stack also talks over the crossover. Each box is connected to the network using another 10 GbE link, with a 1 GbE link bonded in active/backup mode and attached to a separate switch so i can reboot the primary switch if necessary without losing connectivity. (The 10 GbE transceivers for my switch infrastructure are expensive, and multi-switch LACP is way too bleeding edge for me.) Write cache is disabled on all tgt LUNs (using mode_page), and all the RAIDs are battery backed.

One of my requirements here is to run multiple tgt targets. Each DRBD volume is the sole physical volume of a distinct LVM volume group, and each VG is assigned to a unique target. pacemaker naturally balances the six targets into three on each box. This configuration has the advantage that when most initiators are readers, outgoing bandwidth from each box can theoretically hit 10 Gb/s, resulting in 20 Gb/s aggregate read bandwidth. (In practice i'm maxing out around 12 Gb/s, but i think that's because of limitations in my switch infrastructure.) Another advantage of multiple targets is that when something screwy happens and a target goes down, this only takes down the LUNs on that target, rather than the whole kit and caboodle.


Another requirement i have is password authentication to the targets.

So, first the problems, then the workarounds:

The existing iSCSITarget RA is not designed to support multiple targets with a single user account in the tgt implementation. But getting the initiators (libvirt nodes, also using CentOS 6.4) to support a different account for each target is non-trivial, since there's only one slot in /etc/iscsi/iscsid.conf for authentication info. There are workarounds for initiators but they're pretty ugly, especially if you want to manage your iSCSI targets using libvirt pools.

In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):

Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656

One solution to this is using the portblock RA to block all initiator traffic during failover, but this creates another problem: tgtd doesn't allow established connections to expire as long as there's outstanding data in the Send-Q for the TCP connection; if tgtd has already queued a bunch of traffic to an initiator when a failover starts, and portblock starts blocking ACK packets from the initiator, the Send-Q never drains, and the tgtd connection hangs permanently. This stops failover from completing, and eventually everyone is unhappy, especially me.
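
For anyone wanting to check whether they are hitting the same Send-Q hang, here is a minimal diagnostic sketch in Python. It is not part of any resource agent. It parses the text of /proc/net/tcp and lists established connections on the target's port whose send queue is non-empty. The function name stuck_send_queues is made up for illustration, port 3260 is assumed (the iSCSI default; the actual port is not stated above), and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

ISCSI_PORT = 3260  # assumption: default iSCSI port, not stated in the post


def stuck_send_queues(proc_net_tcp_text, port=ISCSI_PORT):
    """Return (remote_ip, remote_port, send_q_bytes) for each established
    IPv4 connection on local `port` that still has unsent/unacked data."""
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote, state, queues = fields[1], fields[2], fields[3], fields[4]
        if state != "01":  # 01 is TCP_ESTABLISHED
            continue
        if int(local.split(":")[1], 16) != port:
            continue
        send_q = int(queues.split(":")[0], 16)  # field is tx_queue:rx_queue in hex
        if send_q == 0:
            continue
        rem_hex, rem_port_hex = remote.split(":")
        # The kernel prints the address as a native-endian hex word;
        # "<I" assumes a little-endian host.
        rem_ip = socket.inet_ntoa(struct.pack("<I", int(rem_hex, 16)))
        results.append((rem_ip, int(rem_port_hex, 16), send_q))
    return results
```

Feeding it the contents of /proc/net/tcp on the target node while a failover is stuck should show which initiator connections are the ones holding things up; the same numbers appear in the Send-Q column of netstat.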

I was having another problem with portblock: start and stop operations would frequently but unpredictably fail when multiple targets were simultaneously failing over. The error reported by pacemaker was "'insufficient privileges' (rc=4)". This was pretty mysterious since everything was running as root, and root has no problem executing iptables. This would blow failover sequences up, and the target resource would go down.

So, here's how i've worked around these problems. Comments about how stupid i was not to have done X are, of course, welcome. I'd rather not hear "You must have a pacemaker config problem," from people who haven't at least tried to do multiple targets under load on 10 GbE media or something similar; new things happen under these circumstances. I've got stonith configured and all the necessary order and colocation constraints in place, and with the following workarounds things are stable, so i think i have the config correct.

1. Lack of support for multiple targets using the same tgt account. This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.

To solve this i did two things: first i wrote a new RA that manages a tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
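
The account-handling decision can be sketched as follows. This is an illustrative Python model of the logic only, not the shell code of the modified RA; the function names are hypothetical, and everything the real iSCSITarget RA does besides target and account handling (allowed initiators and so on) is left out. The tgtadm invocations are the standard account operations from the tgt documentation.

```python
TGTADM = ["tgtadm", "--lld", "iscsi"]


def target_start_commands(tid, iqn, incoming_username=None, incoming_password=None):
    """Commands to bring up one target (hypothetical helper)."""
    cmds = [TGTADM + ["--op", "new", "--mode", "target",
                      "--tid", str(tid), "--targetname", iqn]]
    if incoming_username:
        if incoming_password:
            # Stock behaviour: this target owns the account, so create it.
            cmds.append(TGTADM + ["--op", "new", "--mode", "account",
                                  "--user", incoming_username,
                                  "--password", incoming_password])
        # With no password the account is assumed to exist already
        # (created by the separate cloned user RA); just bind to it.
        cmds.append(TGTADM + ["--op", "bind", "--mode", "account",
                              "--tid", str(tid), "--user", incoming_username])
    return cmds


def target_stop_commands(tid, incoming_username=None, incoming_password=None):
    """Commands to take down one target (hypothetical helper)."""
    cmds = [TGTADM + ["--op", "delete", "--mode", "target", "--tid", str(tid)]]
    if incoming_username and incoming_password:
        # Only delete an account this target created itself; a shared
        # account may still be in use by another target.
        cmds.append(TGTADM + ["--op", "delete", "--mode", "account",
                              "--user", incoming_username])
    return cmds
```

The point of keying on incoming_password is that ownership of the account becomes explicit: a target given a password creates and deletes the account, and a target given only a username never touches it.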

2. Disappearing LUNs during failover cause initiators to blow chunks. For this i used portblock, but had to modify it because the TCP Send-Q would never drain.

3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang. I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.

I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
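
To make the difference in item 3 concrete, the following Python sketch builds the two iptables command lines side by side. It is a rough illustration, not the portblock source: the helper name and the "input-drop"/"output-reset" labels are invented here, and the exact match options (multiport, matching on the service IP as destination or source) are assumptions about how such a rule would be written.

```python
def portblock_rule(ip, ports, variant):
    """Return the iptables argv for blocking `ports` on service address `ip`.

    variant "input-drop":   silently drop what initiators send us.
    variant "output-reset": reject what the target tries to send, with a RST.
    """
    if variant == "input-drop":
        return ["iptables", "-I", "INPUT", "-p", "tcp", "-d", ip,
                "-m", "multiport", "--dports", ports, "-j", "DROP"]
    if variant == "output-reset":
        return ["iptables", "-I", "OUTPUT", "-p", "tcp", "-s", ip,
                "-m", "multiport", "--sports", ports,
                "-j", "REJECT", "--reject-with", "tcp-reset"]
    raise ValueError("unknown variant: %r" % (variant,))
```

With the INPUT/DROP form the initiator's ACKs vanish and tgtd keeps waiting on a Send-Q that can never drain; with the OUTPUT/REJECT form the local stack answers tgtd's own outgoing segment with a reset, so the socket dies right away.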

4. "Insufficient privileges" faults in the portblock RA. This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available" and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
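
The serialization idea in item 4 can be illustrated outside the RA with a small Python sketch. It uses flock(2) on a lock file, which is an assumption chosen for brevity and not a description of how the OCF shell helpers ocf_take_lock and ocf_release_lock_on_exit work internally; the names exclusive_lock and run_serialized are hypothetical.

```python
import contextlib
import fcntl
import os
import subprocess


@contextlib.contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds it
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def run_serialized(argv, lock_path):
    """Run one command while holding the lock, so that concurrent callers
    using the same lock_path never execute their commands in parallel."""
    with exclusive_lock(lock_path):
        return subprocess.call(argv)
```

Every caller that wraps its iptables invocation in run_serialized with the same lock file waits its turn instead of colliding, which is the property the parallel portblock start and stop operations were missing.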

I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.


Thanks in advance.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
