19.11.2013 13:48, Lars Ellenberg wrote:
> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>> ...
>>>
>>> In practice I ran into failover problems under load almost immediately.
>>> When I initiated a failover under load, there was a race condition: the
>>> iSCSILogicalUnit RA takes down the LUNs one at a time, waiting for each
>>> connection to terminate, and if the initiators reconnect quickly enough,
>>> they get pissed off at finding that the target still exists but the LUN
>>> they were using no longer does, which is often the case during this
>>> transient takedown process. On the initiator it looks something like
>>> this, and it's fatal (here LUN 4 has gone away but the target is still
>>> alive, maybe still busy disconnecting LUN 3):
>>>
>>> Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>>> Nov  7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>>> Nov  7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>> ...
>>
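For reference, the multipath.conf bit for that looks roughly like the
following (a minimal sketch; a real configuration would also carry
device/blacklist sections tuned to the setup):

    defaults {
            # queue I/O indefinitely while all paths are down,
            # instead of failing it up to the filesystem/VM
            no_path_retry    queue
    }

With that, the initiator-side device-mapper just stalls I/O across the
failover window instead of returning errors to the upper layers.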
>>>
>>> 1. Lack of support for multiple targets using the same tgt account. This
>>> is a problem because the iSCSITarget RA defines the user and the target
>>> at the same time. If it allowed multiple targets to use the same user,
>>> it wouldn't know when it is safe to delete the user in a stop operation,
>>> because some other target might still be using it.
>>>
>>> To solve this I did two things: first, I wrote a new RA that manages a
> 
> Did I miss it, or did you post it somewhere?
> Fork on GitHub and push there, so we can have a look?
> 
>>> tgt user; this is instantiated as a clone so it runs along with the tgtd
>>> clone. Second, I tweaked the iSCSITarget RA so that on start, if
>>> incoming_username is defined but incoming_password is not, the RA skips
>>> the account creation step and simply binds the new target to
>>> incoming_username. On stop, it likewise no longer deletes the account
>>> if incoming_password is unset. I also had to relax the uniqueness
>>> constraint on incoming_username in the RA metadata.
>>>
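In tgtadm terms the split presumably looks something like this (a sketch;
the user name, password and tid here are made up):

    # done once per node, by the new tgt-user RA clone
    tgtadm --lld iscsi --op new --mode account \
           --user iscsiuser --password secret

    # done by the tweaked iSCSITarget RA on start: bind the
    # pre-existing account to the target instead of creating it
    tgtadm --lld iscsi --op bind --mode account \
           --tid 1 --user iscsiuser

That way the account's lifetime is tied to tgtd itself, and an individual
target's stop operation never has to decide whether the user is still in
use by some other target.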
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>>> For this I used portblock, but had to modify it because the TCP Send-Q
>>> would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd
>>> connections to hang. I modified portblock to reverse the sense of the
>>> iptables rules it was adding: instead of blocking traffic from the
>>> initiator on the INPUT chain, it now blocks traffic from the target on
>>> the OUTPUT chain with a tcp-reset response. With this setup, as soon as
>>> portblock goes active, the next packet tgtd attempts to send to a given
>>> initiator will get a TCP RST response, causing tgtd to hang up the
>>> connection immediately. This configuration allows the connections to
>>> terminate promptly under load.
>>>
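If I understand the change correctly, in plain iptables terms it is roughly
the difference between these two rules (a sketch, assuming the default
iSCSI port 3260; <initiator> is a placeholder):

    # stock behaviour: silently drop what the initiator sends us
    #iptables -I INPUT -p tcp --dport 3260 -s <initiator> -j DROP

    # modified behaviour: actively reset whatever tgtd tries to send,
    # so the target side hangs up the connection immediately
    iptables -I OUTPUT -p tcp --sport 3260 -d <initiator> \
             -j REJECT --reject-with tcp-reset

The RST is generated locally, so it is tgtd's own TCP stack that sees the
reset and tears the connection down without waiting for the Send-Q.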
>>> I'm not totally satisfied with this workaround. It means
>>> acknowledgements of operations tgtd has actually completed never make it
>>> back to the initiator. I suspect this could cause problems in some
>>> scenarios. I don't think it causes a problem the way I'm using it, with
>>> each LUN as backing store for a distinct VM: when the LUN is back up on
>>> the other node, the outstanding operations are re-sent by the initiator.
>>> Maybe with a clustered filesystem this would cause problems; it
>>> certainly would cause problems if the target device were, for example, a
>>> tape drive.
> 
> Maybe only block "new" incoming connection attempts?
> 
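Something like this, I suppose (a sketch, assuming iSCSI on port 3260 and
a conntrack-enabled kernel):

    # let established sessions finish, refuse only fresh logins
    iptables -I INPUT -p tcp --dport 3260 \
             -m conntrack --ctstate NEW -j DROP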

That may cause issues on the initiator side in some circumstances (IIRC):
* connection is established
* pacemaker fires the target move
* target is destroyed, the connection breaks (TCP RST is sent to the
initiator)
* initiator connects again
* target is not yet available at the iSCSI level (but the portals answer,
either on the old or on the new node), or the portals are not available
at all
* initiator *returns an error* to the upper layer <- this one is important
* only then is the target configured on the other node

I was hit by this, but that was several years ago, so I may be missing
some details.

My experience with IET and LIO shows it is better (safer) to block all
iSCSI traffic to the target's portals, in both directions:
* connection is established
* pacemaker fires the target move
* both directions are blocked (DROP) on both target nodes (see the sketch
after this list)
* target is destroyed; the connection stays "established" on the initiator
side, TCP packets just time out
* target is configured on the other node (the VIPs are moved too)
* firewall rules are removed
* initiator (re)sends its request
* target sends an RST (?) back - it doesn't know about that connection
* initiator reconnects and continues to use the target
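
The blocking step in that sequence is nothing more than plain DROPs on
both chains, along these lines (a sketch; <portal VIP> is a placeholder
and 3260 the assumed portal port):

    # on both target nodes, before the target is torn down
    iptables -I INPUT  -p tcp -d <portal VIP> --dport 3260 -j DROP
    iptables -I OUTPUT -p tcp -s <portal VIP> --sport 3260 -j DROP

Because nothing is rejected, the initiator never sees an RST during the
move; its requests just sit in the retransmit queue until the rules are
removed on the new node.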

