Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-21 16:34, Jefferson Ogata wrote:

> On 2013-11-20 08:35, Jefferson Ogata wrote:
>> Indeed, using iptables with REJECT and tcp-reset, this seems to piss off the initiators, creating immediate i/o errors. But one can use DROP on incoming SYN packets and let established connections drain. I've been trying to get this to work but am finding that it takes so long for some connections to drain that something times out. I haven't given up on this approach, tho.
>>
>> Testing this stuff can be tricky because if i make one mistake, stonith kicks in and i end up having to wait 5-10 minutes for the machine to reboot and resync its DRBD devices.
>
> Follow-up on this: the original race condition i reported still occurs with this strategy: if existing TCP connections are allowed to drain by passing packets from established initiator connections (by blocking only SYN packets), then the initiator can also send new requests to the target during the takedown process; the takedown removes LUNs from the live target, and the initiator generates an i/o error if it happens to try to access a LUN that has been removed before the connection is removed.
>
> This happens because the configuration looks something like this (crm):
>
>     group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock
>
> On takedown, if portblock is tweaked to pass packets for existing connections so they can drain, there's a window while LUNs lun3, lun2, lun1 are being removed from the target in which this race condition occurs. The connection isn't removed until iSCSITarget runs to stop the target.
>
> A way to handle this that should actually work is to write a new RA that deletes the connections from the target *before* the LUNs are removed during takedown. The config would then look something like this:
>
>     group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock
>
> On takedown, then, portunblock will block new incoming connections, tgtConnections will shut down existing connections and wait for them to drain, and then the LUNs can be safely removed before the target is taken down. I'll write this RA today and see how that works.

So, this strategy worked. The final RA is attached.

The config (crm) then looks like this, using the tweaked portblock RA that blocks SYN only, the tgtUser RA that adds a tgtd user, and the tweaked iSCSITarget RA that doesn't add a user if no password is provided (see previous discussion for the latter two RAs). This is a two-node cluster using DRBD-backed LVM and multiple targets. The names have been changed to protect the innocent, and the config is simplified to a single target for brevity, but it should be clear how to extend it to multiple DRBDs/VGs/targets. I've left out the stonith config here as well.
primitive tgtd lsb:tgtd \
    op monitor interval=10s
clone clone.tgtd tgtd
primitive user.username ocf:local:tgtUser \
    params username=username password=password
clone clone.user.username user.username
order clone.tgtd_before_clone.user.username inf: clone.tgtd:start clone.user.username:start
primitive drbd.pv1 ocf:linbit:drbd \
    params drbd_resource=pv1 \
    op monitor role=Slave interval=29s timeout=600s \
    op monitor role=Master interval=31s timeout=600s \
    op start timeout=240s \
    op stop timeout=240s
ms ms.drbd.pv1 drbd.pv1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
primitive lvm.vg1 ocf:heartbeat:LVM \
    params volgrpname=vg1 \
    op monitor interval=30s timeout=30s \
    op start timeout=30s \
    op stop timeout=30s
order ms.drbd.pv1_before_lvm.vg1 inf: ms.drbd.pv1:promote lvm.vg1:start
colocation ms.drbd.pv1_with_lvm.vg1 inf: ms.drbd.pv1:Master lvm.vg1
primitive target.1 ocf:local:iSCSITarget \
    params iqn=iqnt1 tid=1 incoming_username=username implementation=tgt portals="" \
    op monitor interval=30s \
    op start timeout=30s \
    op stop timeout=120s
primitive lun.1.1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn=iqnt1 lun=1 path=/dev/vg1/lv1 \
        additional_parameters="scsi_id=vg1/lv1 mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0" \
        implementation=tgt \
    op monitor interval=30s \
    op start timeout=30s \
    op stop timeout=120s
primitive ip.192.168.1.244 ocf:heartbeat:IPaddr \
    params ip=192.168.1.244 cidr_netmask=24 nic=bond0
primitive portblock.ip.192.168.1.244 ocf:local:portblock \
    params ip=192.168.1.244 action=block protocol=tcp portno=3260 syn_only=true \
    op monitor interval=10s timeout=10s depth=0
primitive tgtfinal.1 ocf:local:tgtFinal \
    params tid=1 \
    op monitor interval=30s timeout=30s \
    op stop timeout=60s
primitive portunblock.ip.192.168.1.244 ocf:local:portblock \
    params ip=192.168.1.244 action=unblock protocol=tcp portno=3260 syn_only=true \
    op monitor interval=10s timeout=10s depth=0
group group.target.1 lvm.vg1 portblock.ip.192.168.1.244 ip.192.168.1.244 target.1 lun.1.1 tgtfinal.1 portunblock.ip.192.168.1.244
order clone.tgtd_before_group.target.1 inf: clone.tgtd:start group.target.1:start
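For reference, here is a minimal sketch of what the stop action of a connection-draining RA like tgtFinal might do. This is an assumption based on tgt's standard tgtadm CLI and its usual conn-mode output, not the attached RA itself; the attached script is authoritative.

    #!/bin/sh
    # Sketch of a tgtFinal-style stop action: force-close all initiator
    # connections on a tgt target and wait for them to drain before the
    # iSCSILogicalUnit resources remove the LUNs.
    # Assumes tgtadm from scsi-target-utils.

    tid="${OCF_RESKEY_tid}"

    tgtfinal_stop() {
        # tgtadm prints "Session: N" lines with indented "Connection: M"
        # lines beneath them; ask tgtd to delete each (sid, cid) pair.
        tgtadm --lld iscsi --mode conn --op show --tid "$tid" |
        awk '/^Session:/ { sid = $2 } /Connection:/ { print sid, $2 }' |
        while read sid cid; do
            tgtadm --lld iscsi --mode conn --op delete \
                --tid "$tid" --sid "$sid" --cid "$cid"
        done

        # Wait for the connections to actually disappear; the resource's
        # stop timeout bounds how long pacemaker will let this run.
        while tgtadm --lld iscsi --mode conn --op show --tid "$tid" |
              grep -q '^Session:'; do
            sleep 1
        done
        return "$OCF_SUCCESS"
    }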
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-20 08:35, Jefferson Ogata wrote:

> Indeed, using iptables with REJECT and tcp-reset, this seems to piss off the initiators, creating immediate i/o errors. But one can use DROP on incoming SYN packets and let established connections drain. I've been trying to get this to work but am finding that it takes so long for some connections to drain that something times out. I haven't given up on this approach, tho.
>
> Testing this stuff can be tricky because if i make one mistake, stonith kicks in and i end up having to wait 5-10 minutes for the machine to reboot and resync its DRBD devices.

Follow-up on this: the original race condition i reported still occurs with this strategy: if existing TCP connections are allowed to drain by passing packets from established initiator connections (by blocking only SYN packets), then the initiator can also send new requests to the target during the takedown process; the takedown removes LUNs from the live target, and the initiator generates an i/o error if it happens to try to access a LUN that has been removed before the connection is removed.

This happens because the configuration looks something like this (crm):

    group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock

On takedown, if portblock is tweaked to pass packets for existing connections so they can drain, there's a window while LUNs lun3, lun2, lun1 are being removed from the target in which this race condition occurs. The connection isn't removed until iSCSITarget runs to stop the target.

A way to handle this that should actually work is to write a new RA that deletes the connections from the target *before* the LUNs are removed during takedown. The config would then look something like this:

    group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1 iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock

On takedown, then, portunblock will block new incoming connections, tgtConnections will shut down existing connections and wait for them to drain, and then the LUNs can be safely removed before the target is taken down. I'll write this RA today and see how that works.
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-20 07:04, Vladislav Bogdanov wrote:

> 19.11.2013 13:48, Lars Ellenberg wrote:
>> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>>> 13.11.2013 04:46, Jefferson Ogata wrote:
>>>> ...
>>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang.
>>>>
>>>> I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>>>
>>>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>>
>> Maybe only block new incoming connection attempts?
>
> That may cause issues on an initiator side in some circumstances (IIRC):
>
> * connection is established
> * pacemaker fires target move
> * target is destroyed, connection breaks (TCP RST is sent to initiator)
> * initiator connects again
> * target is not available on iSCSI level (but portals answer either on old or on new node) or portals are not available
> * initiator *returns error* to an upper layer - this one is important
> * target is configured on other node then
>
> I was hit by this, but that was several years ago, so I may miss some details.

Indeed, using iptables with REJECT and tcp-reset, this seems to piss off the initiators, creating immediate i/o errors. But one can use DROP on incoming SYN packets and let established connections drain. I've been trying to get this to work but am finding that it takes so long for some connections to drain that something times out. I haven't given up on this approach, tho.

Testing this stuff can be tricky because if i make one mistake, stonith kicks in and i end up having to wait 5-10 minutes for the machine to reboot and resync its DRBD devices.

> My experience with IET and LIO shows it is better (safer) to block all iSCSI traffic to target's portals, both directions:
>
> * connection is established
> * pacemaker fires target move
> * both directions are blocked (DROP) on both target nodes
> * target is destroyed, connection stays established on initiator side, just TCP packets timeout
> * target is configured on other node (VIPs are moved too)
> * firewall rules are removed
> * initiator (re)sends request
> * target sends RST (?) back - it doesn't have that connection
> * initiator reconnects and continues to use target

As already noted, this approach doesn't work with TGT because it refuses to tear down its config until it has drained all its connections, and they can't drain if ACK packets can't come in. The only reliable solution i've found so far is to send RSTs to tgtd (but leave the initiators in the dark).

I'm also using VIPs. They don't have to be bound to a specific target in a tgt configuration; you just have each initiator connect to a given target only using its unique VIP.
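To make the two firewalling approaches discussed above concrete, the underlying iptables rules are roughly as follows. This is a sketch with an illustrative VIP and the standard iSCSI port; the real rules are generated inside the (modified) portblock RA.

    VIP=192.168.1.244

    # (a) Reset outbound target traffic: the next packet tgtd sends on the
    # connection gets an immediate TCP RST, so tgtd hangs up right away,
    # while the initiator is left in the dark and simply retries later.
    iptables -I OUTPUT -p tcp -s "$VIP" --sport 3260 \
        -j REJECT --reject-with tcp-reset

    # (b) Drop only new inbound connections (SYN) and let established
    # connections drain; this only works if draining finishes before
    # cluster timeouts fire.
    iptables -I INPUT -p tcp -d "$VIP" --dport 3260 --syn -j DROP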
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:

> 13.11.2013 04:46, Jefferson Ogata wrote:
>> ...
>> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>>
>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>
>> One solution to this is using the portblock RA to block all initiator
>
> In addition I force use of multipath on initiators with no_path_retry=queue
>
>> ...
>> 1. Lack of support for multiple targets using the same tgt account.
>>
>> This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>>
>> To solve this i did two things: first i wrote a new RA that manages a

Did I miss it, or did you post it somewhere? Fork on Github and push there, so we can have a look?

>> tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>>
>> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>>
>> For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>>
>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang.
>>
>> I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>
>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.

Maybe only block new incoming connection attempts?

>> 4. Insufficient privileges faults in the portblock RA.
>>
>> This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available", and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
>>
>> I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.
>
> Can't say about 1, I use IET, it doesn't seem to have that limitation.
>
> 2 - I use alternative home-brew ms RA which blocks (DROP) both input and output for a specified VIP on demote (targets are configured to be bound to those VIPs). I also export one big LUN per target and then set up a clvm VG on top of it (all initiators are in the same, separate cluster).
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-19 10:48, Lars Ellenberg wrote:

> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>>> ...
>>> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>>>
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>>> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>>> ...
>>> 1. Lack of support for multiple targets using the same tgt account.
>>>
>>> This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>>>
>>> To solve this i did two things: first i wrote a new RA that manages a
>
> Did I miss it, or did you post it somewhere? Fork on Github and push there, so we can have a look?

Not set up with git right now; i've attached it here. It's short.

>>> tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>>>
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>>>
>>> For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang.
>>>
>>> I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>>
>>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>
> Maybe only block new incoming connection attempts?

That's a good idea. Theoretically that should allow the existing connections to drain. I'm worried it can lead to pacemaker timeouts firing if there's a lot of queued data in the send queues, but i'll test it. Thanks for the suggestion.

#!/bin/sh
#
# Resource script for managing tgt users
#
# Description: Manages a tgt user as an OCF resource in
#              an High Availability setup.
#
# Author: Jefferson Ogata jefferson.og...@noaa.gov
# License: GNU General Public License (GPL)
#
#
# usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
# The start arg adds the user.
#
# The stop arg deletes it.
#
# OCF parameters:
# OCF_RESKEY_username
# OCF_RESKEY_password
#
# Example crm configuration:
#
# primitive tgtd lsb:tgtd op monitor interval=10s
# clone clone.tgtd tgtd
# primitive user.foo ocf:heartbeat:tgtUser params username=foo password=secret
# clone clone.user.foo user.foo
# order clone.tgtd_before_clone.user.foo inf: clone.tgtd:start clone.user.foo:start
#
##

# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";

##

usage() {
    echo "$USAGE" >&2
}

meta_data() {
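[The attached script is truncated above. Presumably its start/stop/monitor actions wrap tgt's account operations along the following lines; this is a sketch based on scsi-target-utils' tgtadm, and the attached script remains authoritative.]

    # start: create the tgtd account
    tgtadm --lld iscsi --mode account --op new \
        --user "$OCF_RESKEY_username" --password "$OCF_RESKEY_password"

    # monitor/status: check that the account exists in tgtd's account list
    tgtadm --lld iscsi --mode account --op show | grep -qw "$OCF_RESKEY_username"

    # stop: delete the account
    tgtadm --lld iscsi --mode account --op delete --user "$OCF_RESKEY_username"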
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
On 2013-11-13 06:02, Vladislav Bogdanov wrote:

> 13.11.2013 04:46, Jefferson Ogata wrote:
>> [snip]
>> 4. Insufficient privileges faults in the portblock RA.
>>
>> This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available", and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
>>
>> I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.
>
> Can't say about 1, I use IET, it doesn't seem to have that limitation.
>
> 2 - I use alternative home-brew ms RA which blocks (DROP) both input and output for a specified VIP on demote (targets are configured to be bound to those VIPs). I also export one big LUN per target and then set up a clvm VG on top of it (all initiators are in the same, separate cluster).
>
> 3 - can't say as well, IET is probably not affected.
>
> 4 - That is true, iptables doesn't have atomic rules management, so you definitely need a mutex or a dispatcher like firewalld (didn't try it though).

The issue here is not really the lack of atomic rules management (which you can hack around in iptables by defining a new chain for the rules you want to add and adding a jump to it in the INPUT or OUTPUT chain). The issue i was encountering is that the iptables process takes a lock; if a second iptables process attempts to run concurrently, it can't take the lock, and it fails. The RA was feeding this back to pacemaker as a resource start/stop failure, when really the RA should have been preventing concurrent execution of iptables in the first place.

Thanks for the feedback.
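A sketch of the serialization fix described above, using the lock helpers that ocf-shellfuncs already provides (the lock file name here is illustrative):

    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    LOCKFILE="${HA_RSCTMP}/portblock.iptables.lock"

    # Arrange for the lock to be dropped however the RA exits...
    ocf_release_lock_on_exit "$LOCKFILE"

    case "$1" in
        start|stop|monitor|status)
            # ...and take it before any iptables invocation, so parallel
            # portblock instances are serialized instead of failing.
            ocf_take_lock "$LOCKFILE"
            ;;
    esac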
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
19.11.2013 13:48, Lars Ellenberg wrote:

> On Wed, Nov 13, 2013 at 09:02:47AM +0300, Vladislav Bogdanov wrote:
>> 13.11.2013 04:46, Jefferson Ogata wrote:
>>> ...
>>> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>>>
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
>>> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
>>> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>>>
>>> One solution to this is using the portblock RA to block all initiator
>>
>> In addition I force use of multipath on initiators with no_path_retry=queue
>>
>>> ...
>>> 1. Lack of support for multiple targets using the same tgt account.
>>>
>>> This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>>>
>>> To solve this i did two things: first i wrote a new RA that manages a
>
> Did I miss it, or did you post it somewhere? Fork on Github and push there, so we can have a look?
>
>>> tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>>>
>>> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>>>
>>> For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>>>
>>> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang.
>>>
>>> I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>>>
>>> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>
> Maybe only block new incoming connection attempts?

That may cause issues on an initiator side in some circumstances (IIRC):

* connection is established
* pacemaker fires target move
* target is destroyed, connection breaks (TCP RST is sent to initiator)
* initiator connects again
* target is not available on iSCSI level (but portals answer either on old or on new node) or portals are not available
* initiator *returns error* to an upper layer - this one is important
* target is configured on other node then

I was hit by this, but that was several years ago, so I may miss some details.

My experience with IET and LIO shows it is better (safer) to block all iSCSI traffic to target's portals, both directions:

* connection is established
* pacemaker fires target move
* both directions are blocked (DROP) on both target nodes
* target is destroyed, connection stays established on initiator side, just TCP packets timeout
* target is configured on other node (VIPs are moved too)
* firewall rules are removed
* initiator (re)sends request
* target sends RST (?) back - it doesn't have that connection
* initiator reconnects and continues to use target
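In iptables terms, the block-both-directions approach would look roughly like this (a sketch with an illustrative VIP; the home-brew ms RA mentioned above generates the real rules):

    VIP=192.168.1.244

    # demote/stop: silently drop all iSCSI traffic for the portal in both
    # directions, so the initiator only sees TCP timeouts and keeps retrying.
    iptables -I INPUT  -p tcp -d "$VIP" --dport 3260 -j DROP
    iptables -I OUTPUT -p tcp -s "$VIP" --sport 3260 -j DROP

    # once the target and VIP are up on the other node: remove the rules.
    iptables -D INPUT  -p tcp -d "$VIP" --dport 3260 -j DROP
    iptables -D OUTPUT -p tcp -s "$VIP" --sport 3260 -j DROP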
Re: [Linux-HA] drbd/pacemaker multiple tgt targets, portblock, and race conditions (long-ish)
13.11.2013 04:46, Jefferson Ogata wrote:

> ...
> In practice i ran into failover problems under load almost immediately. Under load, when i would initiate a failover, there was a race condition: the iSCSILogicalUnit RA will take down the LUNs one at a time, waiting for each connection to terminate, and if the initiators reconnect quickly enough, they get pissed off at finding that the target still exists but the LUN they were using no longer does, which is often the case during this transient takedown process. On the initiator, it looks something like this, and it's fatal (here LUN 4 has gone away but the target is still alive, maybe working on disconnecting LUN 3):
>
> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Sense Key : Illegal Request [current]
> Nov 7 07:39:29 s01c kernel: sd 6:0:0:4: [sde] Add. Sense: Logical unit not supported
> Nov 7 07:39:29 s01c kernel: Buffer I/O error on device sde, logical block 16542656
>
> One solution to this is using the portblock RA to block all initiator

In addition I force use of multipath on initiators with no_path_retry=queue

> ...
> 1. Lack of support for multiple targets using the same tgt account.
>
> This is a problem because the iSCSITarget RA defines the user and the target at the same time. If it allowed multiple targets to use the same user, it wouldn't know when it is safe to delete the user in a stop operation, because some other target might still be using it.
>
> To solve this i did two things: first i wrote a new RA that manages a tgt user; this is instantiated as a clone so it runs along with the tgtd clone. Second i tweaked the iSCSITarget RA so that on start, if incoming_username is defined but incoming_password is not, the RA skips the account creation step and simply binds the new target to incoming_username. On stop, it similarly no longer deletes the account if incoming_password is unset. I also had to relax the uniqueness constraint on incoming_username in the RA metadata.
>
> 2. Disappearing LUNs during failover cause initiators to blow chunks.
>
> For this i used portblock, but had to modify it because the TCP Send-Q would never drain.
>
> 3. portblock preventing TCP Send-Q from draining, causing tgtd connections to hang.
>
> I modified portblock to reverse the sense of the iptables rules it was adding: instead of blocking traffic from the initiator on the INPUT chain, it now blocks traffic from the target on the OUTPUT chain with a tcp-reset response. With this setup, as soon as portblock goes active, the next packet tgtd attempts to send to a given initiator will get a TCP RST response, causing tgtd to hang up the connection immediately. This configuration allows the connections to terminate promptly under load.
>
> I'm not totally satisfied with this workaround. It means acknowledgements of operations tgtd has actually completed never make it back to the initiator. I suspect this could cause problems in some scenarios. I don't think it causes a problem the way i'm using it, with each LUN as backing store for a distinct VM--when the LUN is back up on the other node, the outstanding operations are re-sent by the initiator. Maybe with a clustered filesystem this would cause problems; it certainly would cause problems if the target device were, for example, a tape drive.
>
> 4. Insufficient privileges faults in the portblock RA.
>
> This was another race condition that occurred because i was using multiple targets, meaning that without a mutex, multiple portblock invocations would be running in parallel during a failover. If you try to run iptables while another iptables is running, you get "Resource not available", and this was coming back to pacemaker as "insufficient privileges". This is simply a bug in the portblock RA; it should have a mutex to prevent parallel iptables invocations. I fixed this by adding an ocf_release_lock_on_exit at the top, and adding an ocf_take_lock for start, stop, monitor, and status operations.
>
> I'm not sure why more people haven't run into these problems before. I hope it's not that i'm doing things wrong, but rather that few others have earnestly tried to build anything quite like this setup. If anyone out there has set up a similar cluster and *not* had these problems, i'd like to know about it. Meanwhile, if others *have* had these problems, i'd also like to know, especially if they've found alternate solutions.

Can't say about 1, I use IET, it doesn't seem to have that limitation.

2 - I use alternative home-brew ms RA which blocks (DROP) both input and output for a specified VIP on demote (targets are configured to be bound to those VIPs). I also export one big LUN per target and then set up a clvm VG on top of it (all initiators are in the same, separate cluster).

3 - can't say as well, IET is probably not affected.

4 - That is true, iptables doesn't have atomic rules management, so you definitely need a mutex or a dispatcher like firewalld (didn't try it though).
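The initiator-side multipath setting mentioned above would presumably sit in /etc/multipath.conf along these lines (a sketch; it can also be set per-device rather than in defaults):

    defaults {
        # queue i/o while no path to the target is available instead of
        # failing it upward, so requests are replayed after failover.
        no_path_retry    queue
    }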