On 2013-11-21 16:34, Jefferson Ogata wrote:
On 2013-11-20 08:35, Jefferson Ogata wrote:
Indeed, using iptables with REJECT and tcp-reset, this seems to piss off
the initiators, creating immediate i/o errors. But one can use DROP on
incoming SYN packets and let established connections drain. I've been
trying to get this to work but am finding that it takes so long for some
connections to drain that something times out. I haven't given up on
this approach, though. Testing this stuff can be tricky: if i make one
mistake, stonith kicks in and i end up having to wait 5-10 minutes for
the machine to reboot and resync its DRBD devices.
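For concreteness, the SYN-only block amounts to iptables rules along
these lines (a sketch only; the tweaked portblock RA used below does the
equivalent itself, keyed to the portal VIP, with 3260 being the iSCSI
port):

# Refuse new connections silently; established connections keep
# flowing and can drain. (Illustrative; portblock manages its own rules.)
iptables -I INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP
# Remove the rule to admit new connections again:
iptables -D INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP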
Follow-up on this: the original race condition i reported still occurs
with this strategy. If existing TCP connections are allowed to drain by
passing packets from established initiator connections (i.e. blocking
only SYN packets), then the initiator can also send new requests to the
target during the takedown process. The takedown removes LUNs from the
live target, and the initiator generates an i/o error if it happens to
access a LUN that has been removed before the connection itself is torn
down.
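For tgt, the LUN removal that races with in-flight i/o boils down to
commands like this (a sketch; the tid/lun values are illustrative):

# Delete a LUN from a live target while the initiator's TCP
# connection is still up; any subsequent access to it errors out.
tgtadm --lld iscsi --op delete --mode logicalunit --tid 1 --lun 3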
This happens because the configuration looks something like this (crm):
group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock
On takedown, if portblock is tweaked to pass packets for existing
connections so they can drain, there's a window, while LUNs lun3, lun2,
and lun1 are being removed from the target, where this race condition
occurs: the connection isn't removed until iSCSITarget runs to stop the
target.
A way to handle this that should actually work is to write a new RA that
deletes the connections from the target *before* the LUNs are removed
during takedown. The config would look something like this, then:
group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock
On takedown, then, portunblock stops first and re-blocks new incoming
connections (Pacemaker stops group members in reverse order, and
stopping an unblock-mode portblock re-arms the block), tgtConnections
shuts down existing connections and waits for them to drain, and then
the LUNs can be safely removed before the target is taken down.
I'll write this RA today and see how that works.
So, this strategy worked. The final RA is attached. The config (crm)
then looks like this, using the tweaked portblock RA that blocks SYN
only, the tgtUser RA that adds a tgtd user, and the tweaked iSCSITarget
RA that doesn't add a user if no password is provided (see previous
discussion for the latter two RAs). This is a two-node cluster using
DRBD-backed LVM and multiple targets. The names have been changed to
protect the innocent, and the config is simplified to a single target
for brevity, but it should be clear how to do multiple
DRBDs/VGs/targets. I've left out the stonith config here also.
primitive tgtd lsb:tgtd op monitor interval="10s"
clone clone.tgtd tgtd
primitive user.username ocf:local:tgtUser params username="username"
password="password"
clone clone.user.username user.username
order clone.tgtd_before_clone.user.username inf: clone.tgtd:start
clone.user.username:start
primitive drbd.pv1 ocf:linbit:drbd params drbd_resource="pv1" op monitor
role="Slave" interval="29s" timeout="600s" op monitor role="Master"
interval="31s" timeout="600s" op start timeout="240s" op stop timeout="240s"
ms ms.drbd.pv1 drbd.pv1 meta master-max="1" master-node-max="1"
clone-max="2" clone-node-max="1" notify="true"
primitive lvm.vg1 ocf:heartbeat:LVM params volgrpname="vg1" op monitor
interval="30s" timeout="30s" op start timeout="30s" op stop timeout="30s"
order ms.drbd.pv1_before_lvm.vg1 inf: ms.drbd.pv1:promote lvm.vg1:start
colocation ms.drbd.pv1_with_lvm.vg1 inf: ms.drbd.pv1:Master lvm.vg1
primitive target.1 ocf:local:iSCSITarget params iqn="iqn....t1" tid="1"
incoming_username="username" implementation="tgt" portals="" op monitor
interval="30s" op start timeout="30s" op stop timeout="120s"
primitive lun.1.1 ocf:heartbeat:iSCSILogicalUnit params
target_iqn="iqn....t1" lun="1" path="/dev/vg1/lv1"
additional_parameters="scsi_id=vg1/lv1
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0"
implementation="tgt" op monitor interval="30s" op start timeout="30s" op
stop timeout="120s"
primitive ip.192.168.1.244 ocf:heartbeat:IPaddr params
ip="192.168.1.244" cidr_netmask="24" nic="bond0"
primitive portblock.ip.192.168.1.244 ocf:local:portblock params
ip="192.168.1.244" action="block" protocol="tcp" portno="3260"
syn_only="true" op monitor interval="10s" timeout="10s" depth="0"
primitive tgtfinal.1 ocf:local:tgtFinal params tid="1" op monitor
interval="30s" timeout="30s" op stop timeout="60s"
primitive portunblock.ip.192.168.1.244 ocf:local:portblock params
ip="192.168.1.244" action="unblock" protocol="tcp" portno="3260"
syn_only="true" op monitor interval="10s" timeout="10s" depth="0"
group group.target.1 lvm.vg1 portblock.ip.192.168.1.244
ip.192.168.1.244 target.1 lun.1.1 tgtfinal.1 portunblock.ip.192.168.1.244
order clone.tgtd_before_group.target.1 inf: clone.tgtd:start
group.target.1:start
order clone.user.username_before_group.target.1 inf:
clone.user.username:start group.target.1:start
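During failover testing it's handy to watch the connections drain on the
active node; the same tgtadm query the new RA parses works
interactively:

watch -n1 'tgtadm --lld iscsi --op show --mode target | grep -E "^Target|I_T nexus:|Connection:"'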
BTW, i generate the crm config from a simpler XML representation using
XSLT, so there's no redundancy in the source config and less opportunity
for error.
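The schema and stylesheet aren't included here; the workflow is roughly
this (file names hypothetical):

xsltproc cluster.xsl cluster.xml > cluster.crm
crm configure load update cluster.crm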
This setup seems to be pretty bulletproof. I tested it with six targets
being continually pounded on by 12 VMs; it survives fencing of a node
with no corruption and fails over quickly--no more hung tgtd
connections. The two-node setup, a couple of Dell R720s with attached
MD1200s, a 2x10 GbE LACP node interconnect, and a 10 GbE connection
from each node to the VM infrastructure, handles > 1 GiB/s read
bandwidth and around 500 MiB/s write. Latency is not bad but might be
improved using an InfiniBand interconnect. The main thing is that it
seems to be solid and
it doesn't hang on failover, nor does it cause I/O errors on initiators
when active LUNs disappear, the way the tgt/drbd configs i've found
documented elsewhere (e.g. Linbit's cookbook) do under load.
#!/bin/sh
#
# Resource agent for terminating tgtd connections on stop, to be used before
# deleting LUNs, to avoid race conditions with initiators
#
# Description: On stop, removes all active tgtd connections for given target
#
# Author: Jefferson Ogata <jefferson.og...@noaa.gov>
# License: GNU General Public License (GPL)
#
# Some bits stolen from the iSCSITarget RA by Florian Haas, Dejan Muhamedagic,
# and Linux-HA contributors.
#
#
# usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
# The "start" arg does nothing.
#
# The "stop" arg terminates connections. Should be wrapped in portblock.
#
# OCF parameters:
# OCF_RESKEY_tid
##########################################################################
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";
##########################################################################
usage()
{
	echo "$USAGE" >&2
}
meta_data()
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tgtUser">
<version>1.0</version>
<longdesc lang="en">
This pseudo-resource script deletes tgtd iSCSI connections on stop.
</longdesc>
<shortdesc lang="en">Deletes tgtd connections on stop</shortdesc>
<parameters>
<parameter name="tid">
<longdesc lang="en">
The target id.
</longdesc>
<shortdesc lang="en">target id</shortdesc>
<content type="string" default=""/>
</parameter>
</parameters>
<actions>
<action name="start" timeout="10s"/>
<action name="stop" timeout="60s"/>
<action name="monitor" depth="0" timeout="10s" interval="60s" />
<action name="validate-all" timeout="10s"/>
<action name="meta-data" timeout="5s"/>
</actions>
</resource-agent>
END
	return $OCF_SUCCESS
}
start()
{
	ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" start
	return $OCF_SUCCESS
}
stop()
{
	ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" stop
	tid="${OCF_RESKEY_tid}"
	while true; do
		# Close existing connections. There is no other way to
		# do this in tgt than to parse the output of "tgtadm --op
		# show".
		set -- $(tgtadm --lld iscsi --op show --mode target \
			| sed -ne '/^Target '${tid}':/,/^Target/ {
				/^[[:space:]]*I_T nexus: \([0-9]\+\)/ {
					s/^.*: \([0-9]*\).*/--sid=\1/; h;
				};
				/^[[:space:]]*Connection: \([0-9]\+\)/ {
					s/^.*: \([0-9]*\).*/--cid=\1/; G; p;
				};
				/^[[:space:]]*LUN information:/ q;
			}')
		if [ -z "$2" ]; then
			# No connections left.
			ocf_log debug "No remaining connections for tid ${tid}"
			break
		fi
		while [ -n "$2" ]; do
			# $2 $1 looks like "--sid=X --cid=Y"
			ocf_log debug "Deleting --tid=${tid} $2 $1"
			ocf_run tgtadm --lld iscsi --op delete --mode connection \
				--tid=${tid} $2 $1
			shift 2
		done
		# Brief pause so we don't spin while tgtd finishes tearing
		# the connections down before we re-check for survivors.
		sleep 1
	done
	return $OCF_SUCCESS
}
status()
{
	local rc
	rc=$OCF_ERR_GENERIC
	if ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" status; then
		echo "${OCF_RESOURCE_INSTANCE} is running"
		rc=$OCF_SUCCESS
	else
		echo "${OCF_RESOURCE_INSTANCE} is stopped"
		rc=$OCF_NOT_RUNNING
	fi
	return $rc
}
validate_all()
{
	if [ -z "$OCF_RESKEY_tid" ]; then
		ocf_log err "target id not specified"
		exit $OCF_ERR_ARGS
	fi
	return $OCF_SUCCESS
}
#
# Main
#
if [ $# -ne 1 ]; then
	usage
	exit $OCF_ERR_ARGS
fi
case $1 in
start)	start
	exit $?
	;;
stop)	stop
	exit $?
	;;
status)	status
	exit $?
	;;
monitor)	status
	exit $?
	;;
validate-all)	validate_all
	exit $?
	;;
meta-data)	meta_data
	exit $?
	;;
usage)	usage
	exit $OCF_SUCCESS
	;;
*)	usage
	exit $OCF_ERR_UNIMPLEMENTED
	;;
esac
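The script is installed on both nodes as the ocf:local:tgtFinal agent
referenced in the config above; assuming the stock OCF root, that means:

install -m 0755 tgtFinal /usr/lib/ocf/resource.d/local/tgtFinal

# Quick sanity check outside Pacemaker:
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_tid=1 OCF_RESOURCE_INSTANCE=tgtfinal.1 \
    /usr/lib/ocf/resource.d/local/tgtFinal validate-all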