On 2013-11-21 16:34, Jefferson Ogata wrote:
On 2013-11-20 08:35, Jefferson Ogata wrote:
Indeed, using iptables with REJECT and tcp-reset, this seems to piss off
the initiators, creating immediate I/O errors. But one can use DROP on
incoming SYN packets and let established connections drain. I've been
trying to get this to work, but am finding that it takes so long for some
connections to drain that something times out. I haven't given up on
this approach, though. Testing this stuff can be tricky: if I make
one mistake, stonith kicks in and I end up having to wait 5-10 minutes
for the machine to reboot and resync its DRBD devices.
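
For illustration, the SYN-only block amounts to something like this
(a sketch; the tweaked portblock RA manages the actual rule, and the
address/port are just the ones from the config further down):

    # Drop only new connection attempts to the iSCSI portal;
    # established connections keep flowing and can drain.
    iptables -I INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP

whereas REJECT --reject-with tcp-reset kills initiator connections
immediately.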

Follow-up on this: the original race condition I reported still occurs
with this strategy. If existing TCP connections are allowed to drain by
passing packets from established initiator connections (i.e. by blocking
only SYN packets), the initiator can also send new requests to the
target during the takedown process; the takedown removes LUNs from the
live target, and the initiator generates an I/O error if it happens to
access a LUN that has been removed before the connection is torn down.

This happens because the configuration looks something like this (crm):

group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock

On takedown, if portblock is tweaked to pass packets for existing
connections so they can drain, there's a window while LUNs lun3, lun2,
lun1 are being removed from the target where this race condition occurs.
The connection isn't removed until iSCSITarget runs to stop the target.

A way to handle this that should actually work is to write a new RA that
deletes the connections from the target *before* the LUNs are removed
during takedown. The config would look something like this, then:

group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock

On takedown, then (group members stop in reverse order), portunblock
stops first and blocks new incoming connections, tgtConnections shuts
down existing connections and waits for them to drain, and then the LUNs
can be safely removed before the target is taken down.
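
The per-connection teardown this RA needs to perform is essentially the
following (a sketch; the sid/cid values are placeholders, discovered in
practice by parsing tgtadm output, as the attached RA does):

    # Close one initiator connection on target id 1.
    tgtadm --lld iscsi --op delete --mode connection --tid=1 --sid=2 --cid=0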

I'll write this RA today and see how that works.

So, this strategy worked. The final RA is attached (I ended up calling
it tgtFinal rather than tgtConnections). The config (crm) then looks
like this, using the tweaked portblock RA that blocks SYN only, the
tgtUser RA that adds a tgtd user, and the tweaked iSCSITarget RA that
doesn't add a user if no password is provided (see previous discussion
for the latter two RAs). This is a two-node cluster using DRBD-backed
LVMs and multiple targets. The names have been changed to protect the
innocent, and the config is simplified to a single target for brevity,
but it should be clear how to do multiple DRBDs/VGs/targets. I've also
left out the stonith config.


primitive tgtd lsb:tgtd op monitor interval="10s"
clone clone.tgtd tgtd
primitive user.username ocf:local:tgtUser params username="username" password="password"
clone clone.user.username user.username
order clone.tgtd_before_clone.user.username inf: clone.tgtd:start clone.user.username:start
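
(For context, the tgtUser RA from the earlier thread, not reproduced
here, presumably wraps tgt's account operations, roughly:

    # Sketch: create the CHAP account that target.1 below references
    # via incoming_username.
    tgtadm --lld iscsi --op new --mode account --user username --password password
)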

primitive drbd.pv1 ocf:linbit:drbd params drbd_resource="pv1" op monitor role="Slave" interval="29s" timeout="600s" op monitor role="Master" interval="31s" timeout="600s" op start timeout="240s" op stop timeout="240s"
ms ms.drbd.pv1 drbd.pv1 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive lvm.vg1 ocf:heartbeat:LVM params volgrpname="vg1" op monitor interval="30s" timeout="30s" op start timeout="30s" op stop timeout="30s"
order ms.drbd.pv1_before_lvm.vg1 inf: ms.drbd.pv1:promote lvm.vg1:start
colocation ms.drbd.pv1_with_lvm.vg1 inf: ms.drbd.pv1:Master lvm.vg1

primitive target.1 ocf:local:iSCSITarget params iqn="iqn....t1" tid="1" incoming_username="username" implementation="tgt" portals="" op monitor interval="30s" op start timeout="30s" op stop timeout="120s"
primitive lun.1.1 ocf:heartbeat:iSCSILogicalUnit params target_iqn="iqn....t1" lun="1" path="/dev/vg1/lv1" additional_parameters="scsi_id=vg1/lv1 mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0" implementation="tgt" op monitor interval="30s" op start timeout="30s" op stop timeout="120s"
primitive ip.192.168.1.244 ocf:heartbeat:IPaddr params ip="192.168.1.244" cidr_netmask="24" nic="bond0"
primitive portblock.ip.192.168.1.244 ocf:local:portblock params ip="192.168.1.244" action="block" protocol="tcp" portno="3260" syn_only="true" op monitor interval="10s" timeout="10s" depth="0"
primitive tgtfinal.1 ocf:local:tgtFinal params tid="1" op monitor interval="30s" timeout="30s" op stop timeout="60s"
primitive portunblock.ip.192.168.1.244 ocf:local:portblock params ip="192.168.1.244" action="unblock" protocol="tcp" portno="3260" syn_only="true" op monitor interval="10s" timeout="10s" depth="0"

group group.target.1 lvm.vg1 portblock.ip.192.168.1.244 ip.192.168.1.244 target.1 lun.1.1 tgtfinal.1 portunblock.ip.192.168.1.244
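
Note the member ordering; on stop the group unwinds in reverse, which is
what makes the drain window safe:

    # Stop sequence on takedown (reverse of group order):
    #   portunblock re-blocks new SYNs, tgtfinal drains connections,
    #   then lun.1.1, target.1, the IP, portblock (unblocks), lvm.vg1.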

order clone.tgtd_before_group.target.1 inf: clone.tgtd:start group.target.1:start
order clone.user.username_before_group.target.1 inf: clone.user.username:start group.target.1:start


BTW, I generate the crm config from a simpler XML representation using XSLT, so there's no redundancy in the source config and less opportunity for error.

This setup seems to be pretty bulletproof. I tested it with six targets
being continually pounded on by 12 VMs; it survives fencing of a node
with no corruption, and it fails over quickly, with no more hung tgtd
connections. The two-node setup of a couple of Dell R720s with some
attached MD1200s, a 2x10 GbE LACP node interconnect, and a 10 GbE
connection from each node to the VM infrastructure handles > 1 GiB/s
read bandwidth and around 500 MiB/s write. Latency is not bad, but might
be improved using an InfiniBand interconnect. The main thing is that it
seems to be solid: it doesn't hang on failover, nor does it cause I/O
errors on initiators when active LUNs disappear, the way the tgt/drbd
configs I've found documented elsewhere (e.g. Linbit's cookbook) do
under load.
#!/bin/sh
#
# Resource agent for terminating tgtd connections on stop, to be used before
# deleting LUNs, to avoid race conditions with initiators
#
# Description:  On stop, removes all active tgtd connections for given target
#
# Author: Jefferson Ogata <jefferson.og...@noaa.gov>
# License: GNU General Public License (GPL) 
#
# Some bits stolen from the iSCSITarget RA by Florian Haas, Dejan Muhamedagic,
# and Linux-HA contributors.
#
#
#       usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
#       The "start" arg does nothing.
#
#       The "stop" arg terminates connections. Should be wrapped in portblock.
#
# OCF parameters:
# OCF_RESKEY_tid

##########################################################################
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";

##########################################################################

usage() 
{
    echo $USAGE >&2
}

meta_data() 
{
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tgtFinal">
<version>1.0</version>
<longdesc lang="en">
This pseudo-resource script deletes tgtd iSCSI connections on stop.
</longdesc>
<shortdesc lang="en">Deletes tgtd connections on stop</shortdesc>

<parameters>

<parameter name="tid">
<longdesc lang="en">
The target id.
</longdesc>
<shortdesc lang="en">target id</shortdesc>
<content type="string" default=""/>
</parameter>

</parameters>

<actions>
<action name="start" timeout="10s"/>
<action name="stop" timeout="60s"/>
<action name="monitor" depth="0" timeout="10s" interval="60s" />
<action name="validate-all" timeout="10s"/>
<action name="meta-data"  timeout="5s"/>
</actions>
</resource-agent>
END
    return $OCF_SUCCESS
}


start()
{
    ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" start
    return $OCF_SUCCESS
}


stop()
{
    ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" stop

    tid="${OCF_RESKEY_tid}"
    while true; do
        # Close existing connections. There is no other way to
        # do this in tgt than to parse the output of "tgtadm --op
        # show".
        set -- $(tgtadm --lld iscsi --op show --mode target \
            | sed -ne '/^Target '${tid}':/,/^Target/ {
                  /^[[:space:]]*I_T nexus: \([0-9]\+\)/ {
                     s/^.*: \([0-9]*\).*/--sid=\1/; h;
                  };
                  /^[[:space:]]*Connection: \([0-9]\+\)/ { 
                      s/^.*: \([0-9]*\).*/--cid=\1/; G; p; 
                  }; 
                  /^[[:space:]]*LUN information:/ q; 
              }')
        # Use POSIX test, not [[ ]]: the shebang is /bin/sh.
        if [ -z "$2" ]; then
            # No connections left.
            ocf_log debug "No remaining connections for tid ${tid}"
            break
        fi
        while [ -n "$2" ]; do
            # $2 $1 looks like "--sid=X --cid=Y"
            ocf_log debug "Deleting --tid=${tid} $2 $1"
            ocf_run tgtadm --lld iscsi --op delete --mode connection \
                --tid=${tid} $2 $1
            shift 2
        done
        # Brief pause before re-checking: a connection can linger in the
        # listing while its outstanding commands drain, and we don't want
        # to busy-loop hammering tgtadm.
        sleep 1
    done

    return $OCF_SUCCESS
}


status()
{
    local rc
    rc=$OCF_ERR_GENERIC
    if ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" status; then
        echo "${OCF_RESOURCE_INSTANCE} is running"
        rc=$OCF_SUCCESS
    else
        echo "${OCF_RESOURCE_INSTANCE} is stopped"
        rc=$OCF_NOT_RUNNING
    fi
    return $rc
}

validate_all()
{
    if [ -z "$OCF_RESKEY_tid" ]; then
        ocf_log err "target id not specified"
        exit $OCF_ERR_ARGS
    fi
    return $OCF_SUCCESS
}


#
# Main
#
 
if [ $# -ne 1 ]; then
        usage
        exit $OCF_ERR_ARGS
fi

case $1 in
        start)          start
                        exit $?
                        ;;
        
        stop)           stop
                        exit $?
                        ;;

        status)         status
                        exit $?
                        ;;

        monitor)        status
                        exit $?
                        ;;

        validate-all)   validate_all
                        exit $?
                        ;;

        meta-data)      meta_data
                        exit $?
                        ;;

        usage)  usage
                exit $OCF_SUCCESS
                ;;

        *)      usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac
