On 2013-11-21 16:34, Jefferson Ogata wrote:
On 2013-11-20 08:35, Jefferson Ogata wrote:
Indeed, using iptables with REJECT and tcp-reset, this seems to piss off
the initiators, creating immediate i/o errors. But one can use DROP on
incoming SYN packets and let established connections drain. I've been
trying to get this to work but am finding that it takes so long for some
connections to drain that something times out. I haven't given up on
this approach, though. Testing this stuff can be tricky: if i make one
mistake, stonith kicks in and i end up having to wait 5-10 minutes for
the machine to reboot and resync its DRBD devices.
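For concreteness, the SYN-only block amounts to iptables rules along
these lines (a sketch only; the tweaked portblock RA used below does the
equivalent itself, keyed to the portal VIP, with 3260 being the iSCSI
port):

# Refuse new connections silently; established connections keep
# flowing and can drain. (Illustrative; portblock manages its own rules.)
iptables -I INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP
# Remove the rule to admit new connections again:
iptables -D INPUT -p tcp -d 192.168.1.244 --dport 3260 --syn -j DROP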
Follow-up on this: the original race condition i reported still occurs
with this strategy. If existing TCP connections are allowed to drain by
passing packets from established initiator connections (i.e. blocking
only SYN packets), then the initiator can also send new requests to the
target during the takedown process. The takedown removes LUNs from the
live target, and the initiator generates an i/o error if it happens to
access a LUN that has been removed before the connection itself is torn
down.
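For tgt, the LUN removal that races with in-flight i/o boils down to
commands like this (a sketch; the tid/lun values are illustrative):

# Delete a LUN from a live target while the initiator's TCP
# connection is still up; any subsequent access to it errors out.
tgtadm --lld iscsi --op delete --mode logicalunit --tid 1 --lun 3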
This happens because the configuration looks something like this (crm):
group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 portunblock
On takedown, if portblock is tweaked to pass packets for existing
connections so they can drain, there's a window, while LUNs lun3, lun2,
and lun1 are being removed from the target, where this race condition
occurs: the connection isn't removed until iSCSITarget runs to stop the
target.
A way to handle this that should actually work is to write a new RA that
deletes the connections from the target *before* the LUNs are removed
during takedown. The config would look something like this, then:
group foo portblock vip iSCSITarget:target iSCSILogicalUnit:lun1
iSCSILogicalUnit:lun2 iSCSILogicalUnit:lun3 tgtConnections portunblock
On takedown, then, portunblock stops first and re-blocks new incoming
connections (Pacemaker stops group members in reverse order, and
stopping an unblock-mode portblock re-arms the block), tgtConnections
shuts down existing connections and waits for them to drain, and then
the LUNs can be safely removed before the target is taken down.
I'll write this RA today and see how that works.
So, this strategy worked. The final RA is attached. The config (crm)
then looks like this, using the tweaked portblock RA that blocks SYN
only, the tgtUser RA that adds a tgtd user, and the tweaked iSCSITarget
RA that doesn't add a user if no password is provided (see previous
discussion for the latter two RAs). This is a two-node cluster using
DRBD-backed LVM and multiple targets. The names have been changed to
protect the innocent, and the config is simplified to a single target
for brevity, but it should be clear how to do multiple
DRBDs/VGs/targets. I've left out the stonith config here also.
primitive tgtd lsb:tgtd op monitor interval="10s"
clone clone.tgtd tgtd
primitive user.username ocf:local:tgtUser params username="username"
password="password"
clone clone.user.username user.username
order clone.tgtd_before_clone.user.username inf: clone.tgtd:start
clone.user.username:start
primitive drbd.pv1 ocf:linbit:drbd params drbd_resource="pv1" op monitor
role="Slave" interval="29s" timeout="600s" op monitor role="Master"
interval="31s" timeout="600s" op start timeout="240s" op stop timeout="240s"
ms ms.drbd.pv1 drbd.pv1 meta master-max="1" master-node-max="1"
clone-max="2" clone-node-max="1" notify="true"
primitive lvm.vg1 ocf:heartbeat:LVM params volgrpname="vg1" op monitor
interval="30s" timeout="30s" op start timeout="30s" op stop timeout="30s"
order ms.drbd.pv1_before_lvm.vg1 inf: ms.drbd.pv1:promote lvm.vg1:start
colocation ms.drbd.pv1_with_lvm.vg1 inf: ms.drbd.pv1:Master lvm.vg1
primitive target.1 ocf:local:iSCSITarget params iqn="iqn....t1" tid="1"
incoming_username="username" implementation="tgt" portals="" op monitor
interval="30s" op start timeout="30s" op stop timeout="120s"
primitive lun.1.1 ocf:heartbeat:iSCSILogicalUnit params
target_iqn="iqn....t1" lun="1" path="/dev/vg1/lv1"
additional_parameters="scsi_id=vg1/lv1
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0"
implementation="tgt" op monitor interval="30s" op start timeout="30s" op
stop timeout="120s"
primitive ip.192.168.1.244 ocf:heartbeat:IPaddr params
ip="192.168.1.244" cidr_netmask="24" nic="bond0"
primitive portblock.ip.192.168.1.244 ocf:local:portblock params
ip="192.168.1.244" action="block" protocol="tcp" portno="3260"
syn_only="true" op monitor interval="10s" timeout="10s" depth="0"
primitive tgtfinal.1 ocf:local:tgtFinal params tid="1" op monitor
interval="30s" timeout="30s" op stop timeout="60s"
primitive portunblock.ip.192.168.1.244 ocf:local:portblock params
ip="192.168.1.244" action="unblock" protocol="tcp" portno="3260"
syn_only="true" op monitor interval="10s" timeout="10s" depth="0"
group group.target.1 lvm.vg1 portblock.ip.192.168.1.244
ip.192.168.1.244 target.1 lun.1.1 tgtfinal.1 portunblock.ip.192.168.1.244
order clone.tgtd_before_group.target.1 inf: clone.tgtd:start
group.target.1:start
order clone.user.username_before_group.target.1 inf:
clone.user.username:start group.target.1:start
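During failover testing it's handy to watch the connections drain on the
active node; the same tgtadm query the new RA parses works
interactively:

watch -n1 'tgtadm --lld iscsi --op show --mode target | grep -E "^Target|I_T nexus:|Connection:"'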
BTW, i generate the crm config from a simpler XML representation using
XSLT, so there's no redundancy in the source config and less opportunity
for error.
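The schema and stylesheet aren't included here; the workflow is roughly
this (file names hypothetical):

xsltproc cluster.xsl cluster.xml > cluster.crm
crm configure load update cluster.crm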
This setup seems to be pretty bulletproof. I tested it with six targets
being continually pounded on by 12 VMs; it survives fencing of a node
with no corruption and fails over quickly--no more hung tgtd
connections. The two-node setup, a couple of Dell R720s with attached
MD1200s, a 2x10 GbE LACP node interconnect, and a 10 GbE connection
from each node to the VM infrastructure, handles > 1 GiB/s read
bandwidth and around 500 MiB/s write. Latency is not bad but might be
improved using an InfiniBand interconnect. The main thing is that it
seems to be solid and
it doesn't hang on failover, nor does it cause I/O errors on initiators
when active LUNs disappear, the way the tgt/drbd configs i've found
documented elsewhere (e.g. Linbit's cookbook) do under load.
#!/bin/sh
#
# Resource agent for terminating tgtd connections on stop, to be used before
# deleting LUNs, to avoid race conditions with initiators
#
# Description: On stop, removes all active tgtd connections for given target
#
# Author: Jefferson Ogata <jefferson.og...@noaa.gov>
# License: GNU General Public License (GPL)
#
# Some bits stolen from the iSCSITarget RA by Florian Haas, Dejan Muhamedagic,
# and Linux-HA contributors.
#
#
# usage: $0 {start|stop|status|monitor|validate-all|meta-data}
#
# The "start" arg does nothing.
#
# The "stop" arg terminates connections. Should be wrapped in portblock.
#
# OCF parameters:
# OCF_RESKEY_tid
##########################################################################
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
USAGE="Usage: $0 {start|stop|status|monitor|validate-all|meta-data}";
##########################################################################
usage()
{
	echo "$USAGE" >&2
}
meta_data()
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tgtUser">
<version>1.0</version>
<longdesc lang="en">
This pseudo-resource script deletes tgtd iSCSI connections on stop.
</longdesc>
<shortdesc lang="en">Deletes tgtd connections on stop</shortdesc>
<parameters>
<parameter name="tid">
<longdesc lang="en">
The target id.
</longdesc>
<shortdesc lang="en">target id</shortdesc>
<content type="string" default=""/>
</parameter>
</parameters>
<actions>
<action name="start" timeout="10s"/>
<action name="stop" timeout="60s"/>
<action name="monitor" depth="0" timeout="10s" interval="60s" />
<action name="validate-all" timeout="10s"/>
<action name="meta-data" timeout="5s"/>
</actions>
</resource-agent>
END
	return $OCF_SUCCESS
}
start()
{
	ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" start
	return $OCF_SUCCESS
}
stop()
{
	ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" stop
	tid="${OCF_RESKEY_tid}"
	while true; do
		# Close existing connections. There is no other way to
		# do this in tgt than to parse the output of "tgtadm --op
		# show".
		set -- $(tgtadm --lld iscsi --op show --mode target \
			| sed -ne '/^Target '${tid}':/,/^Target/ {
				/^[[:space:]]*I_T nexus: \([0-9]\+\)/ {
					s/^.*: \([0-9]*\).*/--sid=\1/; h;
				};
				/^[[:space:]]*Connection: \([0-9]\+\)/ {
					s/^.*: \([0-9]*\).*/--cid=\1/; G; p;
				};
				/^[[:space:]]*LUN information:/ q;
			}')
		if [ -z "$2" ]; then
			# No connections left.
			ocf_log debug "No remaining connections for tid ${tid}"
			break
		fi
		while [ -n "$2" ]; do
			# $2 $1 looks like "--sid=X --cid=Y"
			ocf_log debug "Deleting --tid=${tid} $2 $1"
			ocf_run tgtadm --lld iscsi --op delete --mode connection \
				--tid=${tid} $2 $1
			shift 2
		done
		# Brief pause so we don't spin while tgtd finishes tearing
		# the connections down before we re-check for survivors.
		sleep 1
	done
	return $OCF_SUCCESS
}
status()
{
	local rc
	rc=$OCF_ERR_GENERIC
	if ha_pseudo_resource "${OCF_RESOURCE_INSTANCE}" status; then
		echo "${OCF_RESOURCE_INSTANCE} is running"
		rc=$OCF_SUCCESS
	else
		echo "${OCF_RESOURCE_INSTANCE} is stopped"
		rc=$OCF_NOT_RUNNING
	fi
	return $rc
}
validate_all()
{
	if [ -z "$OCF_RESKEY_tid" ]; then
		ocf_log err "target id not specified"
		exit $OCF_ERR_ARGS
	fi
	return $OCF_SUCCESS
}
#
# Main
#
if [ $# -ne 1 ]; then
	usage
	exit $OCF_ERR_ARGS
fi
case $1 in
start)	start
	exit $?
	;;
stop)	stop
	exit $?
	;;
status)	status
	exit $?
	;;
monitor)	status
	exit $?
	;;
validate-all)	validate_all
	exit $?
	;;
meta-data)	meta_data
	exit $?
	;;
usage)	usage
	exit $OCF_SUCCESS
	;;
*)	usage
	exit $OCF_ERR_UNIMPLEMENTED
	;;
esac
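The script is installed on both nodes as the ocf:local:tgtFinal agent
referenced in the config above; assuming the stock OCF root, that means:

install -m 0755 tgtFinal /usr/lib/ocf/resource.d/local/tgtFinal

# Quick sanity check outside Pacemaker:
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_tid=1 OCF_RESOURCE_INSTANCE=tgtfinal.1 \
    /usr/lib/ocf/resource.d/local/tgtFinal validate-all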