Re: [Pacemaker] [patch] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
On 07/16/2012 01:34 PM, Phil Frost wrote:

I've been doing some study of the iscsi RA since my first post, and it seems to me now that the "failure" in the monitor action isn't actually in the monitor action at all. Rather, it appears that for *all* actions, the RA does a "discovery" step, and that's what is failing. I'm not really sure what this is, or why I need it. Is it simply to find an unspecified portal for a given IQN? Is it therefore useless in my case, since I've explicitly specified the portal in the resource parameters? If I were to disable the "discovery" step, what are people's thoughts on the case where the target is operational, but the initiator for some reason (network failure) can't reach it? In this case, assume Pacemaker knows the target is up; is there a way to encourage it to decide to attempt migrating the initiator to another node?

Well, after reading through the iscsi RA a dozen times, I could not formulate any reasonable idea of why the discovery step might be necessary. The portal parameter is required, so it couldn't be to locate the portal. And, there is logic in the discovery function to handle the case when a target returns multiple portals for the same target -- by finding the one that was specified in the portal parameter. So it can't really be discovering anything. It does raise an error in this case if the portal parameter isn't specified, but then the portal parameter isn't optional, so that case could never occur. It smelled like rotten code to me.

So, given all that, and given how it introduces a nasty race condition in the case that the target isn't running (or is just in the process of migrating to another node), I decided it was better to just get rid of it. Patch attached. I suppose I've introduced a different failure in that an initiator that can't contact a running target won't be migrated, but I'd rather have one of my VMs trying to run, unsuccessfully, and able to automatically recover when the fault is cleared, than have an entire VM host shot in the head on the basis of a race condition in non-failure situations.

One minor nastiness was observed with my patch: if the portal isn't specified exactly as udev will format it, then the RA will wait forever for the device node to appear, expecting the wrong device filename. Maybe canonicalizing the portal was one useful job of the discovery step, but in my opinion, not worth the other problems.

--- heartbeat/iscsi	2012-07-16 13:10:14.0 -0400
+++ macpros/iscsi	2012-07-16 14:50:57.0 -0400
@@ -31,7 +31,6 @@
 # OCF_RESKEY_portal: the iSCSI portal address or host name (required)
 # OCF_RESKEY_target: the iSCSI target (required)
 # OCF_RESKEY_iscsiadm: iscsiadm program path (optional)
-# OCF_RESKEY_discovery_type: discovery type (optional; default: sendtargets)
 #
 # Initialization:
@@ -41,11 +40,9 @@
 # Defaults
 OCF_RESKEY_udev_default="yes"
 OCF_RESKEY_iscsiadm_default="iscsiadm"
-OCF_RESKEY_discovery_type_default="sendtargets"

 : ${OCF_RESKEY_udev=${OCF_RESKEY_udev_default}}
 : ${OCF_RESKEY_iscsiadm=${OCF_RESKEY_iscsiadm_default}}
-: ${OCF_RESKEY_discovery_type=${OCF_RESKEY_discovery_type_default}}

 usage() {
 	methods=`iscsi_methods`
@@ -96,15 +93,6 @@
-
-
-Target discovery type. Check the open-iscsi documentation for
-supported discovery types.
-
-Target discovery type
-
-
-
 open-iscsi administration utility binary.
@@ -128,8 +116,8 @@
-
-
+
+
@@ -166,7 +154,6 @@
 	fi
 }
 open_iscsi_setup() {
-	discovery=open_iscsi_discovery
 	add_disk=open_iscsi_add
 	remove_disk=open_iscsi_remove
 	disk_status=open_iscsi_status
@@ -179,72 +166,6 @@
 	return $OCF_ERR_INSTALLED
 }
-#
-# discovery return codes:
-# 0: ok (variable portal set)
-# 1: target not found
-# 2: target found but can't connect it unambigously
-# 3: iscsiadm returned error
-#
-# open-iscsi >= "2.0-872" changed discovery semantics
-# see http://www.mail-archive.com/open-iscsi@googlegroups.com/msg04883.html
-# there's a new discoverydb command which should be used instead discovery
-#
-open_iscsi_discovery() {
-	local output
-	local severity=err
-	local discovery_variant="discovery"
-	local options=""
-	local cmd
-	local version=`$iscsiadm --version | awk '{print $3}'`
-
-	ocf_version_cmp "$version" "2.0-871"
-	if [ $? -eq 2 ]; then # newer than 2.0-871?
-		discovery_variant="discoverydb"
-		[ "$discovery_type" = "sendtargets" ] &&
-			options="-D"
-	fi
-	cmd="$iscsiadm -m $discovery_variant -p $OCF_RESKEY_portal -t $discovery_type $options"
-	ocf_is_probe && severity=info
-	output=`$cmd`
-	if [ $? -ne 0 -o x = "x$output" ]; then
-		[ x != "x$output" ] && {
-			ocf_log $severity "$cmd FAILED"
-			echo "$output"
-		}
-		return 3
-	fi
-	portal=`echo "$output" |
-		awk -v target="$OCF_RESKEY_target" '
-		$NF==target{
-			if( NF==3 ) portal=$2; # sles compat mode
-			else portal=$1;
-			sub(",.*","",portal);
-			print portal;
-		}'`
-
-	case `echo "$portal" | wc
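(Not part of the patch above -- just an illustrative sketch of what the RA is left to rely on once discovery is removed: with the portal and target both given explicitly, the node record can be created, logged in, and checked without any discovery step. The portal and IQN below are placeholders.)

    # create the node record for the explicitly configured portal/target
    iscsiadm -m node -o new -T iqn.2012-07.com.example:tgt1 -p 192.168.1.10:3260
    # log in to the target
    iscsiadm -m node -T iqn.2012-07.com.example:tgt1 -p 192.168.1.10:3260 --login
    # status check: is there an active session for this target?
    iscsiadm -m session | grep -q iqn.2012-07.com.example:tgt1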
Re: [Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
Hello,

For the last couple of months I've been busy setting up highly available iSCSI targets and using the iSCSI LUNs in a KVM virtualisation cluster. The iSCSI targets are set up using Pacemaker, DRBD, stgt and CentOS 6. Unfortunately I have not had the time to compile and test LIO (linux-iscsi.org), so I haven't been able to use SCSI-3 persistent reservation fencing on the virtualisation hosts: http://linux.die.net/man/8/fence_scsi (LIO does support this). But the iSCSI target clusters with tgt will do.

I have detailed notes on all the configuration steps to set up a cluster; they contain some information about our internal network and are therefore not suited for public use. (I'm planning to publish a public version of the document some time this year.) I can send you my configuration document if you would like to see it.

As for the virtualisation cluster: at first I set up a Pacemaker cluster, but it lacked support for a cluster filesystem to host the virtual machine config files. All VirtualDomains must be "defined" using virsh, which means a local copy is stored at /etc/libvirt/qemu on each cluster node. (I just want a single location to store the configuration files.) Therefore I switched to a cman/rgmanager cluster for virtualisation. The VMs use iSCSI LUNs for storage and I have set up a GFS2 filesystem to host the VM config files.

One drawback of using rgmanager is that when a cluster node shuts down, the VMs running on that node are shut down and restarted on a remaining cluster node. I have written a bash script to live migrate the VMs to the remaining cluster nodes (a sketch of the idea appears after the quoted message below). The notes on setting up the virtualisation cluster are not completely finished, but again, if you would like to see them I can send them to you.

Best regards,

Maurits van de Lande | Van de Lande BV. | T +31 (0) 162 516000 | F +31 (0) 162 521417 | www.vdl-fittings.com |

From: Phil Frost [p...@macprofessionals.com]
Sent: Monday, 16 July 2012 19:34
To: Digimer
CC: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators

On 07/16/2012 01:14 PM, Digimer wrote:
> I've only tested this a little, so please take it as a general
> suggestion rather than strong advice.
>
> I created a two-node cluster, using red hat's high-availability add-on,
> using DRBD to keep the data replicated between the two "SAN" nodes and
> tgtd to export the LUNs. I had a virtual IP on the cluster to act as the
> target IP and I had DRBD in dual-primary mode with clustered LVM (so I
> had DRBD as the PV and exported the space from the LVs).
>
> Then I built a second cluster of five nodes to host KVM VMs. The
> underlying nodes used clustered LVM as well, but this time the LUNs was
> the PV. I carved this up into an LV per VM and made the VMs the HA
> service. Again using RH HA-Addon.
>
> In this setup, I was able to fail over the SAN without losing any VMs. I
> even messed up the fencing on the SAN cluster once, which meant it took
> >30s to fail over, and I didn't lose the VMs. So to the minimal extent I
> tested it, it worked excellently.
>
> I have some very rough notes on this setup. They're not fit for public
> consumption at all, but if you'd like I'll send them to you directly.
> They include the configurations which might help as a template or similar.

This sounds similar to what I have, except I'm doing it with only one cluster.
The reason I'm using one cluster is twofold: 1) the storage is replicated between only two nodes, and I wish to avoid a two-node cluster so I can have a useful quorum. 2) my IO load is not high and my budget is low, so the storage nodes could also run VMs and not be overloaded. Having this capability in the event that too many VM nodes have failed is a robustness win.

As I have things configured, *usually* I can initiate a failover of the target, and everything is fine. The problem is when I am unlucky and the initiator monitor action occurs while the target failover is occurring. It's easy to get unlucky if something is horribly wrong, and the target is down longer than a normal failover. It's also possible, though harder, to get unlucky by simply issuing "crm resource migrate iscsitarget" at the right instant. My availability requirements aren't so high that I couldn't deal with the occasional long-term target failure being a special case, but simply performing a planned migration of the target having the potential to uncleanly reboot all the VMs on one node is pretty horrible.

I've been doing some study of the iscsi RA since my first post, and it seems to me now that the "failure" in the monitor action isn't actually in the monitor action at all. Rather, it appears that for *all* actions, the RA does a "discovery" step, and that's what is failing. I'm not really sure what this is, or why I need it. Is it simply to find an unspecified portal for a given IQN? Is it therefore useless in my case, since I've explicitly specified the portal in the resource parameters?
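(A minimal sketch of the kind of live-migration script Maurits mentions above, not his actual script; the destination host, the qemu+ssh transport, and the idea of evacuating every running domain are assumptions.)

    #!/bin/bash
    # Evacuate all running VMs from this host before shutdown (illustrative only).
    # DEST is a placeholder for a surviving cluster node.
    DEST="kvm02.example.com"

    for vm in $(virsh list --name); do
        [ -n "$vm" ] || continue
        echo "Live-migrating $vm to $DEST ..."
        virsh migrate --live --persistent "$vm" "qemu+ssh://$DEST/system" \
            || echo "migration of $vm failed" >&2
    done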
Re: [Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
On 07/16/2012 01:14 PM, Digimer wrote:

I've only tested this a little, so please take it as a general suggestion rather than strong advice.

I created a two-node cluster, using red hat's high-availability add-on, using DRBD to keep the data replicated between the two "SAN" nodes and tgtd to export the LUNs. I had a virtual IP on the cluster to act as the target IP and I had DRBD in dual-primary mode with clustered LVM (so I had DRBD as the PV and exported the space from the LVs).

Then I built a second cluster of five nodes to host KVM VMs. The underlying nodes used clustered LVM as well, but this time the LUNs were the PVs. I carved this up into an LV per VM and made the VMs the HA service. Again using RH HA-Addon.

In this setup, I was able to fail over the SAN without losing any VMs. I even messed up the fencing on the SAN cluster once, which meant it took >30s to fail over, and I didn't lose the VMs. So to the minimal extent I tested it, it worked excellently.

I have some very rough notes on this setup. They're not fit for public consumption at all, but if you'd like I'll send them to you directly. They include the configurations which might help as a template or similar.

This sounds similar to what I have, except I'm doing it with only one cluster. The reason I'm using one cluster is twofold: 1) the storage is replicated between only two nodes, and I wish to avoid a two-node cluster so I can have a useful quorum. 2) my IO load is not high and my budget is low, so the storage nodes could also run VMs and not be overloaded. Having this capability in the event that too many VM nodes have failed is a robustness win.

As I have things configured, *usually* I can initiate a failover of the target, and everything is fine. The problem is when I am unlucky and the initiator monitor action occurs while the target failover is occurring. It's easy to get unlucky if something is horribly wrong, and the target is down longer than a normal failover. It's also possible, though harder, to get unlucky by simply issuing "crm resource migrate iscsitarget" at the right instant. My availability requirements aren't so high that I couldn't deal with the occasional long-term target failure being a special case, but simply performing a planned migration of the target having the potential to uncleanly reboot all the VMs on one node is pretty horrible.

I've been doing some study of the iscsi RA since my first post, and it seems to me now that the "failure" in the monitor action isn't actually in the monitor action at all. Rather, it appears that for *all* actions, the RA does a "discovery" step, and that's what is failing. I'm not really sure what this is, or why I need it. Is it simply to find an unspecified portal for a given IQN? Is it therefore useless in my case, since I've explicitly specified the portal in the resource parameters?

If I were to disable the "discovery" step, what are people's thoughts on the case where the target is operational, but the initiator for some reason (network failure) can't reach it? In this case, assume Pacemaker knows the target is up; is there a way to encourage it to decide to attempt migrating the initiator to another node?
Re: [Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
On 07/16/2012 12:08 PM, Phil Frost wrote:
> I'm designing a cluster to run both iSCSI targets and initiators to
> ultimately provide block devices to virtual machines. I'm considering
> the case of a target failure, and how to handle that as gracefully as
> possible. Ideally, IO may be paused until the target recovers, but VMs
> do not restart or see IO errors.
>
> I've observed that the iscsi RA will configure the initiator to retry
> connections indefinitely if the target should fail. This is mostly good,
> except that if the initiator is in the retrying state, the monitor
> action will return an error.
>
> The Right Thing to do in this case, I would think, would be to just
> wait. Of course the initiators can't work if the target is down, but the
> initiators will recover automatically when the target recovers. Ideally
> the cluster would wait for the target (which it also manages) to
> recover, then try again to monitor the initiators. For good measure, it
> might try monitoring the initiators a couple times, since it can take
> them a moment to reconnect.
>
> Unfortunately, what actually happens is the monitor action on the
> initiator fails. Pacemaker then attempts to stop the initiator, and that
> also fails, because the target is still unavailable. Then the initiator
> node gets STONITHed, taking out all the hosted VMs with it.
>
> I added a mandatory, non-symmetrical order constraint of target ->
> initiator, so at least Pacemaker will not attempt to re-start the
> initiator after a target failure. I made it asymetrical so that restarts
> of the target do not force restarts of the initiator. However, it
> doesn't do much to help the failed-target case.
>
> What's a good solution? Is there some way to suspend monitoring of the
> initiators if pacemaker knows the target is failed? I suppose I could
> modify the iscsi RA to return success for monitor in the case that the
> initiator is attempting to reconnect to the target, but then what if
> actually the initiator has failed, and the target is operational? What
> then about race conditions that might exist in cases where the target
> has failed, but pacemaker has not yet detected the target failure though
> a monitor operation?

I've only tested this a little, so please take it as a general suggestion rather than strong advice.

I created a two-node cluster, using red hat's high-availability add-on, using DRBD to keep the data replicated between the two "SAN" nodes and tgtd to export the LUNs. I had a virtual IP on the cluster to act as the target IP and I had DRBD in dual-primary mode with clustered LVM (so I had DRBD as the PV and exported the space from the LVs).

Then I built a second cluster of five nodes to host KVM VMs. The underlying nodes used clustered LVM as well, but this time the LUNs were the PVs. I carved this up into an LV per VM and made the VMs the HA service. Again using RH HA-Addon.

In this setup, I was able to fail over the SAN without losing any VMs. I even messed up the fencing on the SAN cluster once, which meant it took >30s to fail over, and I didn't lose the VMs. So to the minimal extent I tested it, it worked excellently.

I have some very rough notes on this setup. They're not fit for public consumption at all, but if you'd like I'll send them to you directly. They include the configurations which might help as a template or similar.
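(Not from Digimer's notes -- just an illustrative sketch of the tgtd side of such a setup, with one LUN backed by an LV that sits on the dual-primary DRBD device and exported via the cluster's virtual target IP. The IQN, device path and network below are placeholders.)

    # /etc/tgt/targets.conf (illustrative only)
    <target iqn.2012-07.com.example:san.vmstore>
        # LV carved out of the clustered VG on top of DRBD
        backing-store /dev/vg_san/lv_vmstore
        # restrict logins to the KVM hosts' storage network
        initiator-address 192.168.100.0/24
    </target>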
Digimer

--
Digimer
Papers and Projects: https://alteeve.com
Re: [Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
On 07/16/2012 01:14 PM, Digimer wrote: > On 07/16/2012 12:08 PM, Phil Frost wrote: >> I'm designing a cluster to run both iSCSI targets and initiators to >> ultimately provide block devices to virtual machines. I'm considering >> the case of a target failure, and how to handle that as gracefully as >> possible. Ideally, IO may be paused until the target recovers, but VMs >> do not restart or see IO errors. >> >> I've observed that the iscsi RA will configure the initiator to retry >> connections indefinitely if the target should fail. This is mostly good, >> except that if the initiator is in the retrying state, the monitor >> action will return an error. >> >> The Right Thing to do in this case, I would think, would be to just >> wait. Of course the initiators can't work if the target is down, but the >> initiators will recover automatically when the target recovers. Ideally >> the cluster would wait for the target (which it also manages) to >> recover, then try again to monitor the initiators. For good measure, it >> might try monitoring the initiators a couple times, since it can take >> them a moment to reconnect. >> >> Unfortunately, what actually happens is the monitor action on the >> initiator fails. Pacemaker then attempts to stop the initiator, and that >> also fails, because the target is still unavailable. Then the initiator >> node gets STONITHed, taking out all the hosted VMs with it. >> >> I added a mandatory, non-symmetrical order constraint of target -> >> initiator, so at least Pacemaker will not attempt to re-start the >> initiator after a target failure. I made it asymetrical so that restarts >> of the target do not force restarts of the initiator. However, it >> doesn't do much to help the failed-target case. >> >> What's a good solution? Is there some way to suspend monitoring of the >> initiators if pacemaker knows the target is failed? I suppose I could >> modify the iscsi RA to return success for monitor in the case that the >> initiator is attempting to reconnect to the target, but then what if >> actually the initiator has failed, and the target is operational? What >> then about race conditions that might exist in cases where the target >> has failed, but pacemaker has not yet detected the target failure though >> a monitor operation? > > I've only tested this a little, so please take it as a general > suggestion rather than strong advice. > > I created a two-node cluster, using red hat's high-availability add-on, > using DRBD to keep the data replicated between the two "SAN" nodes and > tgtd to export the LUNs. I had a virtual IP on the cluster to act as the > target IP and I had DRBD in dual-primary mode with clustered LVM (so I > had DRBD as the PV and exported the space from the LVs). > > Then I built a second cluster of five nodes to host KVM VMs. The > underlying nodes used clustered LVM as well, but this time the LUNs was > the PV. I carved this up into an LV per VM and made the VMs the HA > service. Again using RH HA-Addon. > > In this setup, I was able to fail over the SAN without losing any VMs. I > even messed up the fencing on the SAN cluster once, which meant it took >> 30s to fail over, and I didn't lose the VMs. So to the minimal extent I > tested it, it worked excellently. > > I have some very rough notes on this setup. They're not fit for public > consumption at all, but if you'd like I'll send them to you directly. > They include the configurations which might help as a template or similar. 
>
> Digimer

Oh woops, I just realized this was the pacemaker list, not the general Linux Clustering list. Heh.

Doing all of the management using pacemaker instead of RH's HA-Addon should be just fine, too.

--
Digimer
Papers and Projects: https://alteeve.com
[Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
I'm designing a cluster to run both iSCSI targets and initiators to ultimately provide block devices to virtual machines. I'm considering the case of a target failure, and how to handle that as gracefully as possible. Ideally, IO may be paused until the target recovers, but VMs do not restart or see IO errors.

I've observed that the iscsi RA will configure the initiator to retry connections indefinitely if the target should fail. This is mostly good, except that if the initiator is in the retrying state, the monitor action will return an error.

The Right Thing to do in this case, I would think, would be to just wait. Of course the initiators can't work if the target is down, but the initiators will recover automatically when the target recovers. Ideally the cluster would wait for the target (which it also manages) to recover, then try again to monitor the initiators. For good measure, it might try monitoring the initiators a couple of times, since it can take them a moment to reconnect.

Unfortunately, what actually happens is that the monitor action on the initiator fails. Pacemaker then attempts to stop the initiator, and that also fails, because the target is still unavailable. Then the initiator node gets STONITHed, taking out all the hosted VMs with it.

I added a mandatory, non-symmetrical order constraint of target -> initiator, so at least Pacemaker will not attempt to re-start the initiator after a target failure. I made it asymmetrical so that restarts of the target do not force restarts of the initiator. However, it doesn't do much to help the failed-target case.

What's a good solution? Is there some way to suspend monitoring of the initiators if pacemaker knows the target is failed? I suppose I could modify the iscsi RA to return success for monitor in the case that the initiator is attempting to reconnect to the target, but then what if actually the initiator has failed, and the target is operational? What then about race conditions that might exist in cases where the target has failed, but pacemaker has not yet detected the target failure through a monitor operation?
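(For reference, an order constraint of the kind described above might look like the following in the crm shell; the resource names are placeholders, and symmetrical=false is what keeps target restarts from forcing initiator restarts.)

    # hypothetical resource names
    crm configure order o_target_before_initiator inf: p_iscsi_target p_iscsi_initiator symmetrical=false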
Re: [Pacemaker] drbd under pacemaker - always get split brain
> It is not.
>
> Pacemaker may just be quicker to promote now,
> or in your setup other things may have changed
> which also changed the timing behaviour.
>
> But what you are trying to do has always been broken,
> and will always be broken.

Hello Lars,

You were right, fixing the configuration indeed fixed my issue. I humbly apologize for my ignorance and will further immerse myself in the documentation :)

Have a nice day!

nik

--
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobile: +420 777 093 799
www.linuxbox.cz

mobile service: +420 737 238 656
email service: ser...@linuxbox.cz