Re: [ClusterLabs] Major problem with iSCSITarget resource on top of DRBD M/S resource.

2015-09-27 Thread Digimer
On 27/09/15 11:02 AM, Alex Crow wrote:
> 
> 
> On 27/09/15 15:54, Digimer wrote:
>> On 27/09/15 10:40 AM, Alex Crow wrote:
>>> Hi List,
>>>
>>> I'm trying to set up a failover iSCSI storage system for oVirt using a
>>> self-hosted engine. I've set up DRBD in Master-Slave for two iSCSI
>>> targets, one for the self-hosted engine and one for the VMs. I had this
>>> all working perfectly, then after trying to move the engine's LUN to the
>>> opposite host, all hell broke loose. The VMs LUN is still fine, starts
>> I'm guessing no fencing?
> 
> Hi Digimer,
> 
> No, but I've tried turning off one machine and still no success as a
> single node :-(

You *must* have working fencing anyway, so now strikes me as a
fantastic time to add it. Simply turning a node off doesn't help.
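For example, with IPMI-style management interfaces, something along these
lines would do (a sketch only; the agent, addresses and credentials are
placeholders -- use whatever fence agent matches the actual hardware):

  pcs stonith create fence-glenrock fence_ipmilan \
      pcmk_host_list="glenrock" ipaddr="10.0.0.1" login="admin" \
      passwd="secret" lanplus="true" op monitor interval="60s"
  pcs stonith create fence-granby fence_ipmilan \
      pcmk_host_list="granby" ipaddr="10.0.0.2" login="admin" \
      passwd="secret" lanplus="true" op monitor interval="60s"
  pcs property set stonith-enabled=true

  # with fencing in place, clear the failed stop so the target is managed again
  pcs resource cleanup iscsi-engine-target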

>>> and migrates as it should. However, the engine LUN always seems to try to
>>> launch the target on the host that is *NOT* the master of the DRBD
>>> resource. My constraints look fine, and should be self-explanatory about
>>> which is which:
>>>
>>> [root@granby ~]# pcs constraint --full
>>> Location Constraints:
>>> Ordering Constraints:
>>>promote drbd-vms-iscsi then start iscsi-vms-ip (kind:Mandatory)
>>> (id:vm_iscsi_ip_after_drbd)
>>>start iscsi-vms-target then start iscsi-vms-lun (kind:Mandatory)
>>> (id:vms_lun_after_target)
>>>promote drbd-vms-iscsi then start iscsi-vms-target (kind:Mandatory)
>>> (id:vms_target_after_drbd)
>>>promote drbd-engine-iscsi then start iscsi-engine-ip (kind:Mandatory)
>>> (id:ip_after_drbd)
>>>start iscsi-engine-target then start iscsi-engine-lun
>>> (kind:Mandatory)
>>> (id:lun_after_target)
>>>promote drbd-engine-iscsi then start iscsi-engine-target
>>> (kind:Mandatory) (id:target_after_drbd)
>>> Colocation Constraints:
>>>iscsi-vms-ip with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master) (id:vms_ip-with-drbd)
>>>iscsi-vms-lun with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master) (id:vms_lun-with-drbd)
>>>iscsi-vms-target with drbd-vms-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:vms_target-with-drbd)
>>>iscsi-engine-ip with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:ip-with-drbd)
>>>iscsi-engine-lun with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:lun-with-drbd)
>>>iscsi-engine-target with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:target-with-drbd)
>>>
>>> But see this from pcs status: the iSCSI target has FAILED on glenrock,
>>> but the DRBD master is on granby!
>>>
>>> [root@granby ~]# pcs status
>>> Cluster name: storage
>>> Last updated: Sun Sep 27 15:30:08 2015
>>> Last change: Sun Sep 27 15:20:58 2015
>>> Stack: cman
>>> Current DC: glenrock - partition with quorum
>>> Version: 1.1.11-97629de
>>> 2 Nodes configured
>>> 10 Resources configured
>>>
>>>
>>> Online: [ glenrock granby ]
>>>
>>> Full list of resources:
>>>
>>>  Master/Slave Set: drbd-vms-iscsi [drbd-vms]
>>>      Masters: [ glenrock ]
>>>      Slaves:  [ granby ]
>>>  iscsi-vms-target     (ocf::heartbeat:iSCSITarget):       Started glenrock
>>>  iscsi-vms-lun        (ocf::heartbeat:iSCSILogicalUnit):  Started glenrock
>>>  iscsi-vms-ip         (ocf::heartbeat:IPaddr2):           Started glenrock
>>>  Master/Slave Set: drbd-engine-iscsi [drbd-engine]
>>>      Masters: [ granby ]
>>>      Slaves:  [ glenrock ]
>>>  iscsi-engine-target  (ocf::heartbeat:iSCSITarget):       FAILED glenrock (unmanaged)
>>>  iscsi-engine-ip      (ocf::heartbeat:IPaddr2):           Stopped
>>>  iscsi-engine-lun     (ocf::heartbeat:iSCSILogicalUnit):  Stopped
>>>
>>> Failed actions:
>>>  iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>>> queued=0ms, exec=10003ms
>>>  iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>>> queued=0ms, exec=10003ms
>>>
>>> I have tried various combinations of pcs resource clear and cleanup, but
>>> they all result in the same outcome - except on some occasions when
>>> one or other of the two hosts suddenly reboots!
>>>
>>> Here is a log right after a "pcs resource cleanup" - first on the master
>>> for the DRBD m/s resource:
>>> [root@granby ~]# pcs resource cleanup; tail -f /var/log/messages
>>> All resources/stonith devices successfully cleaned up
>>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>>> granby-drbd-engine_monitor_0:117 [ \n ]
>>> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_perform_update: Sent
>>> update 54: probe_complete=true
>>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>>> Operation drbd-engine_monitor_1: 

Re: [ClusterLabs] Major problem with iSCSITarget resource on top of DRBD M/S resource.

2015-09-27 Thread Digimer
On 27/09/15 10:40 AM, Alex Crow wrote:
> Hi List,
> 
> I'm trying to set up a failover iSCSI storage system for oVirt using a
> self-hosted engine. I've set up DRBD in Master-Slave for two iSCSI
> targets, one for the self-hosted engine and one for the VMs. I had this
> all working perfectly, then after trying to move the engine's LUN to the
> opposite host, all hell broke loose. The VMs LUN is still fine, starts

I'm guessing no fencing?

> and migrates as it should. However, the engine LUN always seems to try to
> launch the target on the host that is *NOT* the master of the DRBD
> resource. My constraints look fine, and should be self-explanatory about
> which is which:
> 
> [root@granby ~]# pcs constraint --full
> Location Constraints:
> Ordering Constraints:
>   promote drbd-vms-iscsi then start iscsi-vms-ip (kind:Mandatory)
> (id:vm_iscsi_ip_after_drbd)
>   start iscsi-vms-target then start iscsi-vms-lun (kind:Mandatory)
> (id:vms_lun_after_target)
>   promote drbd-vms-iscsi then start iscsi-vms-target (kind:Mandatory)
> (id:vms_target_after_drbd)
>   promote drbd-engine-iscsi then start iscsi-engine-ip (kind:Mandatory)
> (id:ip_after_drbd)
>   start iscsi-engine-target then start iscsi-engine-lun (kind:Mandatory)
> (id:lun_after_target)
>   promote drbd-engine-iscsi then start iscsi-engine-target
> (kind:Mandatory) (id:target_after_drbd)
> Colocation Constraints:
>   iscsi-vms-ip with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master) (id:vms_ip-with-drbd)
>   iscsi-vms-lun with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master) (id:vms_lun-with-drbd)
>   iscsi-vms-target with drbd-vms-iscsi (score:INFINITY)
> (rsc-role:Started) (with-rsc-role:Master) (id:vms_target-with-drbd)
>   iscsi-engine-ip with drbd-engine-iscsi (score:INFINITY)
> (rsc-role:Started) (with-rsc-role:Master) (id:ip-with-drbd)
>   iscsi-engine-lun with drbd-engine-iscsi (score:INFINITY)
> (rsc-role:Started) (with-rsc-role:Master) (id:lun-with-drbd)
>   iscsi-engine-target with drbd-engine-iscsi (score:INFINITY)
> (rsc-role:Started) (with-rsc-role:Master) (id:target-with-drbd)
> 
> But see this from pcs status: the iSCSI target has FAILED on glenrock,
> but the DRBD master is on granby!
> 
> [root@granby ~]# pcs status
> Cluster name: storage
> Last updated: Sun Sep 27 15:30:08 2015
> Last change: Sun Sep 27 15:20:58 2015
> Stack: cman
> Current DC: glenrock - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured
> 10 Resources configured
> 
> 
> Online: [ glenrock granby ]
> 
> Full list of resources:
> 
>  Master/Slave Set: drbd-vms-iscsi [drbd-vms]
>      Masters: [ glenrock ]
>      Slaves:  [ granby ]
>  iscsi-vms-target     (ocf::heartbeat:iSCSITarget):       Started glenrock
>  iscsi-vms-lun        (ocf::heartbeat:iSCSILogicalUnit):  Started glenrock
>  iscsi-vms-ip         (ocf::heartbeat:IPaddr2):           Started glenrock
>  Master/Slave Set: drbd-engine-iscsi [drbd-engine]
>      Masters: [ granby ]
>      Slaves:  [ glenrock ]
>  iscsi-engine-target  (ocf::heartbeat:iSCSITarget):       FAILED glenrock (unmanaged)
>  iscsi-engine-ip      (ocf::heartbeat:IPaddr2):           Stopped
>  iscsi-engine-lun     (ocf::heartbeat:iSCSILogicalUnit):  Stopped
> 
> Failed actions:
> iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
> queued=0ms, exec=10003ms
> iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
> queued=0ms, exec=10003ms
> 
> I have tried various combinations of pcs resource clear and cleanup, but
> they all result in the same outcome - except on some occasions when
> one or other of the two hosts suddenly reboots!
> 
> Here is a log right after a "pcs resource cleanup" - first on the master
> for the DRBD m/s resource:
> [root@granby ~]# pcs resource cleanup; tail -f /var/log/messages
> All resources/stonith devices successfully cleaned up
> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
> granby-drbd-engine_monitor_0:117 [ \n ]
> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_perform_update: Sent
> update 54: probe_complete=true
> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
> Operation drbd-engine_monitor_1: master (node=granby, call=131,
> rc=8, cib-update=83, confirmed=false)
> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
> granby-drbd-engine_monitor_1:131 [ \n ]
> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
> Operation drbd-vms_monitor_2: ok (node=granby, call=130, rc=0,
> cib-update=84, confirmed=false)
> Sep 27 15:34:46 granby crmd[3358]:   notice: do_lrm_invoke: Forcing the
> status of all resources to be redetected
> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_trigger_update:
> 

Re: [ClusterLabs] [Linux-HA] fence_ec2 agent

2015-09-27 Thread 東一彦

Hi Dejan,

I made a patch file as a unified diff with the "hg export tip" command.
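For anyone reproducing this, the round trip is roughly as follows (the patch
file name is only an illustration):

  hg export tip > fence_ec2_port_fallback.patch   # in the contributor's clone
  hg import fence_ec2_port_fallback.patch         # in the maintainer's tree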

Would you please merge it?


Regards,
Kazuhiko Higashi

On 2015/09/25 0:04, Dejan Muhamedagic wrote:

Hi Kazuhiko-san,

On Wed, Mar 25, 2015 at 10:47:01AM +0900, 東一彦 wrote:

Hi Markus,

I implemented it as a trial.

[diff from http://hg.linux-ha.org/glue/rev/9da0680bc9c0 ]
50d49
< port_default=""
60c59
< ec2_tag=${tag}
---
> [ -n "$tag" ] && ec2_tag="$tag"

63d61
< : ${port=${port_default}}
97c95
<   
---

   

105c103
<   
---

   

132c130
<   
---

   

142c140
<   
---

   

221a220,224
> function monitor()
> {
>    # Is the device ok?
>    aws ec2 describe-instances $options | grep INSTANCES &> /dev/null
> }
267a271
> [ -n "$2" ] && node_to_fence=$2
326a331,334
> if [ -z "$port" ]; then
>    port="$node_to_fence"
> fi
379,380c387
<   # Is the device ok?
<   aws ec2 describe-instances $options | grep INSTANCES &> /dev/null
---
>    monitor
391c398
<   instance_status $instance > /dev/null
---
>    monitor


It works fine in my environment with the two patterns of settings below.

[pattern No.1]
Without the "port" and "tag" parameters.
And each instance has a "Name=" tag.


primitive prmStonith1-2 stonith:external/ec2 \
  params \
  pcmk_off_timeout="120s" \
  op start interval="0s" timeout="60s" \
  op monitor interval="3600s" timeout="60s" \
  op stop interval="0s" timeout="60s"



[pattern No.2]
With only the "tag" parameter (without the "port" parameter).
The 1st instance (node01) has a "Cluster1=node01" tag,
and the 2nd instance (node02) has a "Cluster1=node02" tag.


primitive prmStonith1-2 stonith:external/ec2 \
  params \
  pcmk_off_timeout="120s" \
  tag="Cluster1" \
  op start interval="0s" timeout="60s" \
  op monitor interval="3600s" timeout="60s" \
  op stop interval="0s" timeout="60s"



Sounds good. Sorry for the delay, but would it be possible for
you to provide a patch as a unified diff or similar so that we can
apply it?

Cheers,

Dejan



Regards,
Kazuhiko Higashi


On 2015/03/24 20:48, 東一彦 wrote:

Hi Markus,

Thank you for the comment.


Would it be possible to implement this idea as an additional configuration
method for the fence_ec2 agent?

I think that your idea is good.

So, I tried to implement it.
I'm going to change fence_ec2 (ec2) in the following ways:

  - the "tag" and the "port" options will be "not" required.

  - if the "port" option is not set, the 2nd argument of ec2 will use as the 
"port".
- the 2nd argument of ec2 is "node to fence".

  - the "stat" and "status" action will be same the "monitor" action.
(for do not use the "port" parameter in "stat" action.)


With the above modifications, if the uname is described in the Name tag,
setting the "tag" and "port" parameters is no longer necessary.
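In shell terms, the planned fallback looks roughly like this (a sketch of the
behaviour only, not the literal patch; the real agent receives "port", "tag"
and $options from its stonith configuration):

  action="$1"
  [ -n "$2" ] && node_to_fence="$2"

  # fall back to the node to fence when no "port" parameter is configured
  if [ -z "$port" ]; then
      port="$node_to_fence"
  fi

  case "$action" in
      monitor|stat|status)
          # plain device check, no "port" needed
          aws ec2 describe-instances $options | grep INSTANCES &> /dev/null
          ;;
  esac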


primitive prmStonith1-2 stonith:external/ec2 \
 params \
 pcmk_off_timeout="120s" \
 op start interval="0s" timeout="60s" \
 op monitor interval="3600s" timeout="60s" \
 op stop interval="0s" timeout="60s"



You can use the "tag" parameter like your "Clustername" tag.
If the cluster nodes (instances) have a "Cluster1" tag, and the uname is
described in that tag, it works just as you expect.


primitive prmStonith1-2 stonith:external/ec2 \
 params \
 pcmk_off_timeout="120s" \
 tag="Cluster1" \
 op start interval="0s" timeout="60s" \
 op monitor interval="3600s" timeout="60s" \
 op stop interval="0s" timeout="60s"


The 1st instance has a "Cluster1=node01" tag.
The 2nd instance has a "Cluster1=node02" tag.
The 3rd instance has a "Cluster1=node03" tag.
...
The prmStonith1-2 resource can then fence node01, node02 and node03.
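For reference, tags like these could be applied with the aws CLI (the
instance IDs below are placeholders):

  aws ec2 create-tags --resources i-11111111 --tags Key=Cluster1,Value=node01
  aws ec2 create-tags --resources i-22222222 --tags Key=Cluster1,Value=node02
  aws ec2 create-tags --resources i-33333333 --tags Key=Cluster1,Value=node03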


If you like the above, I will implement it.


Regards,
Kazuhiko Higashi


On 2015/03/19 1:03, Markus Guertler wrote:

Hi Kazuhiko, Dejan,

the new resource agent is very good. Since there were a couple of days between 
my original question and the answer from
Kazuhiko, I have also written a stonith agent proof of concept (attached to
this email) in order to continue with my
project. However, I think that your fence_ec2 agent is better from a 
development perspective and it doesn't make sense
to have two different agents for the same use case.

Nevertheless, I've implemented an idea that is very useful in EC2 environments
with clusters that have more than two
nodes: All EC2 instances that belong to a cluster get a unique cluster name via 
an EC2 instance tag. The agent uses this
tag to determine all the cluster nodes that belong to its own cluster:

--- SNIP ---
 gethosts)
 # List of hostnames of this cluster
 init_agent
 ec2-describe-instances --filter "tag-key=Clustername" --filter "tag-value=$clustername" | grep "^TAG" |grep