[Pacemaker] howto group resources without having an order

2013-11-25 Thread Bauer, Stefan (IZLBW Extern)
Dear Developers & Users,

I have 4 resources: p_eth0, p_conntrackd, p_openvpn1 and p_openvpn2.

Right now, I use a group and a colocation constraint so that p_eth0 and p_conntrackd
start in the right order (first eth0, then conntrackd).
I now want to also include p_openvpn1 and p_openvpn2, but without putting them in
any order. That means: running on the same cluster node, but independent of each other.

Starting p_openvpn2 should not depend on p_openvpn1 (that is the default
behaviour, IIRC, without groups/order constraints).
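
For illustration, this is roughly the kind of constraint setup I have in
mind (crm shell syntax; the group and constraint IDs are just placeholders,
and I have not tested this):

group g_base p_eth0 p_conntrackd
colocation col_vpn1 inf: p_openvpn1 g_base
colocation col_vpn2 inf: p_openvpn2 g_base
order ord_vpn1 inf: g_base p_openvpn1
order ord_vpn2 inf: g_base p_openvpn2

If I read the documentation correctly, the two colocation constraints keep
both OpenVPN instances on the node that runs the group, and the two
separate order constraints start each of them after the group without
creating any ordering between openvpn1 and openvpn2.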

Any help is greatly appreciated.

Best regards

Stefan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Beginner Question: not able to shutdown 2nd node

2013-11-25 Thread T.J. Yang
On Mon, Nov 25, 2013 at 8:44 PM, Digimer  wrote:

> On 25/11/13 21:18, T.J. Yang wrote:
> > Hi
> >
> > I need help here. It looks like I missed a step to start up the two nodes
> > so they listen on port 2224?
> >
> > [root@ilclpm01 ~]# pcs --version
> > 0.9.90
> > [root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
> > Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
> > Data: None
> > Response Reason: [Errno 111] Connection refused
> > Error: unable to stop all nodes
> > Unable to connect to ilclpm02 ([Errno 111] Connection refused)
>
> Is pcsd running?
>
> If this is RHEL / CentOS 6, then I do not believe pcsd works.
>

Hi Digimer,

Thanks for responding to my question.
I can't find a pcsd binary in the three packages I installed:

[root@ilclpm01 ~]# rpm -qil pcs cman pacemaker |grep pcsd
[root@ilclpm01 ~]#

Following are more details about my test cluster:


 3617 ?        SLsl   0:07 corosync -f
 3674 ?        Ssl    0:00 fenced
 3690 ?        Ssl    0:00 dlm_controld
 3749 ?        Ssl    0:00 gfs_controld
 3832 pts/0    S      0:01 pacemakerd
 3838 ?        Ss     0:01  \_ /usr/libexec/pacemaker/cib
 3839 ?        Ss     0:01  \_ /usr/libexec/pacemaker/stonithd
 3840 ?        Ss     0:02  \_ /usr/libexec/pacemaker/lrmd
 3841 ?        Ss     0:01  \_ /usr/libexec/pacemaker/attrd
 3842 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pengine
 3843 ?        Ss     0:01  \_ /usr/libexec/pacemaker/crmd


[root@ilclpm01 ~]# rpm -q cman
cman-3.0.12.1-59.el6.x86_64
[root@ilclpm01 ~]# rpm -q pacemaker
pacemaker-1.1.10-14.el6.x86_64
[root@ilclpm01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@ilclpm01 ~]#
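
In case it matters, this is roughly the fallback I plan to try next
(assuming the stock EL6 init scripts; not verified yet):

# confirm pcsd really is not shipped / not running on the peer
ssh root@ilclpm02 'rpm -q pcsd; service pcsd status'

# stop the stack locally on that node instead of via pcs over port 2224
ssh root@ilclpm02 'service pacemaker stop && service cman stop'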





-- 
T.J. Yang


Re: [Pacemaker] Beginner Question: not able to shutdown 2nd node

2013-11-25 Thread Digimer
On 25/11/13 21:18, T.J. Yang wrote:
> Hi 
> 
> I need help here. It looks like I missed a step to start up the two nodes
> so they listen on port 2224?
> 
> [root@ilclpm01 ~]# pcs --version
> 0.9.90
> [root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
> Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
> Data: None
> Response Reason: [Errno 111] Connection refused
> Error: unable to stop all nodes
> Unable to connect to ilclpm02 ([Errno 111] Connection refused)

Is pcsd running?

If this is RHEL / CentOS 6, then I do not believe pcsd works.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[Pacemaker] Beginner Question: not able to shutdown 2nd node

2013-11-25 Thread T.J. Yang
Hi

I need help here. It looks like I missed a step to start up the two nodes so
they listen on port 2224?

[root@ilclpm01 ~]# pcs --version
0.9.90
[root@ilclpm01 ~]# pcs --debug cluster stop ilclpm02
Sending HTTP Request to: https://ilclpm02:2224/remote/cluster_stop
Data: None
Response Reason: [Errno 111] Connection refused
Error: unable to stop all nodes
Unable to connect to ilclpm02 ([Errno 111] Connection refused)
[root@ilclpm01 ~]# pcs status
Cluster name: pacemaker1
Last updated: Mon Nov 25 20:12:49 2013
Last change: Mon Nov 25 19:07:52 2013 via cibadmin on ilclpm01
Stack: cman
Current DC: ilclpm02 - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
2 Resources configured


Online: [ ilclpm01 ilclpm02 ]

Full list of resources:

 my_first_svc (ocf::pacemaker:Dummy): Started ilclpm02
 ClusterIP (ocf::heartbeat:IPaddr2): Started ilclpm01


[root@ilclpm01 ~]#

[root@ilclpm01 ~]# nmap  ilclpm02

Starting Nmap 5.51 ( http://nmap.org ) at 2013-11-25 20:16 CST
Nmap scan report for ilclpm02 (100.64.16.102)
Host is up (0.82s latency).
rDNS record for 10.64.16.102: ilclpm02.test.net
Not shown: 998 closed ports
PORT    STATE SERVICE
22/tcp  open  ssh
111/tcp open  rpcbind
MAC Address: 00:50:56:xx:xx:CA (VMware)

Nmap done: 1 IP address (1 host up) scanned in 5.66 seconds
[root@ilclpm01 ~]#


-- 
T.J. Yang


Re: [Pacemaker] Pacemaker very often STONITHs other node

2013-11-25 Thread Michał Margula

On 25.11.2013 18:25, Digimer wrote:

> I'd like to see the full logs, starting from a little before the issue
> started.



Here are the logs from Nov 17 until Nov 24 (my pastebin is too small to
handle them):


Node A - https://www.dropbox.com/sh/dj08fbckj9zo104/Ew1QpdRq9A/A.log
Node B - https://www.dropbox.com/sh/dj08fbckj9zo104/p9ldlBkGkG/B.log


> It looks though like, for whatever reason, a stop was called, failed, so
> the node was fenced. This would mean that congestion, as you suggested,
> is not the likely cause.
>
> Out of curiosity though; what bonding mode are you using? My testing
> showed that only mode=1 was reliable. Since I tested, corosync added
> support for mode=0 and mode=2, but I've not re-tested them. When I was
> doing my bonding tests, I found all other modes to break communications
> in some manner of use or failure/recovery testing.




I use 802.3ad mode (so it is mode 4):

auto bond0
iface bond0 inet static
slaves eth4 eth5
bond-mode 802.3ad
bond-lacp_rate fast
bond-miimon 100
bond-downdelay 200
bond-updelay 200
address 10.0.0.1
netmask 255.255.255.0
broadcast 10.0.0.255

Do you think that could be the reason - I mean, the wrong bonding mode
causing communication issues?
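
If mode=1 (active-backup) really is the safer choice for this link, I guess
the configuration would become roughly the following (same interfaces and
addresses as above, not tested yet):

auto bond0
iface bond0 inet static
slaves eth4 eth5
bond-mode active-backup
bond-miimon 100
bond-downdelay 200
bond-updelay 200
address 10.0.0.1
netmask 255.255.255.0
broadcast 10.0.0.255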


Thank you once more!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]



Re: [Pacemaker] Pacemaker very often STONITHs other node

2013-11-25 Thread Digimer
On 25/11/13 10:39, Michał Margula wrote:
> On 25.11.2013 15:44, Digimer wrote:
>> My first thought is that the network is congested. That is a lot of
>> servers to have on the system. Do you or can you isolate the corosync
>> traffic from the drbd traffic?
>>
>> Personally, I always setup a dedicated network for corosync, another for
>> drbd and a third for all traffic to/from the servers. With this, I have
>> never had a congestion-based problem.
>>
>> If possible, please paste all logs from both nodes, starting just before
>> the stonith occurred until recovery completed.
>>
> 
> Hello,
> 
> DRBD and CRM go over dedicated link (bonded two gigabit links into one).
> It is never saturated nor congested, it barely reaches 300 Mbps in
> highest points. I have a separate link for traffic from/to virtual
> machines and also separate link to manage nodes (just for SSH, SNMP). I
> can isolate corosync to separate link but it could take some time to do.
> 
> Now logs...
> 
> Trouble started at November 23, 15:14.
> Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
> Node B: http://pastebin.com/nwbctcgg
> 
> Node B is the one that got hit by STONITH. It got killed at 15:18:50. I
> have some trouble understanding reasons for that.
> 
> Is the reason for the STONITH that those operations took a long time to finish?
> 
> Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[114] on XEN-piaskownica for client 9529 stayed in
> operation list for 24760 ms (longer than 1 ms)
> Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
> operation list for 25760 ms (longer than 1 ms)
> Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[116] on XEN-frodo for client 9529 stayed in operation
> list for 50760 ms (longer than 1 ms)
> 
> But I wonder what made it stop those virtual machines in the first place?
> Another clue is here:
> 
> Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
> reduce operation contention either by increasing lrmd max_children or by
> increasing intervals of monitor operations
> 
> And here:
> 
> coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
> rivendell-B: not running (7)
> 
> But why "not running"? That is not really true. Also some trouble with
> fencing:
> 
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
> rivendell-A: unknown error (1)
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
> after 100 failures (max=100)
> 
> Thank you!
> 

I'd like to see the full logs, starting from a little before the issue
started.

It looks though like, for whatever reason, a stop was called, failed, so
the node was fenced. This would mean that congestion, as you suggested,
is not the likely cause.

Out of curiosity though; what bonding mode are you using? My testing
showed that only mode=1 was reliable. Since I tested, corosync added
support for mode=0 and mode=2, but I've not re-tested them. When I was
doing my bonding tests, I found all other modes to break communications
in some manner of use or failure/recovery testing.


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [Pacemaker] Pacemaker very often STONITHs other node

2013-11-25 Thread Michał Margula

On 25.11.2013 15:44, Digimer wrote:

> My first thought is that the network is congested. That is a lot of
> servers to have on the system. Do you or can you isolate the corosync
> traffic from the drbd traffic?
>
> Personally, I always set up a dedicated network for corosync, another for
> drbd and a third for all traffic to/from the servers. With this, I have
> never had a congestion-based problem.
>
> If possible, please paste all logs from both nodes, starting just before
> the stonith occurred until recovery completed.



Hello,

DRBD and CRM go over a dedicated link (two gigabit links bonded into one).
It is never saturated nor congested; it barely reaches 300 Mbps at its
highest points. I have a separate link for traffic from/to the virtual
machines and also a separate link to manage the nodes (just for SSH and SNMP).
I can isolate corosync onto a separate link, but it would take some time to do.


Now logs...

Trouble started on November 23 at 15:14.
Here is a log from "A" node: http://pastebin.com/yM1fqvQ6
Node B: http://pastebin.com/nwbctcgg

Node B is the one that got hit by STONITH. It got killed at 15:18:50. I
have some trouble understanding the reasons for that.


Is the reason for the STONITH that those operations took a long time to finish?

Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the 
operation stop[114] on XEN-piaskownica for client 9529 stayed in 
operation list for 24760 ms (longer than 1 ms)
Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the 
operation stop[115] on XEN-acsystemy01 for client 9529 stayed in 
operation list for 25760 ms (longer than 1 ms)
Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the 
operation stop[116] on XEN-frodo for client 9529 stayed in operation 
list for 50760 ms (longer than 1 ms)
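
If those stop operations legitimately take that long, maybe the operation
timeouts on our Xen primitives are simply too tight. Raising them would
look roughly like this in crm shell (the xmfile path and the timeout values
below are only an example, untested):

primitive XEN-frodo ocf:heartbeat:Xen \
        params xmfile="/etc/xen/frodo.cfg" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="300s" \
        op monitor interval="60s" timeout="60s"

I am not sure though whether a longer stop timeout would have prevented the
fence here or only delayed it.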


But I wonder what made it stop those virtual machines in the first place?
Another clue is here:


Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice: 
reduce operation contention either by increasing lrmd max_children or by 
increasing intervals of monitor operations


And here:

coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN: 
unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on 
rivendell-B: not running (7)


But why "not running"? That is not really true. Also some trouble with
fencing:


coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: 
unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on 
rivendell-A: unknown error (1)
coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN: 
common_apply_stickiness: Forcing fencing-of-B away from rivendell-A 
after 100 failures (max=100)


Thank you!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]



Re: [Pacemaker] Pacemaker very often STONITHs other node

2013-11-25 Thread Digimer
On 25/11/13 06:40, Michał Margula wrote:
> Hello!
> 
> I wanted to ask for your help because we are having a lot of trouble with
> a cluster based on Pacemaker.
> 
> We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB
> of RAM, MegaRAID SAS 2108 RAID (PERC H700) - system disk - RAID 1 on
> SSDs (SSDSC2CW060A3) and two volumes - one RAID 1 with WD3000FYYZ and
> one RAID 1 with WD1002FBYS -- both Western Digital disks. Both nodes are
> linked with two gigabit direct fiber links (no switch in between).
> 
> We have two DRBD volumes - /dev/drbd1 (1TB on WD1002FBYS disks) and
> /dev/drbd2 (3TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we
> have a LVM with LVs for virtual machines which run under XEN.
> 
> Here is our CRM configuration - http://pastebin.com/raqsvRTA
> 
> We have previously used fast USB drives instead of SSD for root
> filesystem and it caused some trouble - it was lagging on I/O and one
> node "thought" that another one was having trouble and performing
> STONITH on it. After replacing it with SSDs we had no more trouble with
> that issue.
> 
> But now, from time to time, one of the nodes gets STONITHed, and the
> reason is unclear to us.
> 
> For example last time we found it in logs:
> 
> Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM
> operation primitive-LVM:1_monitor_12 (call=54, rc=7, cib-update=124,
> confirmed=false) not running
> 
> And after that node rivendell-B got STONITH. Previously we had trouble
> with DRBD - node stopped DRBD for no apparent reason and again -
> STONITH. Unfortunately we did not check logs that time.
> 
> Also, doing some tasks on one of the nodes (for example "crm resource
> migrate" of a few XEN virtual machines) can cause a STONITH.
> 
> Could you give us some hints? Maybe our configuration is wrong? To be
> honest we had no previous experience with HA clusters so we created it
> based on configuration.
> 
> It has been working for over a year now, but it is giving us headaches and
> we are wondering if we should drop Pacemaker and use something else (even
> manually stopping and starting the virtual machines comes to mind).
> 
> Thank you in advance!

My first thought is that the network is congested. That is a lot of
servers to have on the system. Do you or can you isolate the corosync
traffic from the drbd traffic?

Personally, I always set up a dedicated network for corosync, another for
drbd and a third for all traffic to/from the servers. With this, I have
never had a congestion-based problem.
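
In corosync.conf terms that just means binding the totem ring to the
dedicated subnet, roughly like this (the addresses below are made up as an
example):

totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}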

If possible, please paste all logs from both nodes, starting just before
the stonith occurred until recovery completed.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [Pacemaker] some questions about STONITH

2013-11-25 Thread Andrey Groshev
>...snip...
>>  I run the next test:
>>  #stonith_admin --reboot=dev-cluster2-node2
>>  The node reboots, but the resource doesn't start.
>>  In crm_mon status - Node dev-cluster2-node2 (172793105): pending.
>>  And it hangs there.
>
> That is *probably* a race - the node reboots too fast, or still
> communicates for a bit after the fence has supposedly completed (if it's
> not a reboot -nf, but a mere reboot). We have had problems here in the
> past.
>
> You may want to file a proper bug report with crm_report included, and
> preferably corosync/pacemaker debugging enabled.

It turns out that the node does not hang forever.
A timeout triggers after 20 minutes.
crm_report archive - http://send2me.ru/pen2.tar.bz2
Of course the logs contain many entries of this type:

pgsql:1: Breaking dependency loop at msPostgresql

But where this dependency comes from after the timeout, I do not understand.



[Pacemaker] Pacemaker very often STONITHs other node

2013-11-25 Thread Michał Margula

Hello!

I wanted to ask for your help because we are having a lot of trouble with a
cluster based on Pacemaker.


We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB 
of RAM, MegaRAID SAS 2108 RAID (PERC H700) - system disk - RAID 1 on 
SSDs (SSDSC2CW060A3) and two volumes - one RAID 1 with WD3000FYYZ and 
one RAID 1 with WD1002FBYS -- both Western Digital disks. Both nodes are 
linked with two gigabit direct fiber links (no switch in between).


We have two DRBD volumes - /dev/drbd1 (1TB on WD1002FBYS disks) and 
/dev/drbd2 (3TB on WD3000FYYZ disks). On top of DRBD (used as PVs) we
have LVM with LVs for the virtual machines, which run under XEN.


Here is our CRM configuration - http://pastebin.com/raqsvRTA

We previously used fast USB drives instead of SSDs for the root
filesystem, and it caused some trouble - they were lagging on I/O and one
node "thought" that the other one was having trouble and performed
STONITH on it. After replacing them with SSDs we had no more trouble with
that issue.


But now, from time to time, one of the nodes gets STONITHed, and the
reason is unclear to us.


For example, last time we found this in the logs:

Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM 
operation primitive-LVM:1_monitor_12 (call=54, rc=7, cib-update=124, 
confirmed=false) not running


And after that, node rivendell-B got STONITHed. Previously we had trouble
with DRBD - a node stopped DRBD for no apparent reason and again -
STONITH. Unfortunately we did not check the logs that time.


Also, doing some tasks on one of the nodes (for example "crm resource
migrate" of a few XEN virtual machines) can cause a STONITH.


Could you give us some hints? Maybe our configuration is wrong? To be
honest, we had no previous experience with HA clusters, so we created it
based on example configurations.


It has been working for over a year now, but it is giving us headaches and
we are wondering if we should drop Pacemaker and use something else (even
manually stopping and starting the virtual machines comes to mind).


Thank you in advance!

--
Michał Margula, alche...@uznam.net.pl, http://alchemyx.uznam.net.pl/
"W życiu piękne są tylko chwile" [Ryszard Riedel]



Re: [Pacemaker] Need to relax corosync due to backup of VM through snapshot

2013-11-25 Thread Gianluca Cecchi
On Sun, Nov 24, 2013 at 4:47 PM, Steven Dake wrote:

> Using a real-world example
> token: 1
> retrans_before_loss_const: 10
>
> token will be retransmitted roughly every 1000 msec and the token will be
> determined lost after 1msec.
>

OK, thank you very much for clarifying this.
I also took the time to post a comment on the OpenStack HA manual page,
quoting your words.
Gianluca
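
For anyone finding this in the archives later, the corresponding totem
settings would look something like this in corosync.conf (the values are
only illustrative):

totem {
    version: 2
    # token loss timeout, in milliseconds
    token: 10000
    # number of token retransmits attempted before the token is declared lost
    token_retransmits_before_loss_const: 10
}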
