Re: [Linux-HA] pacemaker/corosync - cl_status . REASON: hb_api_signon: Can't initiate connection to heartbeat

2012-02-17 Thread Thomas Baumann
But which processes could these be? It seems SuSE-specific. Should I post an rpm -ql?
How can it be debugged? There are lots of recurring messages...

Sent from my tiriPhone.


Am 16.02.2012 um 12:57 schrieb Andrew Beekhof :

> On Thu, Feb 16, 2012 at 7:49 AM, Thomas Baumann  wrote:
>> Thanks for your info.
>> But which process might run this cl_status as I see these messages in
>> syslog nearly all the time ?
>
> I just assumed you were running it.
> Some sort of external monitoring script perhaps?
>
>>
>> Best regards,
>> Thomas.
>>
>> -----Original Message-----
>> From: linux-ha-boun...@lists.linux-ha.org
>> [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andrew Beekhof
>> Sent: Wednesday, February 15, 2012 11:13
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] pacemaker/corosync - cl_status . REASON:
>> hb_api_signon: Can't initiate connection to heartbeat
>>
>> On Wed, Feb 15, 2012 at 5:50 PM, Florian Haas  wrote:
>>> On 02/14/12 03:09, Andrew Beekhof wrote:
 On Tue, Feb 14, 2012 at 7:26 AM, Thomas Baumann  wrote:
> Hello list,
>
> In my current pacemaker/corosync installation in a 2 node cluster I
> get following error:
>
> # cl_status listnodes

 This is a heartbeat command; you're running corosync.  Try crm_node -p
>>>
>>> Or to get the straight-up Corosync view, "corosync-objctl | grep
>>> member".
>>
>> Unless you're using corosync 2.0, in which case it's:
>>
>> corosync-cmapctl | grep member
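
To summarize the membership commands mentioned in this thread (a quick
sketch; the exact output differs between versions):

  # heartbeat-based clusters only:
  cl_status listnodes

  # Pacemaker, regardless of the messaging layer underneath:
  crm_node -p

  # corosync 1.x object database:
  corosync-objctl | grep member

  # corosync 2.x cmap:
  corosync-cmapctl | grep member
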
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] MMM conflict with Pacemaker

2012-02-17 Thread Mark Grennan

Did I say all this is confusing? There is a really great talk on HA out there,
with a good history of Linux HA and Pacemaker starting at about 9:00 minutes
in. Some parts of these systems have been broken out into projects of their
own, while others have been combined. Cluster resources are at the heart of
what you want done, but they are also where some of the smoke and mirrors of
these packages comes in. Some just call your init scripts (/etc/init.d), some
call their own init scripts (Heartbeat), and still others expect you to write
your own.

> But pacemaker isn't even running on the machines the mmm float is on!

Remember, MMM (and MHA, for that matter) uses SSH, with certs, to reach out and
run things on remote systems. So where Pacemaker or MMM is running, and what
can be affected, is a bit tricky.
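
As a rough illustration of that (the host names and the address here are made
up, not taken from any setup in this thread), you can ask each box which one
actually carries a floating IP, regardless of which tool put it there.
Addresses added without an interface label do not show up in classic ifconfig
output, which is why ip(8) is the safer check:

  for h in db1 db2 mon1; do
      echo "== $h =="
      ssh "$h" ip -4 addr show | grep '192\.0\.2\.100'
  done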


> ...I can't understand why you're using a public IP for mcast, or why it's 
> even there at all

There are many refinements I could make to my setup document. I took some
shortcuts to help people just get it working, but not such big ones that they
would have to rebuild the system later to make it better. Networking is one of
them. I like to use multiple network interfaces to isolate the database
traffic from all the "systems" traffic. Multiple NICs are also good because
Pacemaker can check through a different path.
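
For what it's worth, one way to let the cluster stack itself use that second
path is corosync's redundant ring support. A minimal corosync.conf sketch (the
networks and multicast addresses below are placeholders, not values from my
HOWTO):

  totem {
      version: 2
      rrp_mode: passive      # passive: alternate rings; active: send on both
      interface {
          ringnumber: 0
          bindnetaddr: 10.10.0.0       # "systems" network
          mcastaddr: 239.255.42.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 10.20.0.0       # database network
          mcastaddr: 239.255.42.2
          mcastport: 5405
      }
  }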

> I'm still mystified by whether I should use ucast, mcast or bcast...

If I am using a crossover cable to connect two hosts together, I just broadcast
the heartbeat out of the appropriate interface (bcast eth3). If there are more
than two hosts in the Pacemaker cluster on the same private network, I use
mcast.
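
For heartbeat-based setups those choices map directly onto ha.cf directives.
A sketch, with the interface names, peer address and multicast parameters as
examples only:

  # two nodes on a crossover cable: broadcast on that interface
  bcast eth3

  # more than two hosts on a shared private segment: multicast
  # (interface, mcast group, port, ttl, loop)
  mcast eth1 239.255.42.1 694 1 0

  # routed networks, or switches that block bcast/mcast: unicast per peer
  ucast eth0 10.0.0.2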

> ...can't understand why you're using a public IP for mcast...

I'm using mcast because it's the best way to talk to multiple nodes, and I
expect some people will try that. 239.255.42.0 is not a public IP
(http://tldp.org/HOWTO/Multicast-HOWTO-2.html). The range 224.0.0.0 -
239.255.255.255 is reserved for multicast, and within it the range 239.0.0.0 -
239.255.255.255 is reserved for administratively scoped (site-local) use.

> My mmm config was originally installed by Percona, and I've done several 
> others since.

MMM was the way to go until just recently. If it's working for you, keep using
it. But it may already be at its end of life. Here is another resource
(http://technocation.org/content/oursql-episode-67%3A-ha-and-replication) on
what's been happening.

> One critical aspect of an HA system is that it should be really easy to deal 
> with when things go wrong;

This may be the biggest problem with HA/MySQL systems. If you can't fix it when
it breaks, what good is it? And complexity is the enemy of reliability.

 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Understanding the behavior of IPaddr2 clone

2012-02-17 Thread Dejan Muhamedagic
On Fri, Feb 17, 2012 at 01:15:04PM +0100, Dejan Muhamedagic wrote:
> On Fri, Feb 17, 2012 at 12:13:49PM +1100, Andrew Beekhof wrote:
[...]
> > What about notifications?  That would be the right point to
> > re-configure things, I'd have thought.
> 
> Sounds like the right way. Still, it may be hard to coordinate
> between different instances. Unless we figure out how to map
> nodes to numbers used by the CLUSTERIP. For instance, the notify
> operation gets:
> 
> OCF_RESKEY_CRM_meta_notify_stop_resource="ip_lb:2 "
> OCF_RESKEY_CRM_meta_notify_stop_uname="xen-f "
> 
> But the instance number may not match the node number from

Scratch that.

# from the IPaddr2 agent: an instance's CLUSTERIP node number
# is simply its clone number + 1
IP_CIP_FILE="/proc/net/ipt_CLUSTERIP/$OCF_RESKEY_ip"
IP_INC_NO=`expr ${OCF_RESKEY_CRM_meta_clone:-0} + 1`
...
echo "+$IP_INC_NO" >$IP_CIP_FILE

> /proc/net/ipt_CLUSTERIP/ and that's where we should add the
> node. It should be something like:
> 
> notify() {
>   if node_down; then
>   echo "+node_num" >> /proc/net/ipt_CLUSTERIP/
>   elif node_up; then
>   echo "-node_num" >> /proc/net/ipt_CLUSTERIP/
>   fi
> }
> 
> Another issue is that the above code should be executed on
> _exactly_ one node.

OK, I guess that'd also be doable by checking the following
variables:

OCF_RESKEY_CRM_meta_notify_inactive_resource (set of
currently inactive instances)
OCF_RESKEY_CRM_meta_notify_stop_resource (set of
instances which were just stopped)

Any volunteers for a patch? :)
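
In case it helps whoever picks this up: a very rough, untested sketch of such
a notify handler, using the variables above and the clone number to CLUSTERIP
node number mapping shown earlier (clone number + 1). It assumes the usual OCF
shell environment; is_lowest_active is a made-up helper and only a naive
answer to the "exactly one node" problem.

  # return 0 only on the lowest-numbered instance that is still active,
  # so the bucket changes happen on exactly one node
  is_lowest_active() {
      local me="${OCF_RESOURCE_INSTANCE##*:}" other
      for other in $OCF_RESKEY_CRM_meta_notify_active_resource; do
          [ "${other##*:}" -lt "$me" ] && return 1
      done
      return 0
  }

  ip_notify() {
      local cip="/proc/net/ipt_CLUSTERIP/$OCF_RESKEY_ip"
      local op="$OCF_RESKEY_CRM_meta_notify_type-$OCF_RESKEY_CRM_meta_notify_operation"
      local inst bucket

      case "$op" in
      post-stop)
          is_lowest_active || return $OCF_SUCCESS
          # peers were just stopped: take over their CLUSTERIP buckets
          for inst in $OCF_RESKEY_CRM_meta_notify_stop_resource; do
              bucket=$(( ${inst##*:} + 1 ))
              echo "+$bucket" >"$cip"
          done
          ;;
      post-start)
          is_lowest_active || return $OCF_SUCCESS
          # peers came back: give their buckets back
          for inst in $OCF_RESKEY_CRM_meta_notify_start_resource; do
              bucket=$(( ${inst##*:} + 1 ))
              echo "-$bucket" >"$cip"
          done
          ;;
      esac
      return $OCF_SUCCESS
  }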

Thanks,

Dejan

> Cheers,
> 
> Dejan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Understanding the behavior of IPaddr2 clone

2012-02-17 Thread Dejan Muhamedagic
On Fri, Feb 17, 2012 at 12:13:49PM +1100, Andrew Beekhof wrote:
> On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagic  
> wrote:
> > Hi,
> >
> > On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote:
> >> On 2/10/12 4:53 PM, William Seligman wrote:
> >> > I'm trying to set up an Active/Active cluster (yes, I hear the sounds of 
> >> > kittens
> >> > dying). Versions:
> >> >
> >> > Scientific Linux 6.2
> >> > pacemaker-1.1.6
> >> > resource-agents-3.9.2
> >> >
> >> > I'm using cloned IPaddr2 resources:
> >> >
> >> > primitive ClusterIP ocf:heartbeat:IPaddr2 \
> >> >         params ip="129.236.252.13" cidr_netmask="32" \
> >> >         op monitor interval="30s"
> >> > primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
> >> >         params ip="10.44.7.13" cidr_netmask="32" \
> >> >         op monitor interval="31s"
> >> > primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
> >> >         params ip="10.43.7.13" cidr_netmask="32" \
> >> >         op monitor interval="32s"
> >> > group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
> >> > clone ClusterIPClone ClusterIPGroup
> >> >
> >> > When both nodes of my two-node cluster are running, everything looks and
> >> > functions OK. From "service iptables status" on node 1 (hypatia-tb):
> >> >
> >> > 5    CLUSTERIP  all  --  0.0.0.0/0            10.43.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2
> >> > local_node=1 hash_init=0
> >> > 6    CLUSTERIP  all  --  0.0.0.0/0            10.44.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2
> >> > local_node=1 hash_init=0
> >> > 7    CLUSTERIP  all  --  0.0.0.0/0            129.236.252.13      
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2
> >> > local_node=1 hash_init=0
> >> >
> >> > On node 2 (orestes-tb):
> >> >
> >> > 5    CLUSTERIP  all  --  0.0.0.0/0            10.43.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2
> >> > local_node=2 hash_init=0
> >> > 6    CLUSTERIP  all  --  0.0.0.0/0            10.44.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2
> >> > local_node=2 hash_init=0
> >> > 7    CLUSTERIP  all  --  0.0.0.0/0            129.236.252.13      
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2
> >> > local_node=2 hash_init=0
> >> >
> >> > If I do a simple test of ssh'ing into 129.236.252.13, I see that I 
> >> > alternately
> >> > login into hypatia-tb and orestes-tb. All is good.
> >> >
> >> > Now take orestes-tb offline. The iptables rules on hypatia-tb are 
> >> > unchanged:
> >> >
> >> > 5    CLUSTERIP  all  --  0.0.0.0/0            10.43.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2
> >> > local_node=1 hash_init=0
> >> > 6    CLUSTERIP  all  --  0.0.0.0/0            10.44.7.13          
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2
> >> > local_node=1 hash_init=0
> >> > 7    CLUSTERIP  all  --  0.0.0.0/0            129.236.252.13      
> >> > CLUSTERIP
> >> > hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2
> >> > local_node=1 hash_init=0
> >> >
> >> > If I attempt to ssh to 129.236.252.13, whether or not I get in seems to 
> >> > be
> >> > machine-dependent. On one machine I get in, from another I get a 
> >> > time-out. Both
> >> > machines show the same MAC address for 129.236.252.13:
> >> >
> >> > arp 129.236.252.13
> >> > Address                  HWtype  HWaddress           Flags Mask          
> >> >   Iface
> >> > hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79   C                   
> >> >   eth0
> >> >
> >> > Is this the way the cloned IPaddr2 resource is supposed to behave in the 
> >> > event
> >> > of a node failure, or have I set things up incorrectly?
> >>
> >> I spent some time looking over the IPaddr2 script. As far as I can tell, 
> >> the
> >> script has no mechanism for reconfiguring iptables in the event of a 
> >> change of
> >> state in the number of clones.
> >>
> >> I might be stupid -- er -- dedicated enough to make this change on my own, 
> >> then
> >> share the code with the appropriate group. The change seems to be 
> >> relatively
> >> simple. It would be in the monitor operation. In pseudo-code:
> >>
> >> if (  ) then
> >>   if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last 
> >> time
> >>     || OCF_RESKEY_CRM_meta_clone     != OCF_RESKEY_CRM_meta_clone last 
> >> time )
> >>     ip_stop
> >>     ip_start
> >
> > Just changing the iptables entries should suffice, right?
> > Besides, doing stop/start in the monitor is sort of unexpected.
> > Another option is to add the missing node to one of the nodes
> > which are still running (echo "+" >>
> > /pr

Re: [Linux-HA] Understanding the behavior of IPaddr2 clone

2012-02-17 Thread Dejan Muhamedagic
On Thu, Feb 16, 2012 at 11:14:37PM -0500, William Seligman wrote:
> On 2/16/12 8:13 PM, Andrew Beekhof wrote:
> >On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagic  
> >wrote:
> >>Hi,
> >>
> >>On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote:
> >>>On 2/10/12 4:53 PM, William Seligman wrote:
[...]

Re: [Linux-HA] MMM conflict with Pacemaker

2012-02-17 Thread Andrew Beekhof
On Fri, Feb 17, 2012 at 4:00 AM, Mark Grennan  wrote:
> Hi Marcus,
>
> One issue I can think of is that Pacemaker wants to bind the floating IP as
> eth#:#, while MMM uses a different method that can only be seen with the ip
> command.  I think they are fighting over who owns the floating IP.
>
> Have you read my full HOWTO at 
> http://www.mysqlfanboy.com/2012/02/the-full-monty-version-2-3/ ?
>
> Yes, HA systems are very confusing.  Pacemaker is the name of an older
> application.  Corosync is its new name, but some of the files still maintain
> the old name.

Not quite. Pacemaker uses Corosync to send messages to instances of
itself on other nodes.

Recommended reading:
http://theclusterguy.clusterlabs.org/post/1262495133/pacemaker-heartbeat-corosync-wtf
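
For context: on the corosync 1.x stack of that era, Pacemaker is loaded by
corosync through a service declaration. A minimal sketch (the stanza usually
lives in corosync.conf or /etc/corosync/service.d/, depending on the
distribution):

  service {
      # ver: 0 loads Pacemaker as a corosync plugin;
      # ver: 1 expects pacemakerd to be started separately
      name: pacemaker
      ver:  0
  }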

>
> You should know, even the developer of MMM has abandoned it.  The author of
> MMM (Alexey Kovyrin) said in a reply to a blog post: “…Every time I try to
> add HA to my clusters I remember MMM and want to stab myself because I simply
> could not trust my data to the tool…”.  Read
> (http://www.xaprb.com/blog/2011/05/04/whats-wrong-with-mmm/)
>
> Pacemaker is the way to go. But, yes, it is difficult.  I hope my HOWTO helps.
>
>
> - Original Message -
> From: "Marcus Bointon" 
> To: linux-ha@lists.linux-ha.org
> Sent: Thursday, February 16, 2012 3:17:32 AM
> Subject: [Linux-HA] MMM conflict with Pacemaker
>
> I have 5 servers where 2 are running a redundant web front-end with pacemaker 
> (managing a single floating IP), two are running MySQL with mmm agents and 
> the last one is running the mmm monitor node. So at present there is no 
> overlap between these groups. I need to retire one of the web servers and its 
> functions will be moved to the machine currently doing mmm monitoring. Easier 
> said than done.
> If I install pacemaker (from the linux-ha PPA for Lucid, with empty initial
> config, as per the docs) and start its corosync service, mmm's monitor goes
> nuts and loses connectivity to the agents, causing them to drop their floating
> IP (even though it's not on the machines involved with pacemaker). I can
> appreciate that there is some overlap in functionality, but I don't see why
> it should conflict like this. Anyone got an explanation? Is anyone else
> running this combo?
>
> I've temporarily bypassed the front-end so I can work on this, so I'm clear 
> to start entirely from scratch. This is proving difficult too, since the 
> shifting terminology means documentation is mostly out of sync - of the three 
> guides I've tried so far, one doesn't mention ha.cf at all (others do, but 
> with obsolete options), one suggests doing everything with corosync (though 
> appears to be missing any config for pacemaker). One thing that would be very 
> helpful is something to explain the relative merits of ucast, bcast and mcast 
> options, as I suspect they may be part of the problem I'm seeing with mmm.
>
> (and I'm not looking to switch to DRBD!)
>
> Marcus
> --
> Marcus Bointon
> Synchromedia Limited: Creators of http://www.smartmessages.net/
> UK info@hand CRM solutions
> mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/
>
>
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems