Re: [ClusterLabs] Antw: Re: Random failure with clone of IPaddr2
I use the same corosync / pacemaker on three hosts, but hosts A & B have the same kernel and C is different:
A & B: 3.7.10-1.45-desktop
C: 4.1.15-8-default

> This works fine with:
> - 3.11.10-25-default / 3.11.10-29-default
> - 3.7.10-1.1-desktop / 3.11.10-29-default
> - 3.7.10-1.16-desktop / 3.11.10-21-default
>
> But it doesn't work with:
> - 3.7.10-1.45-desktop / 4.1.15-8-default

I upgraded host A to the 4.1 kernel and, of course, A & C can now talk together without problem. Sorry for the noise. The upgrade of host B is planned for today, of course.
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: Random failure with clone of IPaddr2
>>> I use same corosync / pacemaker on three host, but:
>>> Host A & B have same kernel, C is different:
>>> A & B : 3.7.10-1.45-desktop
>>> C: 4.1.15-8-default
>>
>> I don't know why you do that, but I'd either put C into standby, or put A
>> and B into standby and upgrade them one by one to the versions of C (undo
>> standby after that). Your configuration makes troubleshooting a challenge
>> (for you and others)!
>
> For pleasure ... as I like to rebuild rpms from source ... Joking aside, I
> started using corosync 3 years ago on A & B, and I need to upgrade these
> hosts without losing connection or load-balancing. The only way I found
> was to add a third host, so that I can upgrade the first one. But the
> third host I got does not run the old OS, so I rebuilt the rpms from
> source for A & B, and now I hit this wall. I can't put all services on
> host C. So no, it isn't really for pleasure ;-)
>
> After this night, I think the problem doesn't come from corosync /
> pacemaker but from the ipt_clusterip module ... I know that mixing
> different versions like this isn't a good idea, but I am trying every way
> I can find to get out of this.

I checked the other configurations I have.

This works fine with:
- 3.11.10-25-default / 3.11.10-29-default
- 3.7.10-1.1-desktop / 3.11.10-29-default
- 3.7.10-1.16-desktop / 3.11.10-21-default

But it doesn't work with:
- 3.7.10-1.45-desktop / 4.1.15-8-default

I fully understand that the problem doesn't come from corosync / pacemaker, but from the IP configuration of kernel 4.x, around ipt_clusterip: it looks as if this kernel doesn't choose the same "modulo" to decide whether or not to answer a packet. (I enabled the debug trace of the module.)

Upgrading a production environment is not easy, for sure ...
Re: [ClusterLabs] Cluster failure
> I'm fairly new to Pacemaker and have a few questions about the following
> log event and why resources were removed from my cluster.
> Right before the resources were killed with SIGTERM I noticed the
> following message:
> Dec 18 19:18:18 clusternode38.mf stonith-ng[10739]: notice: On loss of
> CCM Quorum: Ignore
>
> What exactly does this mean

stonith: Shoot The Other Node In The Head.
Quorum: just like democracy. If you have more than 2 nodes, a majority (half of the nodes + 1) decides whether a node must be shot or not. As you have only 2 nodes, quorum is disabled (loss of quorum is ignored).

> my resources recovered after a few minutes and
> did not fail over any idea what's going on here?

Can you post your crm configuration?

> or documentation I can read that explains what exactly happened?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch05.html
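To make the "half of the nodes + 1" rule concrete, here is a minimal Python sketch of a majority check. The function name `has_quorum` is invented for illustration and is not a Pacemaker API; it assumes one vote per node.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Majority rule: strictly more than half of all votes must be present."""
    return votes_present > total_votes // 2


# Three nodes: losing one node keeps quorum, losing two does not.
print(has_quorum(2, 3), has_quorum(1, 3))
# Two nodes: a single surviving node can never hold a majority,
# which is why two-node clusters are usually told to ignore quorum loss.
print(has_quorum(1, 2))
```

With only two nodes, the survivor never reaches a majority, so a cluster of that size typically runs with quorum loss ignored, which matches the log message quoted above.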
Re: [ClusterLabs] Random failure with clone of IPaddr2
I changed my setting from clusterip_hash="sourceip-sourceport" to clusterip_hash="sourceip", and tried to ping.

From one host (not a node) on the network, I get no answer.
From another host (not a node) on the network I get:

PING 10.0.0.97 (10.0.0.97) 56(84) bytes of data.
64 bytes from 10.0.0.97: icmp_seq=1 ttl=64 time=0.310 ms
64 bytes from 10.0.0.97: icmp_seq=1 ttl=64 time=0.320 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=2 ttl=64 time=0.184 ms
64 bytes from 10.0.0.97: icmp_seq=2 ttl=64 time=0.544 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=3 ttl=64 time=0.144 ms
64 bytes from 10.0.0.97: icmp_seq=3 ttl=64 time=0.173 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=4 ttl=64 time=0.169 ms
64 bytes from 10.0.0.97: icmp_seq=4 ttl=64 time=0.627 ms (DUP!)

So to me it looks just like one node answers when it is not its turn, and doesn't answer when it is its turn. No?

>> 3 Nodes A B C.
>> If resource on:
>> A + B => ok
>> Only A => ok
>> Only B => ok
>> Only C => ok
>> A + C => random fail
>> B + C => random fail
>> A + B + C => random fail
>
> I use same corosync / pacemaker on three host, but:
> Host A & B have same kernel, C is different:
> A & B : 3.7.10-1.45-desktop
> C: 4.1.15-8-default
> I use same version, but no same binary:
> pacemaker-1.1.13-12.2.x86_64
> corosync-2.3.5-4.2.x86_64
> On C host it's native rpm. I use the src-rpm to rebuild it for host A & B.
> I check all the sysctl settings, but see no difference ...
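The ping results above (silence for one client, duplicates for another) are what one would expect if two nodes compute a different hash for the same source address. The following is only a Python sketch of that idea: `hash_old` and `hash_new` are invented stand-ins, not the hash functions of any real kernel, and the client addresses are made up.

```python
import ipaddress
from collections import Counter


def hash_old(src: int) -> int:
    # Hypothetical stand-in for the hash computed on the older kernel.
    return src


def hash_new(src: int) -> int:
    # Hypothetical stand-in for a different hash on the newer kernel.
    return src // 2


def answers(src_ip, local_node, total_nodes, hash_fn):
    """True if this node believes it is responsible for src_ip."""
    src = int(ipaddress.ip_address(src_ip))
    return hash_fn(src) % total_nodes + 1 == local_node


def outcome(src_ip, node_hashes):
    """node_hashes maps a local_node number to the hash used on that node."""
    total = len(node_hashes)
    replies = sum(answers(src_ip, node, total, fn)
                  for node, fn in node_hashes.items())
    return {0: "no answer", 1: "ok"}.get(replies, "duplicate")


def survey(node_hashes, clients):
    return Counter(outcome(ip, node_hashes) for ip in clients)


clients = [f"10.0.0.{x}" for x in range(1, 9)]  # made-up client addresses
same = survey({1: hash_old, 2: hash_old}, clients)
mixed = survey({1: hash_old, 2: hash_new}, clients)
print("same hash on both nodes:", dict(same))
print("different hash per node:", dict(mixed))
```

When both nodes use the same function, every client gets exactly one reply. As soon as the functions differ, some clients fall into a bucket nobody claims and others into a bucket both nodes claim. With hashing on the source IP only, the result is stable per client, which fits one host never getting an answer while another always sees duplicates.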
Re: [ClusterLabs] Random failure with clone of IPaddr2
> ... For me, this work with arp multicast, who give same "virtual" arp
> to different hosts

Every host in the cluster gets the request, and a modulo chooses which one answers. That is just how I understand this shared IP.
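The "every host gets the request, a modulo chooses who answers" idea can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` is a stand-in hash chosen for convenience (the kernel module uses its own hash), and the function names and the client address are assumptions. The parameters `total_nodes`, `local_node` and `hash_init` mirror the fields visible in the CLUSTERIP rule shown elsewhere in this thread.

```python
import zlib


def responsible_node(src_ip, total_nodes, hash_init=0, src_port=None):
    """Return the 1-based node number that should answer this packet."""
    # With src_port given, this mimics hashing on source IP and source port;
    # without it, hashing on the source IP only.
    key = src_ip if src_port is None else f"{src_ip}:{src_port}"
    # Stand-in hash, NOT the one used by ipt_clusterip.
    return zlib.crc32(key.encode(), hash_init) % total_nodes + 1


def node_answers(local_node, src_ip, total_nodes, hash_init=0, src_port=None):
    """Each node runs the same computation and keeps only its own share."""
    return responsible_node(src_ip, total_nodes, hash_init, src_port) == local_node


# Every node sees every packet; exactly one of them should claim it.
for port in (40000, 40001, 40002):
    who = [n for n in (1, 2, 3)
           if node_answers(n, "10.0.0.5", 3, src_port=port)]
    print(port, "answered by node", who)
```

The scheme only works if all nodes compute the same value for the same packet. No node checks what the others decided, so any disagreement shows up as a lost or duplicated reply.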
Re: [ClusterLabs] Random failure with clone of IPaddr2
> Maybe I'm missing something here, and if so, my apologies, but to me it
> looks like you are trying to put the same IP address on three different
> machines SIMULTANEOUSLY.

Yes, that is what I do. But it seems normal to me: I just followed a guide like
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_clone_the_ip_address.html
and it works fine in a 2-node configuration.

As I understand it, this works with multicast ARP, which gives the same "virtual" ARP to different hosts, combined with the special iptables CLUSTERIP rule (to put it very briefly).

Maybe I totally misunderstand how it works, but it has worked fine for me for the last 4 years, so ...?
Re: [ClusterLabs] Random failure with clone of IPaddr2
> 3 Nodes A B C.
> If resource on:
> A + B => ok
> Only A => ok
> Only B => ok
> Only C => ok
> A + C => random fail
> B + C => random fail
> A + B + C => random fail

I use the same corosync / pacemaker on three hosts, but hosts A & B have the same kernel and C is different:
A & B: 3.7.10-1.45-desktop
C: 4.1.15-8-default

I use the same version, but not the same binary:
pacemaker-1.1.13-12.2.x86_64
corosync-2.3.5-4.2.x86_64

On host C it is the native rpm. I used the src-rpm to rebuild it for hosts A & B.
I checked all the sysctl settings, but I see no difference ...
Re: [ClusterLabs] Random failure with clone of IPaddr2
Hi,

My problem is still here. I have searched but found nothing. I changed the network cables to put the 3 hosts together on the same switch, but the problem is the same.

So with this:

primitive ip_apache_localnet ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.99" \
    cidr_netmask="32" op monitor interval="30s"
clone cl_ip_apache_localnet ip_apache_localnet \
    meta globally-unique="true" clone-max="3" clone-node-max="3"

3 nodes A B C.
If the resource is on:
A + B => ok
Only A => ok
Only B => ok
Only C => ok
A + C => random fail
B + C => random fail
A + B + C => random fail

When I say random fail: I do a curl http://10.0.0.99. I can see the request with tcpdump. I can reach all three hosts. But about 1 time in 6 or 7, the curl request hangs. I see with tcpdump that the request comes in, but no host answers. I suspect host C but can't find why it doesn't do the job. If I press ctrl-c and redo the request, I get an answer. I checked all firewalls and logs and don't see any error message.

If someone has a clue, it is very welcome!
[ClusterLabs] Random failure with clone of IPaddr2
Hi,

I have had some trouble for a week now and can't find a solution by myself. Any help will be really appreciated!

I have used corosync / pacemaker for 3 or 4 years and everything works well, for failover or load-balancing.

I have a shared IP between 3 servers, and need to remove one for an upgrade. But after I removed the server from the cluster I got random failures accessing my shared IP. I first thought that some packets still wanted to go to the old server. So I put it back in the cluster; I can reach it, but the random failure is still here :-/

My test is just a curl http://my_ip (or ssh, same thing: random failure to connect). A ping doesn't lose any packet. I can reach each of the three servers, but sometimes the request hangs and ends in a timeout. I see via tcpdump the packet coming in, and being resent, but no one responds.

How can I diagnose this? I estimate that one request in five fails. But I don't see any message in the firewall or /var/log/message, nothing, just as if the switch chose to drop random packets. I don't see any counter moving on the network interfaces. I checked the iptables settings, rechecked the logs, rechecked all firewalls ... Where do these packets go??

I tried with another, new IP, and the same problem happens. I tried IPs on two different subnets (10.xxx and an external IP), same thing.

I have no problem with a virtual IP in failover mode.

If someone has any clue ...
Re: [ClusterLabs] Random failure with clone of IPaddr2
> On 12/15/2016 02:02 PM, al...@amisw.com wrote:
>> primitive ip_apache_localnet ocf:heartbeat:IPaddr2 params ip="10.0.0.99"
>> cidr_netmask="32" op monitor interval="30s"
>> clone cl_ip_apache_localnet ip_apache_localnet \
>> meta globally-unique="true" clone-max="3" clone-node-max="1"
>
> ^^^ Here you have clone-node-max="1", which will prevent surviving nodes
> from picking up any failed node's share of requests. clone-max and
> clone-node-max should both stay at 3, regardless of whether you are
> intentionally taking down any node.

Thank you for the tip. It doesn't explain my problem, but it helps me: I can reach all 3 nodes with my curl request, but sometimes one does not respond. And as it doesn't answer, I don't know which one doesn't answer :-)

But with clone-node-max at 3, I can play at moving my resource around, and I see that the problem happens when the resource is on one specific node. Not the one I want to remove.

So 2 nodes work well, and one node sometimes doesn't answer. I know that this node isn't on the same switch as the first two; there is another switch in between (3 interconnected switches). Can the multicast ARP get lost on the way? If not, it must be a firewall / sysctl difference between the hosts ...
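The effect of clone-node-max described in the quoted reply can be illustrated with a small Python sketch. It is a simplified model written for this thread, not Pacemaker's actual placement algorithm; the function name `place_instances` and the node names are assumptions.

```python
def place_instances(clone_max, clone_node_max, live_nodes):
    """Spread clone instances over live nodes, at most clone_node_max each.

    Simplified model: each instance goes to the least-loaded node that still
    has room. Instances that fit nowhere are returned as unplaced.
    """
    placement = {node: [] for node in live_nodes}
    unplaced = []
    for instance in range(clone_max):
        candidates = [n for n in live_nodes
                      if len(placement[n]) < clone_node_max]
        if not candidates:
            unplaced.append(instance)
            continue
        target = min(candidates, key=lambda n: len(placement[n]))
        placement[target].append(instance)
    return placement, unplaced


# Node C is down; only A and B survive.
print(place_instances(3, 1, ["A", "B"]))  # one instance has nowhere to run
print(place_instances(3, 3, ["A", "B"]))  # survivors absorb the third share
```

In this model, with clone-node-max="1" and one node down, one of the three instances cannot run anywhere, so its share of the requests goes unanswered. With clone-node-max="3" a surviving node takes it over.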
Re: [ClusterLabs] Random failure with clone of IPaddr2
> Seeing your configuration might help. Did you set globally-unique=true
> and clone-node-max=3 on the clone? If not, the other nodes can't pick up
> the lost node's share of requests.

Yes for both: I have globally-unique=true, and I changed clone-node-max=3 to clone-node-max=2. Now that I have come back to the old configuration, I am back to clone-node-max=3.

So now I have three nodes in the cluster. Here is my config:

primitive ip_apache_localnet ocf:heartbeat:IPaddr2 params ip="10.0.0.99" cidr_netmask="32" op monitor interval="30s"
clone cl_ip_apache_localnet ip_apache_localnet \
    meta globally-unique="true" clone-max="3" clone-node-max="1" target-role="Started" is-managed="true"

sudo /usr/sbin/iptables -L
CLUSTERIP all -- anywhere 10.0.0.99 CLUSTERIP hashmode=sourceip-sourceport clustermac=A1:99:D6:EA:43:77 total_nodes=3 local_node=2 hash_init=0

I checked that I have a different local_node on each node.

And just a question: is this MAC address "normal"? Doesn't it need to begin with 01-00-5E?