Re: [ClusterLabs] Antw: Re: Random failure with clone of IPaddr2

2016-12-23 Thread alian
 I use the same corosync / pacemaker on three hosts, but:
 Hosts A & B have the same kernel, C is different:
 A & B : 3.7.10-1.45-desktop
 C: 4.1.15-8-default

> This works fine with:
> - 3.11.10-25-default / 3.11.10-29-default
> - 3.7.10-1.1-desktop / 3.11.10-29-default
> - 3.7.10-1.16-desktop / 3.11.10-21-default
>
> But it doesn't work with:
> - 3.7.10-1.45-desktop / 4.1.15-8-default

I upgraded host A to the 4.1 kernel, and A & C now talk together without
problem, of course. Sorry for the noise. The host B upgrade is for today, of
course.


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Random failure with clone of IPaddr2

2016-12-20 Thread alian
>>> I use same corosync / pacemaker on three host, but:
>>> Host A & B have same kernel, C is different:
>>> A & B : 3.7.10-1.45-desktop
>>> C: 4.1.15-8-default
>>
>> I don't know why you do that, but I'd either put C into standby, or put A
>> and B into standby and upgrade them one by one to the versions of C (undo
>> standby after that). Your configuration makes troubleshooting a challenge
>> (for you and others)!
>
> For pleasure ... as I like rebuilding rpms from source ... Joking aside, I
> started using corosync 3 years ago on A & B, but I need to upgrade these
> hosts without losing connections or load-balancing. So the only way I
> found was to add a third host in order to upgrade the first one. But when I
> got my third host it didn't have the old OS, so I rebuilt the rpms from
> source for A & B, and now I've hit this wall. I can't put all services on
> host C. So no, it isn't really for pleasure ;-)
>
> And after last night, I think the problem doesn't come from corosync /
> pacemaker but from the ipt_clusterip module ... I know this mixed
> configuration isn't a good idea, but I'm trying to find any way out of
> this.
>

I checked the other configurations I have.
This works fine with:
- 3.11.10-25-default / 3.11.10-29-default
- 3.7.10-1.1-desktop / 3.11.10-29-default
- 3.7.10-1.16-desktop / 3.11.10-21-default

But it doesn't work with:
- 3.7.10-1.45-desktop / 4.1.15-8-default

I fully understand that the problem doesn't come from the corosync / pacemaker
stuff, but from the IP configuration of the 4.x kernel, around ipt_clusterip:
it is just as if this kernel didn't choose the same "modulo" to decide whether
or not to answer a packet. (I enabled the debug trace of the module.)

The upgrade process in a production environment is not an easy thing, for sure
...





Re: [ClusterLabs] Cluster failure

2016-12-20 Thread alian
> I'm fairly new to Pacemaker and have a few questions about
>
> The following log event and why resources was removed from my cluster
> Right before the resources being killed SIGTERM I notice the following
> message.
> Dec 18 19:18:18 clusternode38.mf stonith-ng[10739]:   notice: On loss of
> CCM Quorum: Ignore
>
> What exactly does this mean

stonith: Shoot The Other Node In The Head
Quorum: just like democracy: if you have more than 2 nodes, a majority (half
the nodes + 1) decides whether a node must be shot or not.
As you have only 2 nodes, quorum is disabled.
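
To make the "half the nodes + 1" rule concrete, here is a minimal Python
sketch. The function names are made up for illustration and are not part of
Pacemaker or corosync; it only shows the majority arithmetic.

```python
def quorum_threshold(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """True when a partition that sees `visible_nodes` holds a strict majority."""
    return visible_nodes >= quorum_threshold(total_nodes)


if __name__ == "__main__":
    for total in (2, 3, 5):
        print(total, "nodes: quorum needs", quorum_threshold(total))
```

With 2 nodes the threshold is 2, so a single surviving node can never hold a
majority. That is why quorum loss is usually ignored on two-node clusters.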

> my resources recovered after a few minutes and
> did not fail over any idea what's going on here?

Can you post your crm configuration ?

> or documentation I can read that explains what exactly happened?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/ch05.html




Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread alian
I changed my setting from clusterip_hash="sourceip-sourceport" to
clusterip_hash="sourceip".
And tried to ping.
From one host (not a node) on the network, I get no answer.
From another host (not a node) on the network I get:
PING 10.0.0.97 (10.0.0.97) 56(84) bytes of data.
64 bytes from 10.0.0.97: icmp_seq=1 ttl=64 time=0.310 ms
64 bytes from 10.0.0.97: icmp_seq=1 ttl=64 time=0.320 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=2 ttl=64 time=0.184 ms
64 bytes from 10.0.0.97: icmp_seq=2 ttl=64 time=0.544 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=3 ttl=64 time=0.144 ms
64 bytes from 10.0.0.97: icmp_seq=3 ttl=64 time=0.173 ms (DUP!)
64 bytes from 10.0.0.97: icmp_seq=4 ttl=64 time=0.169 ms
64 bytes from 10.0.0.97: icmp_seq=4 ttl=64 time=0.627 ms (DUP!)

So to me it looks like one node answers when it is not its turn and
doesn't answer when it is its turn. No?
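
The symptom in the ping output above (no answer from one client, duplicates
for another) is what you would expect if the nodes do not agree on which of
them owns a given source address. The Python sketch below is only a toy
model: crc32 stands in for the kernel hash, and the idea that node 3 maps the
hash to a node number by multiply-and-shift instead of modulo is purely an
assumption for illustration. Nothing in this thread confirms what actually
differs between the kernels.

```python
import ipaddress
import zlib

TOTAL_NODES = 3


def hash32(src_ip):
    """Toy 32-bit hash of the source IP (stand-in, not the kernel's hash)."""
    return zlib.crc32(ipaddress.ip_address(src_ip).packed)


def node_by_modulo(h):
    """Map a 32-bit hash to a node number 1..TOTAL_NODES with a modulo."""
    return h % TOTAL_NODES + 1


def node_by_scaling(h):
    """Hypothetical alternative mapping: scale the hash into 1..TOTAL_NODES."""
    return ((h * TOTAL_NODES) >> 32) + 1


def responders(src_ip, mapping_per_node):
    """Return the nodes that believe a packet from src_ip is theirs to answer."""
    h = hash32(src_ip)
    return [node for node, mapping in sorted(mapping_per_node.items())
            if mapping(h) == node]


# All nodes use the same mapping: exactly one node answers every client.
SAME = {1: node_by_modulo, 2: node_by_modulo, 3: node_by_modulo}
# Node 3 uses a different mapping: some clients get 0 or 2 answers.
MIXED = {1: node_by_modulo, 2: node_by_modulo, 3: node_by_scaling}

if __name__ == "__main__":
    labels = {0: "no answer", 1: "ok", 2: "DUP"}
    for last_octet in range(1, 11):
        src = f"10.0.0.{last_octet}"
        answered_by = responders(src, MIXED)
        print(src, answered_by, labels[len(answered_by)])
```

In the SAME case every client gets exactly one responder. In the MIXED case a
client can get none, one, or two, depending only on its source address.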


>> 3 Nodes A B C.
>> If resource on:
>> A + B => ok
>> Only A => ok
>> Only B => ok
>> Only C => ok
>> A + C => random fail
>> B + C => random fail
>> A + B + C => random fail
>
> I use same corosync / pacemaker on three host, but:
> Host A & B have same kernel, C is different:
> A & B : 3.7.10-1.45-desktop
> C: 4.1.15-8-default
> I use same version, but no same binary:
> pacemaker-1.1.13-12.2.x86_64
> corosync-2.3.5-4.2.x86_64
> On C host it's native rpm. I use the src-rpm to rebuild it for host A & B.
> I check all the sysctl settings, but see no difference ...
>
>





Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread alian
> ... For me, this work with arp multicast, who give same "virtual" arp
> to different hosts

Every host in the cluster gets the request, and a modulo chooses which one
answers. That's just how I understand this shared IP.
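
As a rough picture of that idea, here is a small Python sketch of how each
node could decide locally whether a packet is its own. It is an assumption
for illustration, not the ipt_CLUSTERIP code: crc32 stands in for the real
hash, and only the two hash modes named in this thread (sourceip and
sourceip-sourceport) are modelled.

```python
import zlib


def owner(src_ip, src_port, total_nodes, hashmode="sourceip-sourceport"):
    """Return the node number (1..total_nodes) that should answer this packet."""
    if hashmode == "sourceip":
        key = src_ip
    else:
        key = f"{src_ip}:{src_port}"
    return zlib.crc32(key.encode()) % total_nodes + 1


def accepts(local_node, src_ip, src_port, total_nodes,
            hashmode="sourceip-sourceport"):
    """Every node sees the packet; only the one whose number matches answers."""
    return owner(src_ip, src_port, total_nodes, hashmode) == local_node


if __name__ == "__main__":
    for port in (40000, 40001, 40002, 40003):
        print(port,
              "sourceip ->", owner("10.0.0.5", port, 3, "sourceip"),
              "sourceip-sourceport ->", owner("10.0.0.5", port, 3))
```

With sourceip a given client always lands on the same node. With
sourceip-sourceport successive connections from one client can land on
different nodes, which would fit a failure that only hits some requests.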





Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread alian
> Maybe I'm missing something here, and if so, my apologies, but to me it
> looks like you are trying to put the same IP address on three different
> machines SIMULTANEOUSLY.

Yes, that's what I do. But it seems normal to me, I just followed a guide like
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_clone_the_ip_address.html

and it works fine in a 2-node configuration. As I understand it, this works
with multicast ARP, which gives the same "virtual" ARP to different hosts,
and with a special iptables CLUSTERIP rule (to put it very briefly). Maybe I
totally misunderstand the stuff, but it has worked fine for me for the last
4 years, so ... ?




Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread alian
> 3 Nodes A B C.
> If resource on:
> A + B => ok
> Only A => ok
> Only B => ok
> Only C => ok
> A + C => random fail
> B + C => random fail
> A + B + C => random fail

I use the same corosync / pacemaker on all three hosts, but:
Hosts A & B have the same kernel, C is different:
A & B : 3.7.10-1.45-desktop
C: 4.1.15-8-default
I use the same versions, but not the same binaries:
pacemaker-1.1.13-12.2.x86_64
corosync-2.3.5-4.2.x86_64
On host C it's the native rpm. I used the src-rpm to rebuild it for hosts A & B.
I checked all the sysctl settings, but see no difference ...





Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread alian
Hi,

My problem is still here. I searched but found nothing. I tried changing the
network cabling to put the 3 hosts together on the same switch, but same problem.

So with this:

primitive ip_apache_localnet ocf:heartbeat:IPaddr2 \
  params ip="10.0.0.99" \
  cidr_netmask="32" op monitor interval="30s"
clone cl_ip_apache_localnet ip_apache_localnet \
  meta globally-unique="true" clone-max="3" clone-node-max="3"

3 Nodes A B C.
If resource on:
A + B => ok
Only A => ok
Only B => ok
Only C => ok
A + C => random fail
B + C => random fail
A + B + C => random fail

When I say random fail: I do a curl http://10.0.0.99. I can see the request
with tcpdump. I can reach all three hosts. But about 1 time in 6 or 7, the
curl request hangs. I see with tcpdump that the request comes in, but no host
answers. I suspect host C but can't find why it doesn't do the job. If I
ctrl-c & redo the request, I get an answer.
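
One way to make this test repeatable is to script it and record the source
port of every attempt, so that hanging connections can be matched against the
tcpdump capture. This is a minimal Python sketch using only the standard
socket module; the function name probe and its defaults are made up for
illustration.

```python
import socket


def probe(host, port, attempts=20, timeout=2.0):
    """Open `attempts` TCP connections; return a list of (source_port, ok)."""
    results = []
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # Bind first so the source port is known even if connect() hangs.
            sock.bind(("", 0))
            src_port = sock.getsockname()[1]
            try:
                sock.connect((host, port))
                ok = True
            except OSError:
                # Covers timeouts as well as refused or unreachable targets.
                ok = False
            results.append((src_port, ok))
        finally:
            sock.close()
    return results
```

For example, probe("10.0.0.99", 80) returns one entry per attempt. The source
ports of the failed entries show which hash inputs went unanswered.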

I checked all firewalls / logs and don't see any error message. If someone has
a clue, it is very welcome!




[ClusterLabs] Random failure with clone of IPaddr2

2016-12-16 Thread alian
Hi,

I've had some trouble for a week and can't find a solution by myself. Any
help would be really appreciated!
I have used corosync / pacemaker for 3 or 4 years and everything works well,
for failover or load-balancing.

I have a shared IP between 3 servers, and need to remove one for an upgrade.
But after I removed the server from the cluster I got random failures when
accessing my shared IP. I first thought that some packets wanted to go to the
old server. So I put it back in the cluster and can reach it, but the random
failures are still here :-/

My test is just a curl http://my_ip (or ssh, same thing: random failures to
connect).
A ping doesn't lose any packets.
I can reach each of the three servers, but sometimes the request hangs and
gets a timeout.
I see via tcpdump the packet coming in, and being resent, but no one responds.
How can I diagnose this?
I think one request in five fails. But I don't see any messages in the
firewall or /var/log/messages, nothing, just as if the switch chose to drop
random packets. I don't see any error counters on the network interfaces; I
checked the iptables settings, rechecked the logs, rechecked all firewalls ...
Where do these packets go??

I tried with another new IP, and the same problem happens. I tried IPs on two
different subnets (10.xxx and an external IP) and same thing.

I have no problem with virtual IPs in failover mode.

If someone has any clue ...





Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-15 Thread alian
> On 12/15/2016 02:02 PM, al...@amisw.com wrote:

>> primitive ip_apache_localnet ocf:heartbeat:IPaddr2 params ip="10.0.0.99"
>> cidr_netmask="32" op monitor interval="30s"
>> clone cl_ip_apache_localnet ip_apache_localnet \
>> meta globally-unique="true" clone-max="3" clone-node-max="1"
>
>
> ^^^ Here you have clone-node-max="1", which will prevent surviving nodes
> from picking up any failed node's share of requests. clone-max and
> clone-node-max should both stay at 3, regardless of whether you are
> intentionally taking down any node.

Thank you for the tip. It doesn't explain my problem, but it helps me: I
can reach all 3 nodes with my curl request, but sometimes one does not
respond. And as it doesn't answer, I don't know which one doesn't answer :-)
But with clone-node-max at 3, I can move my resource around, and I see that
the problem happens when the resource is on one specific node. Not the one I
want to remove.
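
The tip about clone-node-max can be put into numbers. The Python sketch below
uses a deliberately simplified model (an assumption, not Pacemaker's real
placement logic, which also depends on scores and constraints): each clone
instance owns one share of the traffic, and a node can host at most
clone-node-max instances.

```python
from fractions import Fraction


def running_instances(clone_max, clone_node_max, alive_nodes):
    """How many clone instances can run on the nodes that are still up."""
    return min(clone_max, alive_nodes * clone_node_max)


def unanswered_share(clone_max, clone_node_max, alive_nodes):
    """Fraction of the traffic shares that no running instance covers."""
    running = running_instances(clone_max, clone_node_max, alive_nodes)
    return Fraction(clone_max - running, clone_max)


if __name__ == "__main__":
    for node_max in (1, 3):
        print("clone-node-max =", node_max,
              "with 2 of 3 nodes up: unanswered share =",
              unanswered_share(3, node_max, 2))
```

In this model, with clone-max=3 and one node down, clone-node-max=1 leaves one
third of the shares without an owner, while clone-node-max=3 lets the two
remaining nodes cover everything.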

So 2 nodes work well, and one node sometimes doesn't answer.
I know that this node isn't on the same switch as the first two; there is
another switch in between (3 interconnected switches). Can the multicast ARP
get lost on the way?

If not, this is a firewall / sysctl difference between hosts ...





Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-15 Thread alian
>
> Seeing your configuration might help. Did you set globally-unique=true
> and clone-node-max=3 on the clone? If not, the other nodes can't pick up
> the lost node's share of requests.

Yes for both: I have globally-unique=true, and I changed clone-node-max=3
to clone-node-max=2, and now, as I went back to the old configuration, I am
back to clone-node-max=3.

So now I have three nodes in the cluster.
Here is my config:

primitive ip_apache_localnet ocf:heartbeat:IPaddr2 \
  params ip="10.0.0.99" \
  cidr_netmask="32" op monitor interval="30s"
clone cl_ip_apache_localnet ip_apache_localnet \
  meta globally-unique="true" clone-max="3" clone-node-max="1" \
  target-role="Started" is-managed="true"

 sudo  /usr/sbin/iptables -L
CLUSTERIP  all  --  anywhere 10.0.0.99  CLUSTERIP
hashmode=sourceip-sourceport clustermac=A1:99:D6:EA:43:77 total_nodes=3
local_node=2 hash_init=0

and I checked that I have a different local_node on each node.
And just a question: is the MAC address "normal"? Doesn't it need to begin
with 01-00-5E?
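
On the MAC question, a quick check is possible. In Ethernet addressing the
multicast (group) bit is the least significant bit of the first octet, while
01-00-5E is only the prefix used for MAC addresses derived from IPv4
multicast groups. The Python sketch below tests both properties; the helper
names are made up for illustration.

```python
def parse_mac(mac):
    """Parse 'A1:99:D6:EA:43:77' or 'A1-99-D6-EA-43-77' into 6 bytes."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def is_multicast_mac(mac):
    """True if the group bit (lowest bit of the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x01)


def is_ipv4_multicast_mac(mac):
    """True if the address lies in the 01-00-5E range mapped from IPv4 groups."""
    octets = parse_mac(mac)
    return octets[:3] == bytes([0x01, 0x00, 0x5E]) and octets[3] < 0x80


if __name__ == "__main__":
    cluster_mac = "A1:99:D6:EA:43:77"
    print(cluster_mac, "multicast:", is_multicast_mac(cluster_mac))
    print(cluster_mac, "01-00-5E range:", is_ipv4_multicast_mac(cluster_mac))
```

The clustermac shown in the iptables output has its group bit set (0xA1 is
odd), so it is a link-layer multicast address even though it does not start
with 01-00-5E.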



