[ClusterLabs] Corosync 3.0.3 is available at corosync.org!

2019-11-25 Thread Jan Friesse
): cpg: notify_lib_joinlist: drop conn parameter cpg: send single confchg event per group on joinlist Fabio M. Di Nitto (2): build: add option for enabling sanitizer builds pkgconfig: Add libqb dependency Jan Friesse (17): cpg: Add more comments to notify_lib_joinlist

Re: [ClusterLabs] Maximum number of nodes support in single cluster

2019-11-24 Thread Jan Friesse
Ken Gaillot wrote: On Fri, 2019-11-22 at 07:32 +, S Sathish S wrote: Hi Team, In Clusterlab below pacemaker and corosync version what is maximum cluster nodes its supported ? will it support for 120 nodes in single cluster? corosync-2.4.4

Re: [ClusterLabs] reducing corosync-qnetd "response time"

2019-10-25 Thread Jan Friesse
Sherrard Burton wrote: On 10/24/19 1:30 PM, Andrei Borzenkov wrote: 24.10.2019 16:54, Sherrard Burton wrote: background: we are upgrading a (very) old HA cluster running heartbeat DRBD and NFS, with no stonith, to a much more modern implementation. for the existing cluster, as well as

Re: [ClusterLabs] Definition of function pointer cpg_totem_confchg_fn_t

2019-10-15 Thread Jan Friesse
DaShi, typedef void (*cpg_totem_confchg_fn_t) ( cpg_handle_t handle, struct cpg_ring_id ring_id, uint32_t member_list_entries, const uint32_t *member_list); Should "struct cpg_ring_id ring_id" be "struct cpg_ring_id *ring_id"? Nope, ringid is really passed

Re: [ClusterLabs] Corosync unable to start after rpm generated but when we compiled from source code is working as expected

2019-10-02 Thread Jan Friesse
Sathish, Hi Team, We have taken corosync source code from below git repository and compile and installed able to start corosync service as expected. https://github.com/corosync/corosync/tree/v1.4.8 Any reason to use 1.4.8? If you really want to stay on unsupported Corosync 1 branch, then

Re: [ClusterLabs] Pacemaker 1.1.12 does not compile with CMAN Stack.

2019-09-16 Thread Jan Friesse
Somanath, Hi , Thanks for your extended support and advise. I would like to give some background information about my exercise, which will give some fair idea to you folks. We are planning to use pacemaker + Corosync + CMAN stack in RHEL 6.5 OS. For the same, we are trying to collect the

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Jan Friesse
rds, Honza On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse wrote: Jeevan, Jeevan Patnaik wrote: Hi, Also, both are physical machines. On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik wrote: Hi, We see the following messages almost everyday in our 2 node cluster and resources gets migrated whe

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-02 Thread Jan Friesse
Jeevan, Jeevan Patnaik wrote: Hi, Also, both are physical machines. On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik wrote: Hi, We see the following messages almost everyday in our 2 node cluster and resources gets migrated when it happens: [16187] node1 corosync warning [MAIN ]

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Jan Friesse
Олег Самойлов wrote: On 13 Aug 2019, at 15:55, Jan Friesse wrote: There is going to be a slightly different solution (set these timeouts based on the corosync token timeout) which I'm working on, but it's kind of a huge amount of work and not super high prio (workaround exists), so no ETA

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Jan Friesse
Олег Самойлов wrote: On 12 Aug 2019, at 8:46, Jan Friesse wrote: Let me try to bring some light in there: - dpd_interval is a qnetd variable that sets how often qnetd walks through the list of all clients (qdevices) and checks the timestamp of the last sent message. If diff between current timestamp
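The periodic walk described in the snippet above can be pictured roughly as follows. This is only an illustrative Python sketch, not qnetd's actual code: the names `Client`, `find_stale_clients` and the `max_silence_ms` threshold are hypothetical, and the snippet is cut off before it states the real comparison rule, so the simple "silent for longer than a threshold" test is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Client:
    # Hypothetical stand-in for one connected qdevice client.
    name: str
    last_msg_ms: int  # timestamp of the last message, in milliseconds


def find_stale_clients(clients, now_ms, max_silence_ms):
    """One pass over all clients, as would run once per dpd_interval.

    Returns the names of clients whose last message is older than
    max_silence_ms (an assumed threshold, not taken from the source).
    """
    return [c.name for c in clients if now_ms - c.last_msg_ms > max_silence_ms]


clients = [Client("node1", 9_500), Client("node2", 4_000)]
stale = find_stale_clients(clients, now_ms=10_000, max_silence_ms=5_000)
print(stale)
```

In this picture dpd_interval only decides how often the pass runs; it does not control how often anything is sent, which is consistent with the advice in this thread not to set it very high.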

Re: [ClusterLabs] Restoring network connection breaks cluster services

2019-08-13 Thread Jan Friesse
Momcilo On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger wrote: On 8/7/19 12:26 PM, Momcilo Medic wrote: We have three node cluster that is setup to stop resources on lost quorum. Failure (network going down) handling is done properly, but recovery doesn't seem to work. What do you mean by

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-12 Thread Jan Friesse
Andrei Borzenkov wrote: Sent from iPhone On 12 Aug 2019, at 8:46, Jan Friesse wrote: Олег Самойлов wrote: On 9 Aug 2019, at 9:25, Jan Friesse wrote: Please do not set dpd_interval that high. dpd_interval on qnetd side is not about how often the ping is sent

Re: [ClusterLabs] Ubuntu 18.04 and corosync-qdevice

2019-08-12 Thread Jan Friesse
Nickle, Richard wrote: I've built a two-node DRBD cluster with SBD and STONITH, following advice from ClusterLabs, LinBit, Beekhof's blog on SBD. I still cannot get automated failover when I down one of the nodes. I thought that perhaps I needed to have an odd-numbered quorum so I

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-11 Thread Jan Friesse
Олег Самойлов wrote: On 9 Aug 2019, at 9:25, Jan Friesse wrote: Please do not set dpd_interval that high. dpd_interval on qnetd side is not about how often the ping is sent. Could you please retry your test with dpd_interval=1000? I'm pretty sure it will work then. Honza Yep

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-09 Thread Jan Friesse
Andrei Borzenkov wrote: On Fri, Aug 9, 2019 at 9:25 AM Jan Friesse wrote: Олег Самойлов wrote: Hello all. I have a test bed with several virtual machines to test pacemaker. I simulate random failure on one of the nodes. The cluster will be on several data centres, so

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-08-09 Thread Jan Friesse
Roger Zhou wrote: On 8/9/19 2:27 PM, Roger Zhou wrote: On 7/29/19 12:24 AM, Andrei Borzenkov wrote: corosync.service sets StopWhenUnneeded=yes which normally stops it when pacemaker is shut down. One more thought, Make sense to add "RefuseManualStop=true" to pacemaker.service? The same

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-09 Thread Jan Friesse
Олег Самойлов wrote: Hello all. I have a test bed with several virtual machines to test pacemaker. I simulate random failure on one of the nodes. The cluster will be on several data centres, so there is no stonith device, instead I use qnetd on the third data centre and watchdog

[ClusterLabs] Corosync 2.4.5 is available at corosync.org!

2019-07-30 Thread Jan Friesse
ap key name runtime.config.totem.token Hideo Yamauchi (1): corosync-notifyd: Delete registered tracking key Jan Friesse (43): cpg: Inform clients about left nodes during pause totemudpu: Pass correct param to totemip_nosigpipe util: Fix strncpy in setcs_name_t function

Re: [ClusterLabs] Node reset on shutdown by SBD watchdog with corosync-qdevice

2019-07-29 Thread Jan Friesse
On 7/28/19 6:35 PM, Andrei Borzenkov wrote: In two node cluster + qnetd I consistently see the node that is being shut down last being reset during shutdown. I.e. - shutdown the first node - OK - shutdown the second node - reset As far as I understand what happens is - during shutdown

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-07-29 Thread Jan Friesse
Andrei Andrei Borzenkov wrote: corosync.service sets StopWhenUnneeded=yes which normally stops it when This was the case only for a very limited time (v 3.0.1) and it's removed now (v 3.0.2) because it was causing more trouble than it was worth. Regards, Honza pacemaker is shut down.

Re: [ClusterLabs] Corosync 3.x and Pacemaker 2.x supported OS versions

2019-07-25 Thread Jan Friesse
Somanath, Hi All, We are planning to update to corosync 3.x and pacemaker 2.x versions. I want to confirm whether this is supported on RHEL 7.X OS version. Answering just the Corosync part of the question (even though pcmk is probably a very similar story). If "supported" means officially

Re: [ClusterLabs] 2 nodes split brain with token timeout

2019-07-23 Thread Jan Friesse
Jean-Jacques, Hello everyone, I'm having stability issue with a 2 nodes active/passive HA infrastructure (Zabbix VMs in this case). Daily backup create a latency, slowing Corosync scheduling and triggering a token timeout. It frequently ends up on a split brain issue, where service is

Re: [ClusterLabs] Two node cluster goes into split brain scenario during CPU intensive tasks

2019-06-24 Thread Jan Friesse
Somanath, Hi All, I have a two node cluster with multicast (udp) transport. The multicast IP used is 224.1.1.1. Would you mind giving UDPU (unicast) a try? For a two node cluster there is going to be no difference in terms of speed/throughput. Whenever there is a CPU intensive task

[ClusterLabs] FYI possibly incompatible change in Corosync CPG service in master

2019-06-13 Thread Jan Friesse
I would like to inform all developers who are using the Corosync CPG service about a possibly "incompatible" change in the Corosync master branch. Previously the joinlist sent by CPG always contained only 1 member. This behavior was not entirely correct so we fixed that so now only 1 CPG join list confchg

[ClusterLabs] Corosync 3.0.2 is available at corosync.org!

2019-06-12 Thread Jan Friesse
.ac: AC_PROG_SED is already present build: Use the SED variable provided by configure build: Use the AWK variable provided by configure Jan Friesse (32): doc: Update INSTALL file corosync-cfgtool: Fix -i matching quorumtool: Fix exit status codes configure:

Re: [ClusterLabs] Not all bind address belong to the same IP family - corosync fails

2019-05-14 Thread Jan Friesse
lejeczek, hi guys, I have a node which fails to start, but only at sys boot time to make it more curious, like this: Config would be helpful, but the reason is probably usage of DNS and RRP config. Together with a delay in DHCP (or DHCPv6) this may cause a problem. You can either change

Re: [ClusterLabs] corosync-qdevice[3772]: Heuristics worker waitpid failed (10): No child processes

2019-05-06 Thread Jan Friesse
Andrei, While testing corosync-qdevice I repeatedly got the above message. The reason seems to be startup sequence in corosync-qdevice. Consider: ● corosync-qdevice.service - Corosync Qdevice daemon Loaded: loaded (/etc/systemd/system/corosync-qdevice.service; disabled; vendor preset:

Re: [ClusterLabs] Timeout stopping corosync-qdevice service

2019-05-02 Thread Jan Friesse
Andrei, 30.04.2019 9:51, Jan Friesse wrote: Now, corosync-qdevice gets SIGTERM as "signal to terminate", but it installs SIGTERM handler that does not exit and only closes some socket. May be this should trigger termination of main loop, but somehow it does not. Yep, this is e

Re: [ClusterLabs] Corosync unable to reach consensus for membership

2019-05-02 Thread Jan Friesse
30, 2019 at 5:26 PM Jan Friesse wrote: Prasad, Hello : I have a 3 node corosync and pacemaker cluster and the nodes are: Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-azfw2-189 ] Slaves: [ SG

Re: [ClusterLabs] Corosync unable to reach consensus for membership

2019-04-30 Thread Jan Friesse
Prasad, Hello : I have a 3 node corosync and pacemaker cluster and the nodes are: Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-azfw2-189 ] Slaves: [ SG-azfw2-190 SG-azfw2-191 ] For my network

Re: [ClusterLabs] Timeout stopping corosync-qdevice service

2019-04-30 Thread Jan Friesse
Andrei, 29.04.2019 14:32, Jan Friesse wrote: Andrei, I setup qdevice in openSUSE Tumbleweed and while it works as expected I Is it corosync-qdevice or corosync-qnetd daemon? corosync-qdevice cannot stop it - it always results in timeout and service finally gets killed by systemd

Re: [ClusterLabs] Timeout stopping corosync-qdevice service

2019-04-29 Thread Jan Friesse
Andrei, I setup qdevice in openSUSE Tumbleweed and while it works as expected I Is it corosync-qdevice or corosync-qnetd daemon? cannot stop it - it always results in timeout and service finally gets killed by systemd. Is it a known issue? TW is having quite up-to-date versions, it usually

Re: [ClusterLabs] some comments on corosync.conf(5)

2019-04-23 Thread Jan Friesse
Hi! Reading the corosync.conf manual page of corosync 2.3.6 (SLES12 SP4), I have some random comments: The tag indent for "clear_node_high_bit" seems broken. Ack, should be fixed Shouldn't "hold" be "token_hold"? Why is "token_retransmits_before_loss_const" so long (why the "_const")?

Re: [ClusterLabs] corosync caused network breakdown

2019-04-08 Thread Jan Friesse
Sven, Hi, we were running a corosync config including 2 Rings for about 2.5 years on a two node NFS Cluster (active/passive). The first ring (ring 0) is configured on a dedicated NIC for Cluster internal communications. The second ring (ring 1) was configured in the interface where the NFS

Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?

2019-03-22 Thread Jan Friesse
Brian, I've followed several tutorials about setting up a simple three-node cluster, with no resources (yet), under CentOS 7. I've discovered the cluster won't restart upon rebooting a node. The other two nodes, however, do claim the cluster is up, as shown with 'pcs status cluster'. I

Re: [ClusterLabs] Interface confusion

2019-03-20 Thread Jan Friesse
On Tue, 2019-03-19 at 21:49 +0100, Adam Budziński wrote: Not sure I got “or (due to two_node / wait_for_all in corosync.conf) waits until it can see the other node before doing anything” right. I mean according to https://access.redhat.com/solutions/1295713 “The 'two_node' parameter sets the

Re: [ClusterLabs] Corosync rrp or bonding

2019-03-15 Thread Jan Friesse
support). Honza On Thu, 14.03.2019, 17:13, Jan Friesse wrote: Adam, What is more recommended for providing corosync/pacemaker link redundancy? > a)Configure two rings and rrp within corosync, if yes: - can we use two separate VLANs connected to two separate physi

Re: [ClusterLabs] Corosync rrp or bonding

2019-03-14 Thread Jan Friesse
Adam, What is more recommended for providing corosync/pacemaker link redundancy? > a)Configure two rings and rrp within corosync, if yes: - can we use two separate VLANs connected to two separate physical NIC for the heartbeat connections Yes - do they have to be the

Re: [ClusterLabs] Does corosync not like bridge interfaces?

2019-02-28 Thread Jan Friesse
lejeczek, hi everyone would there be some reason why cluster is not happy with bridge interfaces used for its own purposes, not as for resources? At least corosync should work with bridge without any problem. Regards, Honza And I mean 'not happy' in a non-obvious way, when it does not

Re: [ClusterLabs] redundant ring and corosync makes/sees it as loopback??

2019-02-28 Thread Jan Friesse
lejeczek, hi everyone My cluster faulted my secondary ring today and on one node I found this: Printing ring status. Local node ID 3 RING ID 0     id    = 10.5.8.65     status    = ring 0 active with no faults RING ID 1     id    = 127.0.0.1     status    = Marking ringid 1 interface

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Jan Friesse
Edwin Hi, I've done some more tests and I am now able to reproduce the 100% CPU usage / infinite loop of corosync/libqb on all the kernels that I tested. Therefore I think this is NOT a regression in 4.19 kernel. That's good A suitable workaround for 4.19+ kernels is to run this on startup

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-21 Thread Jan Friesse
Edwin, On 20/02/2019 13:08, Jan Friesse wrote: Edwin Török wrote: On 20/02/2019 07:57, Jan Friesse wrote: Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Friesse
Edwin Török wrote: On 20/02/2019 07:57, Jan Friesse wrote: Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote: Did a test today with CentOS 7.6 with upstream

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Jan Friesse
Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote: Did a test today with CentOS 7.6 with upstream kernel and with 4.20.10-1.el7.elrepo.x86_64 (tested both with

Re: [ClusterLabs] Documentation for Corosync

2019-02-19 Thread Jan Friesse
Guido, Hi: My name's Guido and I'm working at an ISP provider. We are deploying a cluster using Corosync as cluster engine and We're very happy about how's it work. I just want to get more deep about how corosync works at all. I could not find specific material on the web, just how to configure

Re: [ClusterLabs] Antw: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Jan Friesse
Ulrich Windl wrote: Hi! IMHO any process running at real-time priorities must make sure that it consumes the CPU only for short moments that are really critical to be performed in time. Specifically having some code that performs poorly (for various reasons) is absolutely _not_ a candidate

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Jan Friesse
: Jan Friesse Sent: 14 February 2019 18:34 To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok Cc: Mark Syms Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock? Edwin, Hello, We were testing

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Jan Friesse
Edwin, Hello, We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a fundamental problem with realtime priorities: - corosync runs on CPU3, and interrupts for the NIC used by corosync are also routed to CPU3 - corosync runs with SCHED_RR, ksoftirqd does not (should

Re: [ClusterLabs] Increasing Token Timeout Safe By Itself?

2019-01-20 Thread Jan Friesse
Eric Robinson wrote: I have a few corosync+pacemaker clusters in Azure. Occasionally, cluster nodes failover, possibly because of intermittent connectivity loss, but more likely because one or more nodes experiences high load and is not able to respond in a timely fashion. I want to make the

[ClusterLabs] Corosync 3.0.1 is available at corosync.org!

2019-01-15 Thread Jan Friesse
: Fabio M. Di Nitto (1): [totemknet] update for libknet.so.2.0.0 init API Ferenc Wágner (3): More natural error messages Don't declare success early Config version must be specified Jan Friesse (1): totemip: Use AF_UNSPEC for ipv4-6 and ipv6-4 Jan Pokorný (1): init

Re: [ClusterLabs] Antw: Corosync-qdevice 3.0.0 is available at GitHub!

2019-01-03 Thread Jan Friesse
Hi! Once again you forgot the one-line-summary what it is ;-) I guess a quorum device... :) Yep, it's a quorum device. Regards, Ulrich Jan Friesse wrote on 12.12.2018 at 15:20 in message : I am pleased to announce the first stable release of Corosync‑Qdevice 3.0.0 available

[ClusterLabs] Corosync 3.0.0 is available at corosync.org!

2018-12-14 Thread Jan Friesse
I am pleased to announce the first stable release of Corosync 3.0 (codename Camelback) branch available immediately from our website at http://build.clusterlabs.org/corosync/releases/. You can also download RPMs for various distributions from CI https://kronosnet.org/builds/. Corosync 3.0 is

[ClusterLabs] Corosync-qdevice 3.0.0 is available at GitHub!

2018-12-12 Thread Jan Friesse
contains just one small fix for warning found by coverity. Complete changelog for final version (compared to RC 1): Jan Friesse (1): certutils: Fix warnings found by coverity Thanks/congratulations to all people that contributed to achieve this great milestone

[ClusterLabs] Corosync 1.4.10 is available at corosync.org!

2018-12-11 Thread Jan Friesse
during Flatiron lifetime - 224 files changed, 25830 insertions(+), 45350 deletions(-) - This version began its life on fedorahosted (decommissioned) and source code repository was SVN - Flatiron was last release with LCR supported Complete changelog for 1.4.10 (compared to v1.4.9): Jan Friesse (1

[ClusterLabs] Corosync 3.0 - Beta1 is available at corosync.org!

2018-12-07 Thread Jan Friesse
for adding/removing nodes Jan Friesse (3): testcpg2: Check cpg_dispatch return code cpghum: Check cpg_local_get return code coroparse: Remove unused cs_err initialization We did our best to test this release as well as we could, but still take it as a Beta version. For anybody

Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-28 Thread Jan Friesse
lejeczek wrote: On 23/11/2018 16:36, Jan Friesse wrote: lejeczek, On 15/10/2018 07:24, Jan Friesse wrote: lejeczek, hi guys, I have a 3-node cluster (CentOS 7.5), 2 nodes seem fine but the third (or probably something else in between) is not right. I see this: $ pcs status --all Cluster

Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-11-23 Thread Jan Friesse
lejeczek, On 15/10/2018 07:24, Jan Friesse wrote: lejeczek, hi guys, I have a 3-node cluster (CentOS 7.5), 2 nodes seem fine but the third (or probably something else in between) is not right. I see this: $ pcs status --all Cluster name: CC Stack: corosync Current DC: whale.private (version

[ClusterLabs] Corosync-qdevice 3.0 - RC1 is available at GitHub!

2018-11-23 Thread Jan Friesse
): Jan Friesse (11): tests: Enlarge timeout for process-list test init: disable stderr output in systemd unit file spec: Use different BuildRequires for SUSE spec: add exlicit docdir spec: Autogenerate bcond options based on config build: Remove WITH_LIST

Re: [ClusterLabs] Corosync cluster issue

2018-11-23 Thread Jan Friesse
Hi Pascu, Hi, We are using Corosync 2.4.0; we are having 3-node cluster (let's say, they are 1, 2, 3); we Probably not a solution but update to latest 2.4.4 may be helpful. encountered a situation where node 3 was unable to join the cluster. Node 1 logs: Nov12 22:50:02 corosync[3301]:

[ClusterLabs] Corosync 3.0 - Alpha5 is available at corosync.org!

2018-11-20 Thread Jan Friesse
(3): man: fix cmap key name runtime.config.totem.token man: Fix typo connnections -> connections man: Fix typo conains -> contains Jan Friesse (40): spec: Add explicit gcc build requirement totemknet: Free instance on failure exit util: Fix strncpy in setcs_name_t

Re: [ClusterLabs] corosync in multicast mode produces lots of unicast traffic

2018-10-17 Thread Jan Friesse
Klaus, Hi Jan! Thanks for your answer. I have a Proxmox cluster which uses Corosync as cluster engine. Corosync uses the "default" multicast configuration. Nevertheless, using tcpdump I see much more packets sent by corosync using unicast between the node members than multicast packets. Is

Re: [ClusterLabs] corosync in multicast mode produces lots of unicast traffic

2018-10-17 Thread Jan Friesse
Klaus, Hi! I have a Proxmox cluster which uses Corosync as cluster engine. Corosync uses the "default" multicast configuration. Nevertheless, using tcpdump I see much more packets sent by corosync using unicast between the node members than multicast packets. Is this normal behavior? If

Re: [ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

2018-10-15 Thread Jan Friesse
lejeczek, hi guys, I have a 3-node cluster (CentOS 7.5), 2 nodes seem fine but the third (or probably something else in between) is not right. I see this: $ pcs status --all Cluster name: CC Stack: corosync Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum

Re: [ClusterLabs] compatibility between Corosync 1.4.7 and 2.4.0

2018-10-04 Thread Jan Friesse
Ramasubramanian, Hi Team, We are doing an upgrade of corosync to 2.4.0 and would like to know if that is compatible with 1.4.7. The reason is that one node in the cluster will be 1.4.7 during the upgrade. Let me know. No, Corosync 1.x and 2.x are not wire compatible. Configs usually are. I

Re: [ClusterLabs] Keep printing "Sent 0 CPG messages" in corosync.log

2018-10-01 Thread Jan Friesse
lkxjtu, Corosync.log has kept printing the following logs for several days. What's wrong with the corosync cluster? Now the cpu load is not high. Interesting messages from logs you've sent are: Sep 30 01:23:28 [127667] paas-controller-172-21-0-2 corosync warning [MAIN ]

Re: [ClusterLabs] Corosync 3 release plans?

2018-09-26 Thread Jan Friesse
Ferenc, Jan Friesse writes: wagner.fer...@kifu.gov.hu writes: triggered by your favourite IPC mechanism (SIGHUP and SIGUSRx are common choices, but logging.* cmap keys probably fit Corosync better). That would enable proper log rotation. What is the reason that you find "copytru

Re: [ClusterLabs] Corosync 3 release plans?

2018-09-25 Thread Jan Friesse
Ferenc, Jan Friesse writes: Default example config should be definitively ported to newer style of nodelist without interface section. example.udpu can probably be deleted as well as example.xml (whole idea of having XML was because of cluster config tools like pcs, but these tools never

Re: [ClusterLabs] Corosync 3 release plans?

2018-09-24 Thread Jan Friesse
Ferenc, Jan Friesse writes: Have you had a time to play with packaging current alpha to find out if there are no issues? I had no problems with Fedora, but Debian has a lot of patches, and I would be really grateful if we could reduce them a lot - so please let me know if there is patch

Re: [ClusterLabs] Corosync 3 release plans?

2018-08-27 Thread Jan Friesse
Ferenc, Jan Friesse writes: Currently I'm pretty happy with current Corosync alpha stability so it would be possible to release final right now, but because I want to give us some room to break protocol/abi (only if needed and right now I don't see any strong reason for such breakage), I

Re: [ClusterLabs] Corosync 3 release plans?

2018-08-27 Thread Jan Friesse
Ferenc, Jan Friesse writes: try corosync 3.x (current Alpha4 is pretty stable [...] Hi Honza, Can you provide an estimate for the Corosync 3 release timeline? We have to plan the ABI transition in Debian and the freeze date is drawing closer. Currently I'm pretty happy with current

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse
=ubuntu-18-04-lts-x86-64/consoleText). Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf about changes in corosync configuration. Regards, Honza Thank you for your quick response! 2018-08-23 8:40 GMT+02:00 Jan Friesse : David, Hello, Im getting crazy about

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse
David, Hello, Im getting crazy about this problem, that I expect to resolve here, with your help guys: I have 2 nodes with Corosync redundant ring feature. Each node has 2 similarly connected/configured NIC's. Both nodes are connected each other by two crossover cables. I believe this is

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-23 Thread Jan Friesse
pires, so with large token timeout detection last long. Regards, Honza Thanks for the help Prasaf On Wed, 22 Aug 2018, 11:55 am Jan Friesse, wrote: Prasad, Thanks Ken and Ulrich. There is definitely high IO on the system with sometimes IOWAIT s of upto 90% I have come across some previo

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Jan Friesse
Prasad, Thanks Ken and Ulrich. There is definitely high IO on the system with sometimes IOWAITs of up to 90% I have come across some previous posts that IOWAIT is also considered as CPU load by Corosync. Is this true ? Does having high IO may lead corosync complain as in " Corosync main process

Re: [ClusterLabs] Using the max_network_delay option of the corosync

2018-08-17 Thread Jan Friesse
Vladislav, Hello! I have the cluster based on pacemaker and corosync. The cluster nodes are located in several data centers. And there is up to 450 ms ping latency between some of the nodes. Sometimes these delays lead to split brains. Yep, that's quite a lot. Enlarge token timeout so

Re: [ClusterLabs] short circuiting the corosync token timeout

2018-08-13 Thread Jan Friesse
Chris Walker wrote: Hello, Before Pacemaker can declare a node as 'offline', the Corosync layer must first declare that the node is no longer part of the cluster after waiting a full token timeout. For example, if I manually STONITH a node with 'crm -F node fence node2', even if the

[ClusterLabs] Corosync-qdevice 3.0 - Beta1 is available at GitHub!

2018-08-09 Thread Jan Friesse
changelog for Beta 1 (compared to Alpha 2): Jan Friesse (13): tests: Fix process-list test to work on FreeBSD tests: process-list add stdlib include tests: enlarge process-list timeouts unix-socket: Fix snprintf warning tests: Notify process-list after trap was created

Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-07-13 Thread Jan Friesse
Jason, On Thu, Jun 21, 2018 at 10:47 AM Jason Gauthier wrote: On Thu, Jun 21, 2018 at 9:49 AM Jan Pokorný wrote: On 21/06/18 07:05 -0400, Jason Gauthier wrote: On Thu, Jun 21, 2018 at 5:11 AM Christine Caulfield wrote: On 19/06/18 18:47, Jason Gauthier wrote: Attached! That's very

[ClusterLabs] Corosync 3.0 - Alpha4 is available at corosync.org!

2018-07-13 Thread Jan Friesse
knet: Don't try to create loopback interface twice totemconfig: Check for things that cannot be changed on the fly config: Enforce use of 'name' node attribute in multi-link clusters config: Fail config validation if not all nodes have all links Jan Friesse (4): upstart: Remove

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-07-02 Thread Jan Friesse
Hi Thomas, Hi, On 04/25/2018 at 09:57 AM, Jan Friesse wrote: Thomas Lamprecht wrote: On 4/24/18 6:38 PM, Jan Friesse wrote: On 4/6/18 10:59 AM, Jan Friesse wrote: Thomas Lamprecht wrote: On 03/09/2018 at 05:26 PM, Jan Friesse wrote: I've tested it too and yes, you are 100% right

[ClusterLabs] Corosync 3.0 - Alpha3 is available at corosync.org!

2018-04-30 Thread Jan Friesse
t: the parameter is reserved, must be NULL tools: don't distribute what we can easily make Jan Friesse (7): totemsrp: Implement sanity checks of received msgs totemsrp: Check join and leave msg length SECURITY: Remove SECURITY file totemsrp: Fix srp_addr_compare totemsr

[ClusterLabs] Corosync-qdevice 3.0 - Alpha2 is available at GitHub!

2018-04-27 Thread Jan Friesse
1): Bin Liu (1): qdevice: optarg should be str in init_from_cmap Ferenc Wágner (1): Fix typo: sucesfully -> successfully Jan Friesse (12): spec: Modernize spec file a bit init: Quote subshell result properly Quote certutils scripts properly Fix NULL poin

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-25 Thread Jan Friesse
Thomas Lamprecht wrote: Honza, On 4/24/18 6:38 PM, Jan Friesse wrote: On 4/6/18 10:59 AM, Jan Friesse wrote: Thomas Lamprecht wrote: On 03/09/2018 at 05:26 PM, Jan Friesse wrote: I've tested it too and yes, you are 100% right. Bug is there and it's pretty easy to reproduce when

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-24 Thread Jan Friesse
Thomas, Hi Honza On 4/6/18 10:59 AM, Jan Friesse wrote: Thomas Lamprecht wrote: On 03/09/2018 at 05:26 PM, Jan Friesse wrote: I've tested it too and yes, you are 100% right. Bug is there and it's pretty easy to reproduce when node with lowest nodeid is paused. It's slightly harder

Re: [ClusterLabs] Corosync 2.4.4 is available at corosync.org!

2018-04-13 Thread Jan Friesse
Ferenc Wágner wrote: Jan Friesse <jfrie...@redhat.com> writes: Ferenc Wágner wrote: I wonder if c139255 (totemsrp: Implement sanity checks of received msgs) has direct security relevance as well. Not entirely direct, but quite similar. Should I include that too in the

Re: [ClusterLabs] Corosync 2.4.4 is available at corosync.org!

2018-04-13 Thread Jan Friesse
Ferenc Wágner wrote: Jan Pokorný <jpoko...@redhat.com> writes: On 12/04/18 14:33 +0200, Jan Friesse wrote: This release contains a lot of fixes, including fix for CVE-2018-1084. Security related updates would preferably provide more context Absolutely, thanks for pro

[ClusterLabs] Corosync 2.4.4 is available at corosync.org!

2018-04-12 Thread Jan Friesse
vor Use RuntimeDirectory instead of tmpfiles.d Jan Friesse (29): coroparse: Do not convert empty uid, gid to 0 sam: Fix snprintf compiler warnings quorumtool: Use full buffer size in snprintf man: Add note about qdevice parallel cmds start sync: Remove unneeded determ

Re: [ClusterLabs] Pacemake/Corosync good fit for embedded product?

2018-04-11 Thread Jan Friesse
David, Hi, We are planning on creating a HA product in an active/standby configuration whereby the standby unit needs to take over from the active unit very fast (<50ms including all services restored). We are able to do very fast signaling (say 1000Hz) between the two units to detect

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-06 Thread Jan Friesse
Hi Thomas, Thomas Lamprecht wrote: Hi Honza, On 03/09/2018 at 05:26 PM, Jan Friesse wrote: Thomas, TotemConfchgCallback: ringid (1.1436) active processors 3: 1 2 3 EXIT Finalize result is 1 (should be 1) Hope I did both test right, but as it reproduces multiple times with testcpg

Re: [ClusterLabs] corosync 2+ version on centos6.9

2018-03-27 Thread Jan Friesse
installed pacemaker 1.1.15-5. Choose 2.4.3. Other versions in the 2.x branch are old and contain more or less severe bugs. Honza -Jing On Mon, Mar 26, 2018 at 3:33 AM, Jan Friesse <jfrie...@redhat.com> wrote: Jing, Hi, Is there a way to install corosync 2.0 or above on CentOS 6.9?

Re: [ClusterLabs] corosync 2+ version on centos6.9

2018-03-26 Thread Jan Friesse
Jing, Hi, Is there a way to install corosync 2.0 or above on CentOS 6.9? I need it to work with pgsqlms module which supports postgresql10. Thanks, Yes, but no official rpms are provided. You can compile it yourself and it will work (I still use some RHEL 6 boxes for Needle development

[ClusterLabs] Corosync 3.0 - Alpha2 is available at corosync.org!

2018-03-16 Thread Jan Friesse
] fix build with non-standard knet location [rpm] fixup corosync.spec.in to build on opensuse [rpm] use rpm macros to identify build distro Jan Friesse (9): logging: Make blackbox configurable logging: Close before and open blackbox after fork init: Quote subshell result

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-09 Thread Jan Friesse
Thomas, Hi, On 3/7/18 1:41 PM, Jan Friesse wrote: Thomas, First thanks for your answer! On 3/7/18 11:16 AM, Jan Friesse wrote: ... TotemConfchgCallback: ringid (1.1436) active processors 3: 1 2 3 EXIT Finalize result is 1 (should be 1) Hope I did both test right

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Jan Friesse
Thomas, First thanks for your answer! On 3/7/18 11:16 AM, Jan Friesse wrote: Thomas, Hi, first some background info for my questions I'm going to ask: We use corosync as a basis for our distributed realtime configuration file system (pmxcfs)[1]. nice We got some reports

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Jan Friesse
Thomas, Hi, first some background info for my questions I'm going to ask: We use corosync as a basis for our distributed realtime configuration file system (pmxcfs)[1]. nice We got some reports of a completely hanging FS with the only correlations being high load, often IO, and most

Re: [ClusterLabs] Does CMAN Still Not Support Multipe CoroSync Rings?

2018-03-05 Thread Jan Friesse
Eric, Well, I finally got around to trying out the tag and I managed to get two rings running, but the first ring is not obeying my address and port directives. Here's my cluster.conf

Re: [ClusterLabs] Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Jan Friesse
Ken Gaillot wrote: On Fri, 2018-02-16 at 15:06 -0600, Ken Gaillot wrote: Source code for the first release candidate for Pacemaker version 2.0.0 is now available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.0-rc1 The main goal of the change from Pacemaker 1

Re: [ClusterLabs] What does these logs mean in corosync.log

2018-02-12 Thread Jan Friesse
lkxjtu, I will just comment on the corosync log. These logs are both printed when the system is abnormal; I am very confused about what they mean. Does anyone know what they mean? Thank you very much corosync version 2.4.0 pacemaker version 1.1.16 1) Feb 01 10:57:58 [18927] paas-controller-192-167-0-2

Re: [ClusterLabs] Does CMAN Still Not Support Multipe CoroSync Rings?

2018-02-12 Thread Jan Friesse
Eric, General question. I tried to set up a cman + corosync + pacemaker cluster using two corosync rings. When I start the cluster, everything works fine, except when I do a 'corosync-cfgtool -s' it only shows one ring. I tried manually editing the /etc/cluster/cluster.conf file adding two

Re: [ClusterLabs] [IMPORTANT] Fatal, yet rare issue verging on libqb's design flaw and/or it's use corosync around daemon-forking

2018-01-22 Thread Jan Friesse
It was discovered that corosync exposes itself to a self-crash under a rare circumstance whereby the corosync executable is run when there is already a daemon instance around (does not apply to corosync serving without any backgrounding, i.e. launched with "-f" switch). Such a circumstance can be
