[ClusterLabs] [Announce] libqb 2.0.8 released

2023-07-21 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.8

https://github.com/ClusterLabs/libqb/releases/tag/v2.0.8

The main purpose of this release is to fix a potential memory overwrite 
caused by very long log messages, so an upgrade is recommended.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 2.0.7 released

2023-06-07 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.7

https://github.com/ClusterLabs/libqb/releases/tag/v2.0.7


This release mainly fixes build and test issues (especially parallel builds 
with -j, which are now supported), but there are also a few obscure bugfixes 
that make upgrading worthwhile.

Chrissie Caulfield (11):
library fixes:
lib: Fix some small bugs spotted by newest covscan (#471)
ipc: Retry receiving credentials if the the message is short (#476)
timer: Move state check to before time check (#479)
tests: Close race condition in check_loop (#480)
blackbox: fix potential overlow/memory corruption (#486)

other fixes:
tests: Make ipc test more portable (#466)
tests: cleanup the last of the empty directories (#467)
doxygen2man: Fix function parameter alignment (#468)
tests: Fix tests on FreeBSD-devel (#469)
test: Remove gnu/lib-names.h from libstat_wrapper.c (#482)
tests: allow -j to work (#485)
Update -version info for 2.0.7

Fabrice Fontaine (1):
Add --disable-tests option (#475)

Jan Friesse (2):
configure: Modernize configure.ac a bit (#470)
spec: Migrate to SPDX license (#487)

growdu (1):
add simplified chinese readme (#474)

wferi (2):
m4/ax_pthread.m4: update to latest upstream version (serial 31) (#472)
strlcpy: avoid compiler warning from strncpy (#473)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker-fenced /dev/shm errors

2023-03-27 Thread Christine Caulfield

On 27/03/2023 07:48, d tbsky wrote:

Hi:
The cluster is running under RHEL 9.0 elements. Today I saw the log
reporting strange errors like the ones below:

Mar 27 13:07:06.287 example.com pacemaker-fenced[2405]
(qb_sys_mmap_file_open) error: couldn't allocate file
/dev/shm/qb-2405-2403-12-A9UUaJ/qb-request-stonith-ng-data:
Interrupted system call (4)
Mar 27 13:07:06.288 example.com pacemaker-fenced[2405]
(qb_rb_open_2)  error: couldn't create file for mmap
Mar 27 13:07:06.288 example.com pacemaker-fenced[2405]
(qb_ipcs_shm_rb_open)   error:
qb_rb_open:/dev/shm/qb-2405-2403-12-A9UUaJ/qb-request-stonith-ng:
Interrupted system call (4)
Mar 27 13:07:06.288 example.com pacemaker-fenced[2405]
(qb_ipcs_shm_connect)   error: shm connection FAILED: Interrupted
system call (4)
Mar 27 13:07:06.288 example.com pacemaker-fenced[2405]
(handle_new_connection) error: Error in connection setup
(/dev/shm/qb-2405-2403-12-A9UUaJ/qb): Interrupted system call (4)
Mar 27 13:07:06.288 example.com pacemakerd  [2403]
(pcmk__ipc_is_authentic_process_active) info: Could not connect to
stonith-ng IPC: Interrupted system call
Mar 27 13:07:06.288 example.com pacemakerd  [2403]
(check_active_before_startup_processes) notice:
pacemaker-fenced[2405] is unresponsive to ipc after 1 tries

There are no more "pacemaker-fenced" keywords in the log. The cluster
seems fine and the process id "2405" of pacemaker-fenced is still
running. May I assume the cluster is ok and I don't need to do
anything, since pacemaker didn't complain further?



It sounds like you're running an old version of libqb; upgrading to 
libqb 2.0.6 (in RHEL 9.1) should fix those messages.


Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker-remoted /dev/shm errors

2023-03-06 Thread Christine Caulfield

Hi,

The error is coming from libqb - which is what manages the local IPC 
connections between local clients and the server.


I'm the libqb maintainer but I've never seen that error before! Is there 
anything unusual about the setup on this node? Like filesystems on NFS 
or some other networked filesystem?


Other basic things to check are that /dev/shm is not full. Yes, normally 
you'd get ENOSPC in that case but it's always worth checking because odd 
things can happen when filesystems get full.


It might be helpful to strace the client and server processes when the 
error occurs (if that's possible). I'm not 100% sure which operation is 
failing with EREMOTEIO - though I can't find many useful references to 
that error in the kernel, which is also slightly weird.


Chrissie

On 06/03/2023 13:03, Alexander Epaneshnikov via Users wrote:

Hello. We are using pacemaker 2.1.4-5.el8 and seeing strange errors in the
logs when a request is made to the cluster.

Feb 17 08:18:15 gm-srv-oshv-001.int.cld pacemaker-remoted   [2984] 
(handle_new_connection)  error: Error in connection setup 
(/dev/shm/qb-2984-1077673-18-7xR8Y0/qb): Remote I/O error (121)
Feb 17 08:19:15 gm-srv-oshv-001.int.cld pacemaker-remoted   [2984] 
(handle_new_connection)  error: Error in connection setup 
(/dev/shm/qb-2984-1077927-18-dX5NSt/qb): Remote I/O error (121)
Feb 17 08:20:16 gm-srv-oshv-001.int.cld pacemaker-remoted   [2984] 
(handle_new_connection)  error: Error in connection setup 
(/dev/shm/qb-2984-1078160-18-RjzD4K/qb): Remote I/O error (121)
Feb 17 08:21:16 gm-srv-oshv-001.int.cld pacemaker-remoted   [2984] 
(handle_new_connection)  error: Error in connection setup 
(/dev/shm/qb-2984-1078400-18-YyJmJJ/qb): Remote I/O error (121)

Other than that, pacemaker/corosync works fine.

Any suggestions on the cause of the error, or at least where to start 
debugging, are welcome.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync not starting

2022-06-28 Thread Christine Caulfield

On 27/06/2022 17:10, Sridhar K wrote:

Hi Team,

Corosync is not starting and I am getting the error below. Is there a port 
number I can telnet to in order to check it, similar to port 2224 for pcs?


[two image attachments showing the corosync startup error]



The error message from Corosync is "no interfaces defined" - so it looks 
like the node(s) being started can't find a name in corosync.conf that 
matches the host - either by IP address or name. Without knowing more 
it's hard to help further, but if the config was generated by pcs then 
it might be worth posting your corosync.conf file to see what has 
happened to it. I think it would be odd for pcs to generate a file on a 
node with incorrect names, though I'm not a pcs expert.


If you generated the corosync.conf file yourself then check that the 
names in the file match those available to interfaces in the nodes 
themselves. Either use IP addresses or non-ambiguous names - ideally in 
/etc/hosts for reliability.


Chrissie


Regards
Sridharan

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




Re: [ClusterLabs] No node name in corosync-cmapctl output

2022-06-01 Thread Christine caulfield

On 01/06/2022 11:17, Jan Friesse wrote:

On 31/05/2022 16:28, Andreas Hasenack wrote:

Hi,

On Tue, May 31, 2022 at 1:35 PM Jan Friesse  wrote:


Hi,

On 31/05/2022 15:16, Andreas Hasenack wrote:

Hi,

corosync 3.1.6
pacemaker 2.1.2
crmsh 4.3.1

TL;DR
I only seem to get a "name" attribute in the "corosync-cmapctl | grep
nodelist" output if I set an explicit name in corosync.conf's
nodelist. If I rely on the default of "name will be uname -n if it's
not set", I get nothing.



I'm wondering where the problem is. The name is not set, so it's not in cmap,
which is (mostly) a 1:1 mapping of the config file. So this is expected, not a bug.


It was surprising to me, because the node clearly has a name (crm_node 
-n).



Why not also use "uname -n" when "name" is not explicitly set in the
corosync nodelist config?


Can you please share the use case for this behavior? It shouldn't be hard
to implement.


The use case is a test script[1], which installs the package, starts
the services, and then runs some quick checks. One of the tests is to
check for the node name in "crm status" output, and for that it needs
to discover the node name.


got it



Sure, plenty of ways to achieve that. Set it in the config to a known
name, or run "crm_node -n", or something else. The script is doing:
POS="$(corosync-cmapctl -q -g nodelist.local_node_pos)"
NODE="$(corosync-cmapctl -q -g nodelist.node.$POS.name)"


Ok, so you only need the local node name - then why not add
```
[ "$NODE" == "" ] && NODE=`uname -n`
```



corosync-quorumtool displays node names. It just calls getnameinfo() on 
the IP address and returns the first result, but that might serve.
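
Roughly along these lines - a standalone sketch of that reverse lookup, not
the actual quorumtool code, with a made-up example address:

```c
/* Sketch of the per-node reverse lookup: getnameinfo() on the node
 * address, falling back to the numeric form.  Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>

int main(void)
{
	struct sockaddr_in sa;
	char host[256];

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	inet_pton(AF_INET, "192.168.122.11", &sa.sin_addr); /* example address */

	if (getnameinfo((struct sockaddr *)&sa, sizeof(sa),
	                host, sizeof(host), NULL, 0, 0) == 0) {
		printf("name: %s\n", host);   /* first result from the resolver */
	} else {
		/* No name found: fall back to the numeric address. */
		getnameinfo((struct sockaddr *)&sa, sizeof(sa),
		            host, sizeof(host), NULL, 0, NI_NUMERICHOST);
		printf("name: %s\n", host);
	}
	return 0;
}
```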


Chrissie


No matter what, implementing resolution of just the local node name would be 
really easy - implementing it cluster-wide would be super hard (at the 
corosync level). On the other hand, I'm really not that keen on having just 
the local node name filled in; it creates a bunch of other problems 
(default value during reload, ...).






and I was surprised that there was no "name" entry. In this cluster
stack, depending on which layer you ask, you may get different
answers :)


Yup, agree. Sometimes it's confusing :( But the test is really about 
`crm` so pacemaker level...


Regards,
   Honza






1. 
https://salsa.debian.org/ha-team/crmsh/-/blob/master/debian/tests/pacemaker-node-status.sh 


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





Re: [ClusterLabs] Corosync Transport- Knet Vs UDPU

2022-03-27 Thread Christine Caulfield

On 28/03/2022 03:30, Somanath Jeeva via Users wrote:

Hi ,

I am upgrading from corosync 2.x/pacemaker 1.x to corosync 3.x/pacemaker 
2.1.x


In our use case we are using a 2 node corosync/pacemaker cluster.

In the corosync 2.x version I was using udpu as the transport method. In 
corosync 3.x, as per the man pages, the default transport mode is knet, 
and knet uses udp as its transport method.


I have the below doubts about the transport method.

 1. Does knet require any special configuration at the network level (like
enabling multicast)?



No. What knet calls UDP is similar (from the user's POV) to corosync's 
UDPU: it's a unicast transport and doesn't need any multicast 
configuration.


Sorry, that's confusing, but it's more technically 'correct'. The main 
reason UDPU was called that is that it was added to corosync when the 
old (multicast) UDP protocol was causing trouble for some people without 
good multicast networks.




 2. In corosync 2.x, udp was used for multicast; with the knet transport, does
udp mean multicast?


No, see above. There is no multicast transport in knet.


 3. Will udpu be deprecated in the future?



Yes. We strongly recommend people use knet as the corosync transport, as 
that is the one getting the most development. The old UDP/UDPU protocols 
will only get bugfixes. Knet provides multi-homing with up to 8 links, 
link priorities, and much more.


I wrote a paper on this when we first introduced knet into corosync 
which might help:


https://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf


Chrissie



Kindly help me with these doubts.

With Regards

Somanath Thilak J


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




[ClusterLabs] [Announce] libqb 2.0.6 released

2022-03-23 Thread Christine Caulfield

A quick update to 2.0.5 that fixes the tests and RPM building.

* The new ipc_sock tests need to be run as root, as otherwise each 
sub-test will time out, making the run-time huge.
* Make sure that the libstat_wrapper.so library is included in the 
libqb-tests RPM (when built).


If you don't have any issues with the tests or RPMs then please feel 
free to wait for 2.0.7 - but packagers should use this release in 
preference to applying patches.


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 2.0.5 released

2022-03-21 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.5

The headline feature of this release is the addition of the new 
qb_ipcc_connect_async() API call, but there are lots of smaller fixes 
that should be helpful.
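
For anyone wondering what the new call looks like in practice, here is a rough
sketch of the intended flow, not a definitive example: it assumes the
qb_ipcc_connect_async()/qb_ipcc_connect_continue() pair declared in qbipcc.h,
and the service name, message size and poll timeout below are made up for
illustration - check the header and man pages for the authoritative signatures
and return conventions.

```c
/* Sketch only: async IPC connect flow as understood from qbipcc.h.
 * "my-service", the 8192-byte buffer size and the 5s timeout are
 * placeholders, not values from the release. */
#include <poll.h>
#include <stdio.h>
#include <qb/qbipcc.h>

int main(void)
{
	int fd;

	/* Returns immediately with a connection handle and an fd to poll. */
	qb_ipcc_connection_t *conn = qb_ipcc_connect_async("my-service", 8192, &fd);
	if (conn == NULL) {
		perror("qb_ipcc_connect_async");
		return 1;
	}

	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	if (poll(&pfd, 1, 5000) > 0) {
		/* Server responded: complete the handshake. */
		if (qb_ipcc_connect_continue(conn) != 0) {
			fprintf(stderr, "qb_ipcc_connect_continue failed\n");
			return 1;
		}
		/* The connection can now be used with qb_ipcc_send()/qb_ipcc_recv(). */
	}

	qb_ipcc_disconnect(conn);
	return 0;
}
```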


Chrissie Caulfield (7):
ipcc: Add an async connect API (#450)
Tidy some scripts (#454)
Bring the INSTALL guide up-to-date (#456)
unix: Don't fail on FreeBSD running ZFS (#461)
test: Clean /dev/shm a bit better (#459)
blackbox: Sanitize items read from the blackbox header (#438)
tests: Run IPC with use-filesystem-sockets active (#455)

Jakub Jankowski (1):
Retry if posix_fallocate is interrupted with EINTR (#453)

Ken Gaillot (5):
util: refactor so ifdef's are withing each time-related function
util: add constant for which realtime clock to use
util: drop HAVE_CLOCK_GETRES_MONOTONIC configure constant
util: use HAVE_GETTIMEOFDAY where appropriate
util: reimplement time functions as a series of fallbacks


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 2.0.4 released

2021-11-15 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.4

Source code is available at:
https://github.com/ClusterLabs/libqb/releases/

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

The most important fix in this release is that we no longer log errors 
inside the signal handler in loop_poll.c - this could cause an 
application hang in some circumstances.


There is also a new implementation of the timerlist that should improve 
performance when a large number of timers are active.


shortlog:

Chrissie Caulfield (3):
  doxygen2man: print structure descriptions (#443)
  Fix pthread returns (#444)
  poll: Don't log in a signal handler (#447)

Jan Friesse (1):
  Implement heap based timer list (#439)

orbea (1):
  build: Fix undefined pthread reference. (#440)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Two node cluster without fencing and no split brain?

2021-07-21 Thread Christine Caulfield

On 21/07/2021 09:50, Frank D. Engel, Jr. wrote:
OpenVMS can do this sort of thing without a requirement for fencing (you 
still need a third disk as a quorum device in a 2-node cluster), but 
Linux (at least in its current form) cannot. From what I can tell the 
fencing requirements in the Linux solution are mainly due to limitations 
of how deeply the clustering solution is integrated into the kernel.


There is an overview here: 
https://sciinc.com/remotedba/techinfo/tech_presentations/Boot%20Camp%202013/Bootcamp_2013_Comparison%20of%20Red%20Hat%20Clusters%20with%20OpenVMS%20Clusters.pdf 



An interesting document (if rather out of date now in some areas at 
least). I used to work on VMS up to late V5 (I now work on corosync, but 
also started the original linux DLM) and always wanted to get Linux 
clustering up to that standard. There are several reasons why that 
wasn't really possible.


Firstly Linux has a write-back disk cache which makes sharing disks 
between machines MUCH harder, and limits a lot of what you can do. Many 
of the limitations of GFS2 seem (to me) to be caused by this. VMS - at 
least when I was a sysadmin - wrote straight to devices or via dedicated 
intelligent controllers that were also a shared cluster resource 
(HDC50/75s in my day). I see VMS6 introduced a "Cluster-wide virtual I/O 
cache" which sounds like something Linux could do with. But good luck 
getting that merged ;)


Secondly, you are right, we never really got kernel buy-in. The original 
Sistina CMAN (which I wrote) was a kernel module, partly because I hoped 
that we could get better integration that way (it never happened) and 
partly because we thought we might need to avoid too many kernel/usermode 
context switches for GFS.


Thirdly, and mainly, I got the impression that people didn't, in the 
main, want that type of cluster. As that document correctly points out, 
most Linux clusters are simple(ish) two-node failover clusters. 
Therefore, also because of 1 and 2 above, we pursued the path we have.



I would love for Linux to have the cluster capabilities that VMS had in 
the 80s/90s, but not only is it a massive amount of work, you'd also need 
buy-in from a lot of people who don't really see the point of it.


Chrissie




I am wondering how much of what OpenVMS does could be integrated into 
Linux in the future to simplify the HA clustering situation. This is one 
thing OpenVMS currently does FAR better than any other platform I've 
come across, so it is likely there is still much to be learned from it.



On 7/20/21 6:45 PM, Digimer wrote:

On 2021-07-20 6:04 p.m., john tillman wrote:

Greetings,

Is it possible to configure a two node cluster (pacemaker 2.0) without
fencing and avoid split brain?

No.


I was hoping there was a way to use a 3rd node's ip address, like from a
network switch, as a tie breaker to provide quorum.  A simple successful
ping would do it.

Quorum is a different concept and doesn't remove the need for fencing.

I realize that this 'ping' approach is not the bullet proof solution that
fencing would provide.  However, it may be an improvement over two nodes
alone.

It would be, at best, a false sense of security.


Is there a configuration like that already?  Any other ideas?

Pointers to useful documents/discussions on avoiding split brain with
two node clusters would be welcome.

https://www.alteeve.com/w/The_2-Node_Myth

(note: currently throwing a cert error related to the let's encrypt
issue, should be cleared up soon).



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





[ClusterLabs] [Announce] libqb 2.0.3 released

2021-03-03 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.3. This is the
latest stable release of libqb


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/2.0.3/libqb-2.0.3.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

This is another miscellaneous collection of small bugfixes, with the 
addition of a new feature to the logging subsystem to allow a MESSAGE_ID 
to be specified when using the systemd journal. This requires libqb to 
be built --with-systemd.




Aleksei Burlakov (1):
syslog: Add a message-id parameter for messages (#433)

Chrissie Caulfield (5):
doxygen2man: fix printing of lines starting with '.' (#431)
strlcpy: Check for maxlen underflow (#432)
ipcc: Have a few goes at tidying up after a dead server (#434)
timers: Add some locking (#436)
tests: Fix up resources.test (#435)

wferi (1):
doxygen2man: ignore all-whitespace brief descriptions (#430)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: effieciently collecting some cluster facts

2021-02-24 Thread Christine Caulfield
The most efficient way of getting corosync facts about nodes/quorum is 
to use the votequorum API.


See /usr/include/corosync/votequorum.h and, in the corosync source 
tarball, tests/testvotequorum1.c.
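
By way of illustration, a minimal sketch in the spirit of testvotequorum1.c -
not a drop-in tool: the struct fields and error handling should be
double-checked against votequorum.h, the node ID is simply taken from the
command line rather than discovered, and the -lvotequorum link flag is an
assumption.

```c
/* Minimal votequorum sketch: print quorum facts for one node.
 * Check votequorum.h for the authoritative API; build with something
 * like: gcc -o vqinfo vqinfo.c -lvotequorum (assumed link flag). */
#include <stdio.h>
#include <stdlib.h>
#include <corosync/votequorum.h>

int main(int argc, char *argv[])
{
	votequorum_handle_t handle;
	struct votequorum_info info;
	/* Node ID passed on the command line; default to 1 for the sketch. */
	unsigned int nodeid = (argc > 1) ? (unsigned int)atoi(argv[1]) : 1;

	if (votequorum_initialize(&handle, NULL) != CS_OK) {
		fprintf(stderr, "votequorum_initialize failed (is corosync running?)\n");
		return 1;
	}

	if (votequorum_getinfo(handle, nodeid, &info) == CS_OK) {
		printf("node id:          %u\n", info.node_id);
		printf("node votes:       %u\n", info.node_votes);
		printf("expected votes:   %u\n", info.node_expected_votes);
		printf("highest expected: %u\n", info.highest_expected);
		printf("total votes:      %u\n", info.total_votes);
		printf("quorum:           %u\n", info.quorum);
		printf("flags:            0x%x\n", info.flags);
	}

	votequorum_finalize(handle);
	return 0;
}
```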

Chrissie


On 25/02/2021 07:16, Ulrich Windl wrote:

Hi!

I'm thinking about some simple cluster status display that is updated 
periodically.
I wonder how to get some "cluster facts" efficiently. Among those are:

* Is corosync running, and how many nodes can be seen?
* Is Pacemaker running, how many nodes does it see, and does it have a quorum?
* Is the current node DC?
* How many resources matching some regular expression are running?

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





Re: [ClusterLabs] corosync.conf is missing, I did not delete manually. what should I do?

2021-02-16 Thread Christine Caulfield
If you ran pcs cluster destroy then, yes, that will delete corosync.conf 
(at least it did when I just tried it) - which seems reasonable 
behaviour to me.


If you want it back then you should either rerun pcs to create the 
cluster again or rescue the file from system backups I suppose.


Chrissie



On 16/02/2021 09:42, Harishkumar Pathangay wrote:

Hi,

This is so stupid of me asking such a question.

But the corosync.conf is missing in both the nodes.

Will this file be there only if I have a cluster definition? [Maybe I 
have destroyed the cluster, not sure though….]


Assuming there are no clusters defined at all - say I destroy the 
cluster - should I still expect a corosync.conf file in its location?


Thanks,

Harish P

Sent from Mail for Windows 10



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





Re: [ClusterLabs] Corosync node gets unique Ring ID

2021-01-27 Thread Christine Caulfield
A few things really stand out from this report; I think the inconsistent 
ring_id is just a symptom.


It worries me that corosync-quorumtool behaves differently on some nodes 
- some show names, some just IP addresses. That could be a cause of some 
inconsistency.


Also the messages
"
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly.
Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] 
The consensus timeout expired.
Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ] 
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ] 
Totem is unable to form a cluster because of an operating system or 
network fault. The most common cause of this message is that the local 
firewall is configured improperly."


are a BAD sign. All this is contributing to the problems and also to the 
timeout on reload (which is really not a good thing). Those messages 
are not caused by the reload; they are caused by some networking problem.


So what seems to be happening is that the cluster is being partitioned 
somehow (I can't tell why; that's something you'll need to investigate) 
and corosync isn't recovering very well from it. One of the things that 
can make this happen is doing "ifdown" - which that old version of 
corosync doesn't cope with very well. Even if that's not exactly what 
you are doing (and I see no reason to believe you are) I do wonder if 
something similar is happening by other means - NetworkManager perhaps?


So firstly, check the networking setup and be sure that all the nodes are 
consistently configured, and check that the network is not closing down 
interfaces or ports at the time of the incident.


Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that 
will help.


Chrissie



On 26/01/2021 02:45, Igor Tverdovskiy wrote:

Hi All,

 > pacemakerd -$
Pacemaker 1.1.15-11.el7

 > corosync -v
Corosync Cluster Engine, version '2.4.0'

 > rpm -qi libqb
Name        : libqb
Version     : 1.0.1

Please assist. I recently faced a strange bug (I suppose), where one of the 
cluster nodes gets a "Ring ID" different from the others, for example after a 
corosync config reload, e.g.:



*Affected node:*

(target.standby)> sudo corosync-quorumtool
Quorum information
--
Date:             Tue Jan 26 01:58:54 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          5
Ring ID: *7/59268* <<<
Quorate:          Yes

Votequorum information
--
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
--
     Nodeid      Votes Name
          7          1 dispatching-sbc
          8          1 dispatching-sbc-2-6
          3          1 10.27.77.202
          5          1 cassandra-3 (local)
          6          1 10.27.77.205



*OK nodes:*
 > sudo corosync-quorumtool
Quorum information
--
Date:             Tue Jan 26 01:59:13 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          8
Ring ID: *7/59300* <<<
Quorate:          Yes

Votequorum information
--
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
--
     Nodeid      Votes Name
          7          1 10.27.77.106
          8          1 10.27.77.107 (local)
          3          1 10.27.77.202
          6          1 10.27.77.205



Also strange is that *crm status shows only two of five nodes* on the 
affected node, but at the same time

*"sudo crm_node -l" shows all 5 nodes as members*.

(target.standby)> sudo crm_node -l
5 target.standby member
7 target.dsbc1 member
3 target.sip member
8 target.dsbc member
6 target.sec.sip member

---

(target.standby)> sudo crm status
Stack: corosync
Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with 
quorum
Last updated: Tue Jan 26 02:08:02 2021          Last change: Mon Jan 25 
14:27:18 2021 by root via crm_node on target.sec.sip


2 nodes and 7 resources configured

Online: [ target.sec.sip target.sip ] <<

Full list of resources:


The issue here is that crm configure operations fail with timeout error:

(target.standby)> sudo crm configure property maintenance-mode=true
*Call cib_apply_diff failed (-62): Timer expired*
ERROR: could not patch cib (rc=62)
INFO: offending xm

[ClusterLabs] [Announce] [Alpha] Rust bindings for Corosync libraries

2021-01-20 Thread Christine Caulfield
I don't know how many/few people will be interested in this, but I have 
been working on some Rust bindings for the corosync libraries: cpg, cfg, 
cmap, quorum & votequorum.


They are currently in Alpha stage but all features are (I think) 
implemented and seem to work. There's a little more work to be done 
before it is suitable for submitting to crates.io but that is definitely 
intended.


If anyone wants to try these out and let me know how they get on - report 
bugs, suggest missing features, etc. - I am keen to hear from you.


If there is a demand for more of this then I might turn my hand to other 
cluster library APIs (management and time permitting ;)


https://github.com/chrissie-c/rust-corosync

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Running shell command on remote node via corosync messaging infrastructure

2021-01-04 Thread Christine Caulfield



On 04/01/2021 13:19, Klaus Wenninger wrote:

On 1/4/21 1:50 PM, Christine Caulfield wrote:



On 04/01/2021 09:21, Klaus Wenninger wrote:

On 1/4/21 8:36 AM, Christine Caulfield wrote:



On 18/12/2020 20:41, Andrei Borzenkov wrote:

18.12.2020 21:54, Ken Gaillot пишет:

On Fri, 2020-12-18 at 17:51 +, Animesh Pande wrote:

Hello,

Is there a tool that would allow for commands to be run on remote
nodes in the cluster through the corosync messaging layer? I have a
cluster configured with multiple corosync communication rings
(public
network and private network). I would like to be able to run a
command on the remote node through corosync layer even when the
communication ring associated with the public network goes down but
the private network communication ring is still connected.

Please let me know if there is such a tool provided by corosync that
I can use.

Thank you for your time!

Best regards,
Animesh


Hi,

No, there is not. I'm assuming you're using "remote" in the
conventional sense and not for Pacemaker Remote nodes, but the answer
is no either way. :)

Of course, you can configure sshd to listen on the cluster interface.


What do you call "cluster interface"? As I understand the question,
the
idea is to use redundancy of corosync communication. Is it possible to
configure virtual interface on top of corosync rings?




Yes there is. It's called the 'nozzle' device and works in corosync >=
3.0.2.

It creates a pseudo device that passes all traffic through the knet
transport, so you get the redundancy of multiple links transparently.
You don't get join/leave up/down notifications like CPG (because it's
an interface not an API) but you can use the API if you need those.

Talking of this ... would it be possible / make sense to translate that
to something
that trigger if/link up/down somehow? (up as long as there is at least
one member
apart from the local node or something)



I'm not really sure how that would work. Forging ICMP notifications
sounds a bit messy and we can't take the I/F down just for one node
going down as that would be very anti-social.

Was aware that it needs some algorithm that is a bit more sophisticated.
('up as long as there is at least one member apart from the local node
or something' was my crude first idea)
Anyway probably quite useless for anything else than point-to-point
connections. Hmm ... similar as with layer 1 indication.
The ICMP idea might of course be something to think about but sounds
like significant effort and duplication of IP-stack-stuff - unless there are
interfaces I'm not aware of. Are switches doing anything similar? (L3
probably)
Haven't tried ... How does it handle DHCP expires while isolated?

Just thought it might be fun to think about possibilities ...



Oh it is! I would be interested to hear what users actually want from 
this sort of interaction.


DHCP isn't involved here; corosync.conf provides the IP address for the 
nodes on the internal network (using the nodeid as the LSB/LSW).



Chrissie


ISTM that some code would be needed to interpret what was going on for
any such situation so you might as well use CPG or quorum libraries in
that instance. The main point of libnozzle is that applications can
run unchanged.

That was the idea behind making it look like layer 1 going away or something
already handled by existing applications/services/...

Klaus


Chrissie


Klaus


I'm not sure if it's supported in pcs yet, but you just add the
information to corosync.conf (on all nodes):

nozzle {
  name: noz01
  ipaddr: 192.168.10.0
  ipprefix: 24
}

See corosync.conf(5) for more information.


Chrissie



If you give the cluster interface on each node a unique name in
DNS (or
hosts or whatever), you can ssh to that name.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





Re: [ClusterLabs] Running shell command on remote node via corosync messaging infrastructure

2021-01-04 Thread Christine Caulfield



On 04/01/2021 09:21, Klaus Wenninger wrote:

On 1/4/21 8:36 AM, Christine Caulfield wrote:



On 18/12/2020 20:41, Andrei Borzenkov wrote:

18.12.2020 21:54, Ken Gaillot пишет:

On Fri, 2020-12-18 at 17:51 +, Animesh Pande wrote:

Hello,

Is there a tool that would allow for commands to be run on remote
nodes in the cluster through the corosync messaging layer? I have a
cluster configured with multiple corosync communication rings (public
network and private network). I would like to be able to run a
command on the remote node through corosync layer even when the
communication ring associated with the public network goes down but
the private network communication ring is still connected.

Please let me know if there is such a tool provided by corosync that
I can use.

Thank you for your time!

Best regards,
Animesh


Hi,

No, there is not. I'm assuming you're using "remote" in the
conventional sense and not for Pacemaker Remote nodes, but the answer
is no either way. :)

Of course, you can configure sshd to listen on the cluster interface.


What do you call "cluster interface"? As I understand the question, the
idea is to use redundancy of corosync communication. Is it possible to
configure virtual interface on top of corosync rings?




Yes there is. It's called the 'nozzle' device and works in corosync >=
3.0.2.

It creates a pseudo device that passes all traffic through the knet
transport, so you get the redundancy of multiple links transparently.
You don't get join/leave up/down notifications like CPG (because it's
an interface not an API) but you can use the API if you need those.

Talking of this ... would it be possible / make sense to translate that
to something
that trigger if/link up/down somehow? (up as long as there is at least
one member
apart from the local node or something)



I'm not really sure how that would work. Forging ICMP notifications 
sounds a bit messy and we can't take the I/F down just for one node 
going down as that would be very anti-social.


ISTM that some code would be needed to interpret what was going on for 
any such situation so you might as well use CPG or quorum libraries in 
that instance. The main point of libnozzle is that applications can run 
unchanged.


Chrissie


Klaus


I'm not sure if it's supported in pcs yet, but you just add the
information to corosync.conf (on all nodes):

nozzle {
 name: noz01
 ipaddr: 192.168.10.0
 ipprefix: 24
}

See corosync.conf(5) for more information.


Chrissie



If you give the cluster interface on each node a unique name in DNS (or
hosts or whatever), you can ssh to that name.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





Re: [ClusterLabs] Running shell command on remote node via corosync messaging infrastructure

2021-01-03 Thread Christine Caulfield



On 18/12/2020 20:41, Andrei Borzenkov wrote:

18.12.2020 21:54, Ken Gaillot пишет:

On Fri, 2020-12-18 at 17:51 +, Animesh Pande wrote:

Hello,

Is there a tool that would allow for commands to be run on remote
nodes in the cluster through the corosync messaging layer? I have a
cluster configured with multiple corosync communication rings (public
network and private network). I would like to be able to run a
command on the remote node through corosync layer even when the
communication ring associated with the public network goes down but
the private network communication ring is still connected.

Please let me know if there is such a tool provided by corosync that
I can use.

Thank you for your time!

Best regards,
Animesh


Hi,

No, there is not. I'm assuming you're using "remote" in the
conventional sense and not for Pacemaker Remote nodes, but the answer
is no either way. :)

Of course, you can configure sshd to listen on the cluster interface.


What do you call "cluster interface"? As I understand the question, the
idea is to use redundancy of corosync communication. Is it possible to
configure virtual interface on top of corosync rings?




Yes there is. It's called the 'nozzle' device and works in corosync >= 
3.0.2.


It creates a pseudo device that passes all traffic through the knet 
transport, so you get the redundancy of multiple links transparently. 
You don't get join/leave up/down notifications like CPG (because it's an 
interface not an API) but you can use the API if you need those.


I'm not sure if it's supported in pcs yet, but you just add the 
information to corosync.conf (on all nodes):


nozzle {
name: noz01
ipaddr: 192.168.10.0
ipprefix: 24
}

See corosync.conf(5) for more information.


Chrissie



If you give the cluster interface on each node a unique name in DNS (or
hosts or whatever), you can ssh to that name.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/





[ClusterLabs] [Announce] libqb 2.0.2 released

2020-12-03 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.2. This is the
latest stable release of libqb


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/2.0.2/libqb-2.0.2.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

This is another miscellaneous collection of small bugfixes. The most 
noticeable change for most people will be the improved man pages.


changelog:

Chrissie Caulfield (10):
tests: Remove deprecated check macros (#412)
doxygen2man: Add option to read copyright line from the header file (#415)
man: Tidy man pages (#416)
doxygen2man: Add support for @code blocks (#417)
doxygen2man: Remove horrible hack (#420)
ipc: add qb_ipcc_auth_get() API call (#418)
ipcs: Add missing qb_list_del when freeing server (#423)
ipcs: ftruncate is not support on WIN32 (#424)
doxygen2man: Fix a couple of covscan-detected errors (#425)
cov: Quieten some covscan warnings (#427)

Christine Caulfield (1):
lib: Update library version for 2.0.2 release

Hideo Yamauchi (1):
ipcs : Decrease log level. (#426)

wferi (1):
doc related fixups (#421)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Q: cryptic messages from "QB"

2020-11-26 Thread Christine Caulfield



On 25/11/2020 13:04, Ulrich Windl wrote:

Christine Caulfield wrote on 25.11.2020 at 10:17 in message
<56738406-9222-a9f3-c57c-e30400a0b...@redhat.com>:

On 25/11/2020 08:45, Ulrich Windl wrote:

Hi!

Setting up a cluster in SLES15 SP2, I wonder about a few log messages:

1) what does "QB" stand for?

2) When QB talks about "server", does it mean "service"?
Examples:
corosync[7982]:   [QB] server name: cmap
corosync[7982]:   [QB] server name: cfg
corosync[7982]:   [QB] server name: cpg
corosync[7982]:   [QB] server name: votequorum
corosync[7982]:   [QB] server name: quorum

3) what is "7982-7987-25" in "corosync[7982]:   [QB] Denied connection, is
not ready (7982-7987-25)"?





2) "QB" is just how corosync tags messages that are issued by libqb ‑
which is the library that provides IPC services (mostly) to corosync
andothers. It's just logging the services that have been registered.

1) QB originally stood for "QuarterBack". I have no idea what that is,
though I believe it may be sport-related. In the spec file for Fedora I
renamed it to "Quite Boring" as it's a library that provides basic
services ;-)

3) That's just the unique name of the connection. It's made up of the
process PIDs and an incrementing number. The actual full IPC name in
/dev/shm has extra bits added on the end to stop them being guessable.


Thanks for confirming that it's all "Black Magic". ;-)



Pretty much all of clustering is :P

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: cryptic messages from "QB"

2020-11-25 Thread Christine Caulfield

On 25/11/2020 08:45, Ulrich Windl wrote:

Hi!

Setting up a cluster in SLES15 SP2, I wonder about a few log messages:

1) what does "QB" stand for?

2) When QB talks about "server", does it mean "service"?
Examples:
corosync[7982]:   [QB] server name: cmap
corosync[7982]:   [QB] server name: cfg
corosync[7982]:   [QB] server name: cpg
corosync[7982]:   [QB] server name: votequorum
corosync[7982]:   [QB] server name: quorum

3) what is "7982-7987-25" in "corosync[7982]:   [QB] Denied connection, is not 
ready (7982-7987-25)"?




2) "QB" is just how corosync tags messages that are issued by libqb - 
which is the library that provides IPC services (mostly) to corosync 
andothers. It's just logging the services that have been registered.


1) QB originally stood for "QuarterBack". I have no idea what that is, 
though I believe it may be sport-related. In the spec file for Fedora I 
renamed it to "Quite Boring" as it's a library that provides basic 
services ;-)


3) That's just the unique name of the connection. It's made up of the 
process PIDs and an incrementing number. The actual full IPC name in 
/dev/shm has extra bits added on the end to stop them being guessable.


Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 2.0.1 released

2020-07-29 Thread Christine Caulfield
We are pleased to announce the release of libqb 2.0.1. This is the
latest stable release of libqb


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/2.0.1/libqb-2.0.1.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.


This is a small bugfix release. The main fix is to threading in the log
subsystem when calling logging primitives from different threads in an
application (not the same as using "Threaded logging").

The most visible change is that man pages are now included in the
package. The github.io documentation is now deprecated in favour of the
included man pages. doxygen2man has also been improved so that it produces
much nicer man pages than in 2.0.0.


Chrissie Caulfield (11):
Some bugs spotted by coverity (#399)
log: Fix threading races (#396)
test: Add unit test for ipcs_connection_auth_set() (#397)
array: More locking fixes (#400)
doxygen2man - Lots of new features & fixes for parsing libqb manpages (#40
doxygen2man: Fix a couple of the worst coverity errors (#404)
CI: Remove .travis.yml (#406)
Make manpages (#405)
doxygen2man - Print structure descriptions (where available) (#408)
doxygen2man: Tidy RETURN VALUE
Bump version for 2.0.1

Fabio M. Di Nitto (2):
[docs] fix man page distribution (#407)
[doxy] fix build when more aggressive -W options are used (#410)

wladmis (1):
unix.c: use posix_fallocate() (#409)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Christine Caulfield
On 29/06/2020 10:27, Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Jun 2020 09:27:00 +0100
> Christine Caulfield  wrote:
> 
>> Is anyone (else) using this?
> 
> I do: https://clusterlabs.github.io/PAF/
> 
>> We publish the libqb man pages to clusterlabs.github.io/libqb but I
>> can't see any other clusterlabs projects using it (just by adding, eg,
>> /pacemaker to the hostname).
>>
>> With libqb 2.0.1 having actual man pages installed with it - which seems
>> far more useful to me -  I was considering dropping it if no-one else is
>> using the facility.
> 
> I have some refactoring to do on PAF website style, to adopt clusterlabs
> look & feel. This is a long overdue task on my todo list. If you drop
> clusterlabs.github.io, where will I be able to host PAF docs & stuffs?
> 


To be clear, I'm not planning to remove clusterlabs.github.io, just to
deprecate libqb from it.

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Christine Caulfield
Is anyone (else) using this?

We publish the libqb man pages to clusterlabs.github.io/libqb but I
can't see any other clusterlabs projects using it (just by adding, eg,
/pacemaker to the hostname).

With libqb 2.0.1 having actual man pages installed with it - which seems
far more useful to me -  I was considering dropping it if no-one else is
using the facility.

Any opinions?

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Linux 8.2 - high totem token requires manual setting of ping_interval and ping_timeout

2020-06-26 Thread Christine Caulfield
On 26/06/2020 07:56, Jan Friesse wrote:
> Robert,
> thank you for the info/report. More comments inside.
> 
>> All,
>> Hello.  Hope all is well.   I have been researching Oracle Linux 8.2
>> and ran across a situation that is not well documented.   I decided to
>> provide some details to the community in case I am missing something.
>>
>> Basically, if you increase the totem token above approximately 33000
>> with the knet transport, then a two node cluster will not properly
>> form.   The exact threshold value will slightly fluctuate, depending
>> on hardware type and debugging, but will consistently fail above 4.
> 
> At least corosync with 40sec timeout works just fine for me.
> 


I just tried 41 second token timeout on a 2-node and a 4-node cluster
(pcs/corosync/pacemaker) and it started up just fine. I think we'd need
to see the logs.


> # corosync-cmapctl  | grep token
> runtime.config.totem.token (u32) = 40650
> 
> # corosync-quorumtool
> Quorum information
> --
> Date: Fri Jun 26 08:45:12 2020
> Quorum provider:  corosync_votequorum
> Nodes:    2
> Node ID:  1
> Ring ID:  1.11be1
> Quorate:  Yes
> 
> Votequorum information
> --
> Expected votes:   3
> Highest expected: 3
> Total votes:  2
> Quorum:   2
> Flags:    Quorate
> 
> Membership information
> --
>     Nodeid  Votes Name
>  1  1 vmvlan-vmcos8-n05 (local)
>  6  1 vmvlan-vmcos8-n06
> 
> 
> It is indeed true that forming took a bit more time (30 sec to be more
> precise)
> 
>>
>> The failure to form a cluster would occur when running the "pcs
>> cluster start --all" command or if I would start one cluster, let it
>> stabilize, then start the second.  When it fails to form a cluster,
>> each side would say they are ONLINE, but the other side is
>> UNCLEAN(offline) (cluster state: partition WITHOUT quorum).   If I
>> define proper stonith resources, then they will not fence since the
>> cluster never makes it to an initial quorum state.  So, the cluster
>> will stay in this split state indefinitely.
> 
> Maybe some timeout in pcs?
> 
>>
>> Changing the transport back to udpu or udp, the higher totem tokens
>> worked as expected.
> 
> Yup. You've correctly found out that the knet_* timeouts help. Basically
> knet doesn't consider a link working until it gets enough pongs. UDP/UDPU doesn't
> have this concept so it will create the cluster faster.
> 
>>
>>  From the debug logging, I suspect that the Election Trigger (20
>> seconds) fires before all nodes are properly identified by the knet
>> transport.  I noticed that with a totem token passing 32 seconds, the
>> knet_ping* defaults were pushing up against that 20 second mark.  The
>> output of "corosync-cfgtool -s" will show each node's link as enabled,
>> but each side will state the other side's link is not connected.  
>> Since each side thinks the other node is not active, they fail to
>> properly send a join message to the other node during the election.  
>> They will essentially form a singleton cluster(??).  
> 
> Up to this point your analysis is correct. Corosync is really unable to send
> the join message and forms a single-node cluster.
> 
>> It is more puzzling when you start one node at a time, waiting for the
>> node to stabilize before starting the other.   It is like the first
>> node will never see the remote knet interfaces become active,
>> regardless of how long you wait.
> 
> This shouldn't happen. Knet will eventually receive enough pongs, so
> corosync broadcasts a message to the other nodes, which find out that a new
> membership should be formed.
> 
>>
>> The solution is to manually set the knet ping_timeout and
>> ping_interval to lower values than the default values derived from the
>> totem token.  This seems to allow for the knet transport to determine
>> link status of all nodes before the election timer pops.
> 
> These timeouts are indeed not the best ones. I have had a few ideas on how to
> improve them, because currently they favor multiple-link
> clusters. Single-link clusters may work better with slightly different
> defaults.
> 
>>
>> I tested this on both physical hardware and with VMs.  Both react
>> similarly.
>>
>> Bare bones test case to reproduce:
>> yum install pcs pacemaker fence-agents-all
>> firewall-cmd --permanent --add-service=high-availability
>> firewall-cmd --add-service=high-availability
>> systemctl start pcsd.service
>> systemctl enable pcsd.service
>> systemctl disable corosync
>> systemctl disable pacemaker
>> passwd hacluster
>> pcs host auth node1 node2
>> pcs cluster setup rhcs_test node1 node2 totem token=41000
>> pcs cluster start --all
>>
>> Example command to create cluster that will properly form and get quorum:
>> pcs cluster setup rhcs_test node1 node2 totem token=61000 transport
>> knet link ping_interval=1250 ping_timeout=2500
>>
>> Hope this helps someone in the future.
> 
> Yup. It is interesting finding an

[ClusterLabs] libqb 2.0.0 released

2020-05-04 Thread Christine Caulfield
We are pleased to announce the release of libqb 2.0.0. This is the
latest stable release of libqb


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/2.0.0/libqb-2.0.0.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

It has a few new features since the 1.0.x stream

 -   high resolution logging (millisecond timestamps)
 -   systemd journal logging
 -   re-opening of log files under program control

I've also removed the linker shenanigans that caused so much trouble
with compatibility in the past.
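
For context, a minimal sketch of basic libqb logging that these features build
on - the exact knobs for millisecond timestamps, journal output and log-file
re-opening are in qblog.h and the shipped man pages, so only plain
stderr/syslog usage is shown here (the gcc -lqb link flag is an assumption):

```c
/* Basic libqb logging sketch.  The new 2.0.0 features (hi-res
 * timestamps, systemd journal target, log re-open) are configured via
 * qb_log_ctl()/format options -- see qblog.h; this shows only the
 * long-standing basics.  Build with something like: gcc demo.c -lqb */
#include <syslog.h>
#include <qb/qbdefs.h>
#include <qb/qblog.h>

int main(void)
{
	qb_log_init("demo-app", LOG_USER, LOG_INFO);

	/* Also log to stderr, letting everything down to LOG_DEBUG through. */
	qb_log_filter_ctl(QB_LOG_STDERR, QB_LOG_FILTER_ADD,
	                  QB_LOG_FILTER_FILE, "*", LOG_DEBUG);
	qb_log_ctl(QB_LOG_STDERR, QB_LOG_CONF_ENABLED, QB_TRUE);

	qb_log(LOG_INFO, "hello from libqb %d.%d", 2, 0);

	qb_log_fini();
	return 0;
}
```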


The shortlog from 1.9.1 is as follows:

Chris Murphy (1):
master: Issue 390: Clarify documentation of
qb_loop_timer_expire_time_get and provide new function to return
previously documented behavior (#391)

Chrissie Caulfield (3):
list: fix list handling for gcc10 (#383)
list: #include  in qblist.h (#384)
trie: Don't assume that chars are unsigned < 126 (#386)

Fabio M. Di Nitto (3):
[build] fix configure.ac in release tarball
[build] chown the right file
Doxygen2man (#388)

Ferenc Wágner (4):
Errors are represented as negative values
Allow group access to the IPC directory
Make it impossible to truncate or overflow the connection description
Let remote_tempdir() assume a NUL-terminated name

Jan Friesse (1):
qblist: Retype ptr in qb_list_entry to char* (#385)

Jan Pokorný (3):
build: bump version for inter-release "plain repo" generated tarballs
build: allow for possible v1 branch continuity by generous SONAME offset
log: journal: fix forgotten syslog reload when flipped from journal

Jonas Witschel (1):
Set correct ownership if qb_ipcs_connection_auth_set() has been used

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 1.0.6 released

2020-04-29 Thread Christine Caulfield
We are pleased to announce the release of libqb 1.0.6 - this is a minor
update to 1.0.5 mainly to support compilation with gcc 10.


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/1.0.6/libqb-1.0.6.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

Chrissie


Shortlog:
Christine Caulfield (3):
bump version for 1.0.6
Backported fixes to allow applications to compile using gcc10 (#392)
Fix error in CI tests - make distcheck

Jan Pokorný (9):
tests: ipc: avoid problems when UNIX_PATH_MAX (108) limits is hit
tests: ipc: speed the suite up with avoiding expendable sleep(3)s
tests: ipc: allow for easier tests debugging by discerning PIDs/roles
tests: ipc: refactor/split test_ipc_dispatch part into client_dispatch
tests: ipc: check deadlock-like situation due to mixing priorities
IPC: server: avoid temporary channel priority loss, up to deadlock-worth
IPC: server: fix debug message wrt. what actually went wrong
doc: qbloop.h: document pros/cons of using built-in event loop impl
CI: travis: add (redundant for now, but...) libglib2.0-dev prerequisite

Jonas Witschel (1):
Set correct ownership if qb_ipcs_connection_auth_set() has been used (#382)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 1.9.1 released

2020-03-18 Thread Christine Caulfield
We are pleased to announce the release of libqb 1.9.1 - this is a
release candidate for a future 2.0 release


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/1.9.0/libqb-1.9.1.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in rather than the github-generated "Source Code" ones.

There are a number of important changes here.

1) The soname has been bumped up to 100 to avoid any problems if we need
to do any extra releases of libqb 1.
2) doxygen2man has been included (imported from knet) so it can be used
in other cluster projects. It's not currently used in libqb but I
anticipate it will be.
3) There are some important bugfixes to the creation of IPC files.


As always please use the signed tarballs below.

The shortlog is:

Chrissie Caulfield (3):
list: fix list handling for gcc10 (#383)
list: #include  in qblist.h (#384)
trie: Don't assume that chars are unsigned < 126 (#386)

Fabio M. Di Nitto (3):
[build] fix configure.ac in release tarball
[build] chown the right file
Doxygen2man (#388)

Ferenc Wágner (2):
Errors are represented as negative values
Allow group access to the IPC directory

Jan Friesse (1):
qblist: Retype ptr in qb_list_entry to char* (#385)

Jan Pokorný (2):
build: bump version for inter-release "plain repo" generated tarballs
build: allow for possible v1 branch continuity by generous SONAME offset

Jonas Witschel (1):
Set correct ownership if qb_ipcs_connection_auth_set() has been used

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Announce] libqb 1.9.0 released

2020-01-13 Thread Christine Caulfield
On 13/12/2019 15:00, Yan Gao wrote:
> Hi Christine,
> 
> Congratulations and thanks for the release!
> 
> As previously brought from: 
> https://github.com/ClusterLabs/libqb/issues/338#issuecomment-503155816
> 
> , the master branch has this too:
> 
> https://github.com/ClusterLabs/libqb/commit/6a4067c1d1764d93d255eccecfd8bf9f43cb0b4d
> 
> , but doesn't seem to have:
> 
> https://github.com/ClusterLabs/libqb/pull/349
> 
> Does it mean the master branch is somehow not impacted by the issues, or 
> some other solutions are being sought there? Thanks.
> 
> Regards,
>Yan
> 


Thank you for the heads-up on that. I've re-posted wferi's commits as
PR#379.

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Announce] libqb 1.9.0 released

2020-01-06 Thread Christine Caulfield
Hi Yan,

I'm just back from the break, I'll look into the issues you've raised.
That's why we do release candidates :)

Chrissie

On 13/12/2019 15:00, Yan Gao wrote:
> Hi Christine,
> 
> Congratulations and thanks for the release!
> 
> As previously brought from: 
> https://github.com/ClusterLabs/libqb/issues/338#issuecomment-503155816
> 
> , the master branch has this too:
> 
> https://github.com/ClusterLabs/libqb/commit/6a4067c1d1764d93d255eccecfd8bf9f43cb0b4d
> 
> , but doesn't seem to have:
> 
> https://github.com/ClusterLabs/libqb/pull/349
> 
> Does it mean the master branch is somehow not impacted by the issues, or 
> some other solutions are being sought there? Thanks.
> 
> Regards,
>Yan
> 
> 
> 
> On 12/12/19 5:37 PM, christine caulfield wrote:
>> We are pleased to announce the release of libqb 1.9.0 - this is a 
>> release candidate for a future 2.0 release
>>
>>
>> Source code is available at:
>> https://github.com/ClusterLabs/libqb/releases/download/1.9.0/libqb-1.9.0.tar.xz
>>  
>>
>>
>> Please use the signed .tar.gz or .tar.xz files with the version number
>> in rather than the github-generated "Source Code" ones.
>>
>> There are a small number of new features:
>>
>>      high resolution logging (millisecond timestamps)
>>      systemd journal logging
>>      re-opening of log files under program control
>>
>> and many bug fixes.
>>
>> I've also removed the linker shenanigans that caused so much trouble 
>> with compatibility in the past, which is the main reason for making this 
>> 2.0.0 rather than 1.0.6
>>
>> Thanks to all the many people that made this possible.
>>
>> Chrissie
>>
>> shortlog:
>>
>> Chrissie Caulfield (27):
>> tests: Improve test isolation (#298)
>> test: Fix 'make distcheck' (#303)
>> ipc_shm: Don't truncate SHM files of an active server (#307)
>> Allow customisable log line length (#292)
>> log: Use RTLD_NOOPEN when checking symbols (#310)
>> UPDATED: doc (ABI comparison) and various other fixes (#324)
>> logging: Remove linker 'magic' and just use statics for logging 
>> callsites (#322)
>> log: Add option to re-open a log file (#326)
>> log: Add configure-time option to use systemd journal instead of syslog 
>> (#327)
>> Add the option of hi-res (millisecond) timestamps (#329)
>> log: Remove more dead code from linker callsites (#331)
>> tests: Shorted deadlock test names (#372)
>> make: Remove splint tests (#374)
>> skiplist: fix use-after-free in the skiplist traversal
>> skiplist: Fix previous skiplist fix
>> tests: allow blackbox-segfault.sh to run out-of-tree
>> ipc: use O_EXCL on SHM files, and randomize the names
>> ipc: fixes
>> ipc: use O_EXCL when opening IPC files
>> ipc: Use mkdtemp for more secure IPC files
>> ipc: Use mkdtemp for more secure IPC files
>> version: update version-info for 1.0.4 release
>> version: bump soname for 1.0.5 release
>> ipc: fix force-filesystem-sockets
>> tests: Speed up IPC tests, especially on FreeBSD
>> ipc: Remove kqueue EOF log message
>> lib: Fix some minor warnings from newer compilers
>>
>> Daniel Black (4):
>> tests: blackbox-segfault test - remove residual core files
>> CI: travis: show logs of test failures
>> build: split hack for splint to work on non-x86 architectures
>> build: dpkg-architecture on trusty (cf. Travis CI) uses -q{NAME}
>>
>> Fabio M. Di Nitto (8):
>> tests: use RUNPATH instead of RPATH consistently (#309)
>> [build] fix supported compiler warning detection (#330)
>> [test-rpm] build test binaries by default
>> [tests] export SOCKETDIR from tests/Makefile.am
>> [tests] allow installation of test suite
>> [tests] enable building / shipping of libqb-tests.rpm
>> [tests] first pass at fixing test execution
>> [build] add --with-sanitizers= option for sanitizer builds (#366)
>>
>> Ferenc Wágner (8):
>> Fix spelling: plaform -> platform
>> Fix garbled Doxygen markup
>> Errors are represented as negative values
>> Allow group access to the IPC directory
>> Make it impossible to truncate or overflow the connection description
>> Let remote_tempdir() assume a NUL-terminated name
>> doc: qbarray.h: remove stray asterisk and parentheses
>> doc: qbarray: reword comment about index partitioning
>>
>> Jan Friesse (2):
>> ipc: Fix named socket unlink on FreeBSD
>> ipc: Always initialize response struct
>>
>> Jan Pokorný (15):

[ClusterLabs] [Announce] libqb 1.9.0 released

2019-12-12 Thread christine caulfield
We are pleased to announce the release of libqb 1.9.0 - this is a 
release candidate for a future 2.0 release



Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/1.9.0/libqb-1.9.0.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in the name rather than the github-generated "Source Code" ones.

There are a small number of new features:

high resolution logging (millisecond timestamps)
systemd journal logging
re-opening of log files under program control

and many bug fixes.

I've also removed the linker shenanigans that caused so much trouble 
with compatibility in the past, which is the main reason for making this 
2.0.0 rather than 1.0.6


Thanks to all the many people that made this possible.

Chrissie

shortlog:

Chrissie Caulfield (27):
tests: Improve test isolation (#298)
test: Fix 'make distcheck' (#303)
ipc_shm: Don't truncate SHM files of an active server (#307)
Allow customisable log line length (#292)
log: Use RTLD_NOOPEN when checking symbols (#310)
UPDATED: doc (ABI comparison) and various other fixes (#324)
logging: Remove linker 'magic' and just use statics for logging 
callsites (#322)

log: Add option to re-open a log file (#326)
log: Add configure-time option to use systemd journal instead of syslog 
(#327)

Add the option of hi-res (millisecond) timestamps (#329)
log: Remove more dead code from linker callsites (#331)
tests: Shorted deadlock test names (#372)
make: Remove splint tests (#374)
skiplist: fix use-after-free in the skiplist traversal
skiplist: Fix previous skiplist fix
tests: allow blackbox-segfault.sh to run out-of-tree
ipc: use O_EXCL on SHM files, and randomize the names
ipc: fixes
ipc: use O_EXCL when opening IPC files
ipc: Use mkdtemp for more secure IPC files
ipc: Use mkdtemp for more secure IPC files
version: update version-info for 1.0.4 release
version: bump soname for 1.0.5 release
ipc: fix force-filesystem-sockets
tests: Speed up IPC tests, especially on FreeBSD
ipc: Remove kqueue EOF log message
lib: Fix some minor warnings from newer compilers

Daniel Black (4):
tests: blackbox-segfault test - remove residual core files
CI: travis: show logs of test failures
build: split hack for splint to work on non-x86 architectures
build: dpkg-architecture on trusty (cf. Travis CI) uses -q{NAME}

Fabio M. Di Nitto (8):
tests: use RUNPATH instead of RPATH consistently (#309)
[build] fix supported compiler warning detection (#330)
[test-rpm] build test binaries by default
[tests] export SOCKETDIR from tests/Makefile.am
[tests] allow installation of test suite
[tests] enable building / shipping of libqb-tests.rpm
[tests] first pass at fixing test execution
[build] add --with-sanitizers= option for sanitizer builds (#366)

Ferenc Wágner (8):
Fix spelling: plaform -> platform
Fix garbled Doxygen markup
Errors are represented as negative values
Allow group access to the IPC directory
Make it impossible to truncate or overflow the connection description
Let remote_tempdir() assume a NUL-terminated name
doc: qbarray.h: remove stray asterisk and parentheses
doc: qbarray: reword comment about index partitioning

Jan Friesse (2):
ipc: Fix named socket unlink on FreeBSD
ipc: Always initialize response struct

Jan Pokorný (15):
build: fix configure script neglecting, re-enable out-of-tree builds
build: configure: fix non-portable '\s' and '//{q}' in sed expression
build: allow for being consumed in a (non-endorsed) form of snapshots
build: configure: fix "snapshot consumption" feature on FreeBSD
tests: ipc: avoid problems when UNIX_PATH_MAX (108) limits is hit
tests: ipc: speed the suite up with avoiding expendable sleep(3)s
tests: ipc: allow for easier tests debugging by discerning PIDs/roles
tests: ipc: refactor/split test_ipc_dispatch part into client_dispatch
tests: ipc: check deadlock-like situation due to mixing priorities
IPC: server: avoid temporary channel priority loss, up to deadlock-worth
IPC: server: fix debug message wrt. what actually went wrong
doc: qbloop.h: document pros/cons of using built-in event loop impl
CI: travis: add (redundant for now, but...) libglib2.0-dev prerequisite
tests: ipc: fix the no-GLib conditionalizing
ringbuffer: fix mistaken errno handling around _rb_chunk_reclaim

Ken Gaillot (2):
log: Set errno when qb_log_target_alloc() fails
array,log: Never set errno to a negative value

Yusuke Iida (1):
configure: Fixed the problem that librt was explicitly needed in RHEL 6 
(#328)


wferi (2):
Fix comment typo (#296)
Add Pthreads (and possibly other) flags to the pkg-config file (#332)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors

2019-11-21 Thread christine caulfield

On 18/11/2019 21:31, Jean-Francois Malouin wrote:

Hi,

Maybe not directly a pacemaker question but maybe some of you have seen this
problem:

A 2 node pacemaker cluster running corosync-3.0.1 with dual communication ring
sometimes reports errors like this in the corosync log file:

[KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
[KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
[KNET  ] pmtud: Global data MTU changed to: 1366
[CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at 
run-time
[CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at 
run-time

Those do not happen very frequently, once a week or so...



Those messages are caused by a config file reload (corosync-cfgtool -R) 
being triggered by something. If they're happening once a week then 
check your cron jobs.
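
A quick way to find the culprit is to search the usual cron locations for
anything that calls corosync-cfgtool (the paths below are just the common
defaults):

  grep -r corosync-cfgtool /etc/cron* /var/spool/cron 2>/dev/null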



However the system log on the nodes reports those much more frequently, a few
times a day:

Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best 
link: 0 (pri: 0)
Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best 
link: 1 (pri: 1)



Those don't look good. Having a link down for 6 seconds looks like a 
serious network outage that needs looking into, especially if they are 
that frequent, or it could be a bug. You don't say which version of 
libknet you have installed but make sure it's the latest one.


The fencing event in your other message happened because both links 
were down at the same time, which is a worrying coincidence. Changing 
the token timeout won't make any difference to the knet link events, but 
if the knet links are down for long enough then that will trigger a 
token timeout and a fence event.


Definitely look for something odd in your networking - the corosync.conf 
file looks sane (though having knet_transport in the top-level totem 
stanza is doing nothing), so it's not that.


It's hard to make a judgement with just that info, but look for dropped 
packets on the interfaces, slow response to other network services or 
very high load on one of the nodes. If you can't see anything on the 
systems then enable debug logging and get back to us. If it is a bug we 
want it fixed!


Chrissie



Are those to be dismissed or are they indicative of a network misconfig/problem?
I tried setting 'knet_transport: udpu' in the totem section (the default value)
but it didn't seem to make a difference... Hard coding netmtu to 1500 and
allowing for longer (10s) token timeout also didn't seem to affect the issue.


Corosync config follows:

/etc/corosync/corosync.conf

totem {
 version: 2
 cluster_name: bicha
 transport: knet
 link_mode: passive
 ip_version: ipv4
 token: 1
 netmtu: 1500
 knet_transport: sctp
 crypto_model: openssl
 crypto_hash: sha256
 crypto_cipher: aes256
 keyfile: /etc/corosync/authkey
 interface {
 linknumber: 0
 knet_transport: udp
 knet_link_priority: 0
 }
 interface {
 linknumber: 1
 knet_transport: udp
 knet_link_priority: 1
 }
}
quorum {
 provider: corosync_votequorum
 two_node: 1
#expected_votes: 2
}
nodelist {
 node {
 ring0_addr: xxx.xxx.xxx.xxx
 ring1_addr: zzz.zzz.zzz.zzx
 name: node1
 nodeid: 1
 }
 node {
 ring0_addr: xxx.xxx.xxx.xxy
 ring1_addr: zzz.zzz.zzz.zzy
 name: node2
 nodeid: 2
 }
}
logging {
 to_logfile: yes
 to_syslog: yes
 logfile: /var/log/corosync/corosync.log
 syslog_facility: daemon
 debug: off
 timestamp: on
 logger_subsys {
 subsys: QUORUM
 debug: off
 }
}
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Announcing ClusterLabs Summit 2020

2019-11-12 Thread christine caulfield

On 11/11/2019 13:21, Thomas Lamprecht wrote:

On 11/5/19 3:07 AM, Ken Gaillot wrote:

Hi all,

A reminder: We are still interested in ideas for talks, and rough
estimates of potential attendees. "Maybe" is perfectly fine at this
stage. It will let us negotiate hotel rates and firm up the location
details.


Maybe we (Proxmox) could also come, Vienna isn't too far away, after
all. If interested I could do a small talk about our knet/corosync +
multi-master clustered configuration filesystem + HA stack, if there's
interest at all :)




yes please!

Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DLM in the cluster can tolerate more than one node failure at the same time?

2019-10-23 Thread christine caulfield

On 22/10/2019 07:15, Gang He wrote:

Hi List,

I remember that the master node holds the full copy of a DLM lock resource and
the other nodes hold their own lock state,
so if one node fails (or is fenced), the DLM lock state can be recovered
quickly from the remaining nodes.
My question is:
if more than one node fails at the same time, can the DLM lock
service for the remaining nodes in the cluster still continue to work after
recovery?




Yes. The local DLM keeps a copy of its own locks and the remaining DLM 
nodes will collaborate to re-master all of the locks that they know 
about.  The number of nodes that leave at one time has no impact on this.


Chrissie

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Announce] libqb 1.0.5 release

2019-04-25 Thread Christine Caulfield
We are pleased to announce the release of libqb 1.0.5

Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.5/libqb-1.0.5.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in the name rather than the github-generated "Source Code" ones.

This release is an update to fix a regression in 1.0.4, huge thanks to
wferi for all the help with this

Chrissie
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 1.0.4 release

2019-04-15 Thread Christine Caulfield
We are pleased to announce the release of libqb 1.0.4

Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.4/libqb-1.0.4.tar.xz

Please use the signed .tar.gz or .tar.xz files with the version number
in the name rather than the github-generated "Source Code" ones.

This is a security update to 1.0.3. Files are now opened with O_EXCL and
are placed in directories created by mkdtemp(). It is backwards
compatible with 1.0.3.
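
For anyone curious, the general pattern is roughly the one sketched below.
This is only an illustration of the mkdtemp()/O_EXCL technique, not the
actual libqb code, and the paths are made up:

#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* mkdtemp() creates a private directory with mode 0700 */
    char dir[] = "/tmp/qb-example-XXXXXX";
    if (mkdtemp(dir) == NULL) {
        perror("mkdtemp");
        return 1;
    }

    /* O_CREAT|O_EXCL fails if the file already exists, so a
     * pre-planted file or symlink cannot be hijacked */
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/data", dir);
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* ... use fd (e.g. ftruncate() + mmap() for a shared-memory file) ... */
    close(fd);
    unlink(path);
    rmdir(dir);
    return 0;
}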

Chrissie
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why do clusters have a name?

2019-03-28 Thread Christine Caulfield
On 26/03/2019 20:12, Brian Reichert wrote:
> This will sound like a dumb question:
> 
> The manpage for pcs(8) implies that to set up a cluster, one needs
> to provide a name.
> 
> Why do clusters have names?
> 
> Is there a use case wherein there would be multiple clusters visible
> in an administrative UI, such that they'd need to be differentiated?
> 


Alongside the current usage there's some history here.  Originally (when
cman was in the kernel) the name was used to get the correct information
from the centralised cluster configuration daemon (ccsd).

After that it got used as a hash to generate a cluster_id for clusters
that might be on the same network (cluster_id as a number was also
allowed, but as name was already a field it seemed sensible to keep
using it). The hashed cluster_id was included in the protocol so that
clashing clusters would ignore each other's messages. In a later
revision the cluster name was also hashed to generate a very primitive
encryption key for openais if one was not provided. This was, again,
more to provide isolation than actual security.

Of course it's used for none of those things now, but that's where it
came from originally :)

Chrissie
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Can subsequent rings be added to established cluster?

2019-02-25 Thread Christine Caulfield
On 21/02/2019 18:33, lejeczek wrote:
> hi guys
> 
> as per the subject.
> 
> Would there be some nice docs/howto? Or maybe it's just standard op
> procedure?
> 


With corosync 3 you can add links (similar to rings from the user POV)
dynamically just by adding the necessary ringX_addr entries to
corosync.conf (on all nodes) and reloading. I'm not sure if pcs supports
this yet, so you might have to do it manually. When pcs gets this
feature it will get documented.
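
As a rough illustration (node names and addresses here are invented), adding
a second link to a two-node knet cluster means giving every node a ring1_addr
in corosync.conf on all nodes, then running 'corosync-cfgtool -R':

nodelist {
    node {
        ring0_addr: 192.168.1.1
        # new second link
        ring1_addr: 10.0.0.1
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.2
        # new second link
        ring1_addr: 10.0.0.2
        name: node2
        nodeid: 2
    }
}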

In the meantime this document probably has WAY too much information but
does explain what's going on.

https://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf


Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Christine Caulfield
On 15/02/2019 16:58, Edwin Török wrote:
> On 15/02/2019 16:08, Christine Caulfield wrote:
>> On 15/02/2019 13:06, Edwin Török wrote:
>>> I tried again with 'debug: trace', lots of process pause here:
>>> https://clbin.com/ZUHpd
>>>
>>> And here is an strace running realtime prio 99, a LOT of epoll_wait and
>>> sendmsg (gz format):
>>> https://clbin.com/JINiV
>>>
>>> It detects large numbers of members left, but I think this is because
>>> the corosync on those hosts got similarly stuck:
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
>>> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
>>> 1 10
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
>>> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
>>>
>>> Looking on another host that is still stuck 100% corosync it says:
>>> https://clbin.com/6UOn6
>>>
>>
>> Thanks, that's really quite odd. I have vague recollections of a problem
>> where corosync was spinning on epoll without reading anything but can't
>> find the details at the moment, annoying.
>>
>> Some thing you might be able to try that might help.
>>
>> 1) is is possible to run without sbd. Sometimes too much polling from
>> clients can cause odd behaviour
>> 2) is it possible to try with a different kernel? We've tried a vanilla
>> 4.19 and it's fine, but not with the Xen patches obviously
> 
> I'll try with some bare-metal upstream distros and report back the repro
> steps if I can get it to reliably repro, hopefully early next week, it
> is unlikely I'll get a working repro today.
> 
>> 3) Does running corosync with the -p option help?
> 
> Yes, with "-p" I was able to run cluster create/GFS2 plug/unplug/destroy
> on 16 physical hosts in a loop for an hour with any crashes (previously
> it would crash within minutes).
> 
> I found another workaround too:
> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
> 
> This makes the 95% realtime process CPU limit from
> sched_rt_runtime_us/sched_rt_period_us apply per core, instead of
> globally, so there would be 5% time left for non-realtime tasks on each
> core. Seems to be enough to avoid the livelock, I was not able to
> observe corosync using high CPU % anymore.
> Still got more tests to run on this over the weekend, but looks promising.
> 
> This is a safety layer of course, to prevent the system from fencing if
> we encounter high CPU usage in corosync/libq. I am still interested in
> tracking down the corosync/libq issue as it shouldn't have happened in
> the first place.
> 

That's helpful to know. Does corosync still use lots of CPU time in this
situation (without RT) or does it behave normally?

>>
>> Is there any situation where this has worked? either with different
>> components or different corosync.conf files?
>>
>> Also, and I don't think this is directly related to the issue, but I can
>> see configuration reloads happening from 2 nodes every 5 seconds. It's
>> very odd and maybe not what you want!
> 
> The configuration reloads are a way of triggering this bug reliably, I
> should've mentioned that earlier
> (the problem happens during a configuration reload, but not always, and
> by doing configuration reloads in a loop that just add/remove one node
> the problem can be triggered reliably within minutes).
> 
> 

I've been trying this on my (KVM) virtual machines today but I can't
reproduce it on a Standard RHEL-7, so I'm interested to see how you get
on with a different kernel.

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 15/02/2019 13:06, Edwin Török wrote:
> 
> 
> On 15/02/2019 11:12, Christine Caulfield wrote:
>> On 15/02/2019 10:56, Edwin Török wrote:
>>> On 15/02/2019 09:31, Christine Caulfield wrote:
>>>> On 14/02/2019 17:33, Edwin Török wrote:
>>>>> Hello,
>>>>>
>>>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>>>> noticed a fundamental problem with realtime priorities:
>>>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>>>>> also routed to CPU3
>>>>> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
>>>>> without it packets sent/received from that interface would not get 
>>>>> processed
>>>>> - corosync is in a busy loop using 100% CPU, never giving a chance for
>>>>> softirqs to be processed (including TIMER and SCHED)
>>>>>
>>>>
>>>>
>>>> Can you tell me what distribution this is please? 
>>> This is a not-yet-released development version of XenServer based on
>>> CentOS 7.5/7.6.
>>> The kernel is 4.19.19 + patches to make it work well with Xen
>>> (previously we were using a 4.4.52 + Xen patches and backports kernel)
>>>
>>> The versions of packages are:
>>> rpm -q libqb corosync dlm sbd kernel
>>> libqb-1.0.1-6.el7.x86_64
>>> corosync-2.4.3-13.xs+2.0.0.x86_64
>>> dlm-4.0.7-1.el7.x86_64
>>> sbd-1.3.1-7.xs+2.0.0.x86_64
>>> kernel-4.19.19-5.0.0.x86_64
>>>
>>> Package versions with +xs in version have xenserver specific patches
>>> applied, libqb is coming straight from upstream CentOS here:
>>> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
>>>
>>>> There are patches to
>>>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>>>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>>>> need to be sure that those fixes are in your libqb before going any
>>> further.
>>>
>>> We have libqb 1.0.1-6 from CentOS, it looks like there is 1.0.1-7 which
>>> includes an SHM crash fix, is this the one you were refering to, or is
>>> there an additional patch elsewhere?
>>> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
>>>
>>
>> Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
>> wanted to check it wasn't someone else's package that didn't have that
>> patch in. The SHM patch in -7 fixes a race at shutdown (often seen with
>> sbd). That shouldn't be a problem because there is a workaround in -6
>> anyway, and it's not fixing a spin, which is what we have here of course.
>>
>> Are there any messages in the system logs from either corosync or
>> related subsystems?
> 
> 
> I tried again with 'debug: trace', lots of process pause here:
> https://clbin.com/ZUHpd
> 
> And here is an strace running realtime prio 99, a LOT of epoll_wait and
> sendmsg (gz format):
> https://clbin.com/JINiV
> 
> It detects large numbers of members left, but I think this is because
> the corosync on those hosts got similarly stuck:
> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
> 1 10
> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
> 
> Looking on another host that is still stuck 100% corosync it says:
> https://clbin.com/6UOn6
> 

Thanks, that's really quite odd. I have vague recollections of a problem
where corosync was spinning on epoll without reading anything but can't
find the details at the moment, annoying.

Some thing you might be able to try that might help.

1) is is possible to run without sbd. Sometimes too much polling from
clients can cause odd behaviour
2) is it possible to try with a different kernel? We've tried a vanilla
4.19 and it's fine, but not with the Xen patches obviously
3) Does running corosync with the -p option help?

Is there any situation where this has worked? either with different
components or different corosync.conf files?

Also, and I don't think this is directly related to the issue, but I can
see configuration reloads happening from 2 nodes every 5 seconds. It's
very odd and maybe not what you want!

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 15/02/2019 10:56, Edwin Török wrote:
> On 15/02/2019 09:31, Christine Caulfield wrote:
>> On 14/02/2019 17:33, Edwin Török wrote:
>>> Hello,
>>>
>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>> noticed a fundamental problem with realtime priorities:
>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>>> also routed to CPU3
>>> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
>>> without it packets sent/received from that interface would not get processed
>>> - corosync is in a busy loop using 100% CPU, never giving a chance for
>>> softirqs to be processed (including TIMER and SCHED)
>>>
>>
>>
>> Can you tell me what distribution this is please? 
> This is a not-yet-released development version of XenServer based on
> CentOS 7.5/7.6.
> The kernel is 4.19.19 + patches to make it work well with Xen
> (previously we were using a 4.4.52 + Xen patches and backports kernel)
> 
> The versions of packages are:
> rpm -q libqb corosync dlm sbd kernel
> libqb-1.0.1-6.el7.x86_64
> corosync-2.4.3-13.xs+2.0.0.x86_64
> dlm-4.0.7-1.el7.x86_64
> sbd-1.3.1-7.xs+2.0.0.x86_64
> kernel-4.19.19-5.0.0.x86_64
> 
> Package versions with +xs in version have xenserver specific patches
> applied, libqb is coming straight from upstream CentOS here:
> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
> 
>> There are patches to
>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>> need to be sure that those fixes are in your libqb before going any
> further.
> 
> We have libqb 1.0.1-6 from CentOS, it looks like there is 1.0.1-7 which
> includes an SHM crash fix, is this the one you were refering to, or is
> there an additional patch elsewhere?
> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
> 

Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
wanted to check it wasn't someone else's package that didn't have that
patch in. The SHM patch in -7 fixes a race at shutdown (often seen with
sbd). That shouldn't be a problem because there is a workaround in -6
anyway, and it's not fixing a spin, which is what we have here of course.

Are there any messages in the system logs from either corosync or
related subsystems?

Chrissie

>> Without doubt this is a bug, in normal operation corosync is quite light
>> on CPU.
> 
> Thanks for the help in advance,
> --Edwin
> 
>>
>> Chrissie
>>
>>> This appears to be a priority inversion problem, if corosync runs as
>>> realtime then everything it needs (timers...) should be realtime as
>>> well, otherwise running as realtime guarantees we'll miss the watchdog
>>> deadline, instead of guaranteeing that we process the data before the
>>> deadline.
>>>
>>> Do you have some advice on what the expected realtime priorities would
>>> be for:
>>> - corosync
>>> - sbd
>>> - hard irqs
>>> - soft irqs
>>>
>>> Also would it be possible for corosync to avoid hogging the CPU in libqb?
>>> (Our hypothesis is that if softirqs are not processed then timers
>>> wouldn't work for processes on that CPU either)
>>>
>>> Some more background and analysis:
>>>
>>> We noticed that cluster membership changes were very unstable
>>> (especially as you approach 16 physical host clusters) causing the whole
>>> cluster to fence when adding a single node. This was previously working
>>> fairly reliably with a 4.4 based kernel.
>>>
>>> I've increased SBD timeout to 3600 to be able to investigate the problem
>>> and noticed that corosync was using 100% of CPU3 [1] and I immediately
>>> lost SSH access on eth0 (corosync was using eth1), where eth0's
>>> interrupts were also processed on CPU3 (and irqbalance didn't move it).
>>>
>>> IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
>>> to [3] it shouldn't be allowed to use more than 95% of CPU.
>>>
>>> Softirqs were not processed at all on CPU3 (see [2], the numbers in the
>>> CPU3 column did not change, the numbers in the other columns did).
>>> Tried decreasing priority of corosync using chrt to 1, which didn't
>>> help. I then increased the priority of ksoftirqd to 50 using chrt, which
>>> immediately solved the CPU usage problem on corosync.

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 14/02/2019 17:33, Edwin Török wrote:
> Hello,
> 
> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
> noticed a fundamental problem with realtime priorities:
> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
> also routed to CPU3
> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
> without it packets sent/received from that interface would not get processed
> - corosync is in a busy loop using 100% CPU, never giving a chance for
> softirqs to be processed (including TIMER and SCHED)
> 


Can you tell me what distribution this is please? There are patches to
libqb that should be applied to fix a similar problem in 1.0.1-6 - but
that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
need to be sure that those fixes are in your libqb before going any further.

Without doubt this is a bug, in normal operation corosync is quite light
on CPU.

Chrissie

> This appears to be a priority inversion problem, if corosync runs as
> realtime then everything it needs (timers...) should be realtime as
> well, otherwise running as realtime guarantees we'll miss the watchdog
> deadline, instead of guaranteeing that we process the data before the
> deadline.
> 
> Do you have some advice on what the expected realtime priorities would
> be for:
> - corosync
> - sbd
> - hard irqs
> - soft irqs
> 
> Also would it be possible for corosync to avoid hogging the CPU in libqb?
> (Our hypothesis is that if softirqs are not processed then timers
> wouldn't work for processes on that CPU either)
> 
> Some more background and analysis:
> 
> We noticed that cluster membership changes were very unstable
> (especially as you approach 16 physical host clusters) causing the whole
> cluster to fence when adding a single node. This was previously working
> fairly reliably with a 4.4 based kernel.
> 
> I've increased SBD timeout to 3600 to be able to investigate the problem
> and noticed that corosync was using 100% of CPU3 [1] and I immediately
> lost SSH access on eth0 (corosync was using eth1), where eth0's
> interrupts were also processed on CPU3 (and irqbalance didn't move it).
> 
> IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
> to [3] it shouldn't be allowed to use more than 95% of CPU.
> 
> Softirqs were not processed at all on CPU3 (see [2], the numbers in the
> CPU3 column did not change, the numbers in the other columns did).
> Tried decreasing priority of corosync using chrt to 1, which didn't
> help. I then increased the priority of ksoftirqd to 50 using chrt, which
> immediately solved the CPU usage problem on corosync.
> 
> I tried a simple infinite loop program with realtime priority, but it
> didn't reproduce the problems with interrupts getting stuck.
> 
> 
> Three problems here:
> * all softirqs were stuck (not being processed) on CPU3, which included
> TIMER and SCHED. corosync relies quite heavily on timers, would lack of
> processing them cause the 100% CPU usage?
> * is there a kernel bug introduced between 4.4 - 4.19 that causes
> realtime tasks to not respect the 95% limit anymore? This would leave 5%
> time for IRQs, including NIC IRQs
> *  if corosync runs at higher priority than the kernel softirq thread
> processing NIC IRQ how is corosync expecting incoming packets to be
> processed, if it is hogging the CPU by receiving, polling and sending
> packets?
> 
> On another host which exhibited the same problem I've run strace (which
> also had the side-effect of getting corosync unstuck from 100% CPU use
> after strace finished):
> 1 bind
> 5 close
> 688 epoll_wait
> 8 futex
> 1 getsockname
> 3 ioctl
> 1 open
> 3 recvfrom
> 190 recvmsg
> 87 sendmsg
> 9 sendto
> 4 socket
> 6 write
> 
> On yet another host I've run gdb while corosync was stuck:
> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
> #0 0x7f6fd34a0afb in do_futex_wait.constprop.1 ()
> from /lib64/libpthread.so.0
> #1 0x7f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
> from /lib64/libpthread.so.0
> #2 0x7f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
> /lib64/libpthread.so.0
> #3 0x7f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
> #4 0x7f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
> #5 0x7f6fd31c4bad in clone () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
> #0 0x7f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
> #1 0x7f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
> #2 0x7f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
> #3 0x55592d62ff78 in main ()
> 
> 
> [1]
> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> 0.0 st
> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> 0.0 st
> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 

Re: [ClusterLabs] Corosync 3.0.0 is available at corosync.org!

2018-12-17 Thread Christine Caulfield
On 17/12/2018 12:14, Jan Pokorný wrote:
> On 17/12/18 10:04 +0000, Christine Caulfield wrote:
>> On 17/12/2018 09:34, Ulrich Windl wrote:
>>> I wonder: Is there a migration script that can converts corosync.conf files?
>>> At least you have a few version components in the config file that will help
>>> such tool to know what to do... ;-)
>>
>> Sadly not - that I know of. The clufter project *may* be looking into it
>> but I have no inside knowledge on that.
> 
> Yes, it's in the backlog, stay tuned.

excellent! thank you :)

Chrissie

> 
>> Single ring UDP/UDPU config files should work without change anyway.
>> Multi-ring configs will need changing to transport: knet and names
>> adding to nodes.
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Corosync 3.0.0 is available at corosync.org!

2018-12-17 Thread Christine Caulfield
On 17/12/2018 09:34, Ulrich Windl wrote:
 Jan Friesse wrote on 14.12.2018 at 15:06 in
> message
> <991569e4-2430-30f1-1bbc-827be7637...@redhat.com>:
> [...]
>> - UDP/UDPU transports are still present, but support only a single ring 
>> (RRP is gone in favor of Knet) and don't support encryption
> [...]
> 
> I wonder: Is there a migration script that can converts corosync.conf files?
> At least you have a few version components in the config file that will help
> such tool to know what to do... ;-)
> 


Sadly not - that I know of. The clufter project *may* be looking into it
but I have no inside knowledge on that.

Single ring UDP/UDPU config files should work without change anyway.
Multi-ring configs will need changing to 'transport: knet', with names
added to the node entries.
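
As a sketch only (cluster name, node names and addresses are invented), a
minimal converted corosync 3 config looks roughly like this:

totem {
    version: 2
    cluster_name: mycluster
    transport: knet
}

nodelist {
    node {
        ring0_addr: 192.168.1.1
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.2
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}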

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Corosync 3 release plans?

2018-10-01 Thread Christine Caulfield
On 01/10/18 07:45, Ulrich Windl wrote:
>>>> Ferenc Wágner wrote on 27.09.2018 at 21:16
> in
> message <87zhw23g5p@lant.ki.iif.hu>:
>> Christine Caulfield  writes:
>>
>>> I'm also looking into high-res timestamps for logfiles too.
>>
>> Wouldn't that be a useful option for the syslog output as well?  I'm
>> sometimes concerned by the batching effect added by the transport
>> between the application and the (local) log server (rsyslog or systemd).
>> Reliably merging messages from different channels can prove impossible
>> without internal timestamps (even considering a single machine only).
>>
>> Another interesting feature could be structured, direct journal output
>> (if you're looking for challenges).
> 
> Make it configurable please; most lines are long enough even without extra
> timestamps.
> 

Don't worry, I will :)

Chrissie

>> -- 
>> Regards,
>> Feri
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-28 Thread Christine Caulfield
On 27/09/18 20:16, Ferenc Wágner wrote:
> Christine Caulfield  writes:
> 
>> I'm also looking into high-res timestamps for logfiles too.
> 
> Wouldn't that be a useful option for the syslog output as well?  I'm
> sometimes concerned by the batching effect added by the transport
> between the application and the (local) log server (rsyslog or systemd).
> Reliably merging messages from different channels can prove impossible
> without internal timestamps (even considering a single machine only).
> 
> Another interesting feature could be structured, direct journal output
> (if you're looking for challenges).
> 


I'm inclined to leave syslog timestamps to syslog - rsyslog has the
option for hi-res timestamps (yes, I know it stamps them on receipt and
all that) if you need them. Adding 'proper' journal output sounds like a
good idea to me though

I'm not so much looking for challenges as looking to make libqb more
useful for the people using it :)


Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 27/09/18 16:01, Ken Gaillot wrote:
> On Thu, 2018-09-27 at 09:58 -0500, Ken Gaillot wrote:
>> On Thu, 2018-09-27 at 15:32 +0200, Ferenc Wágner wrote:
>>> Christine Caulfield  writes:
>>>
>>>> TBH I would be quite happy to leave this to logrotate but the
>>>> message I
>>>> was getting here is that we need additional help from libqb. I'm
>>>> willing
>>>> to go with a consensus on this though
>>>
>>> Yes, to do a proper job logrotate has to have a way to get the log
>>> files
>>> reopened.  And applications can't do that without support from
>>> libqb,
>>> if
>>> I understood Honza right.
>>
>> There are two related issues:
>>
>> * Issue #142, about automatically rotating logs once they reach a
>> certain size, can be done with logrotate already. If it's a one-time
>> thing (e.g. running a test with trace), the admin can control
>> rotation
>> directly with logrotate --force /etc/logrotate.d/whatever.conf. If
>> it's
>> desired permanently, a maxsize or size line can be added to the
>> logrotate config (which, now that I think about it, would be a good
>> idea for the default pacemaker config).
>>
>> * Issue #239, about the possibility of losing messages with
>> copytruncate, has a widely used, easily implemented, and robust
>> solution of using a signal to indicate reopening a log. logrotate is
>> then configured to rotate by moving the log to a new name, sending
>> the
>> signal, then compressing the old log.
> 
> Regarding implementation in libqb's case, libqb would simply provide
> the API for reopening the log, and clients such as pacemaker would
> intercept the signal and call the API.
> 

That sounds pretty easy to achieve. I'm also looking into high-res
timestamps for logfiles too.
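
For reference, the rename-then-signal rotation pattern described above looks
roughly like this in logrotate terms (the log path, daemon name and signal
are placeholders, not what pacemaker actually ships):

/var/log/myapp/myapp.log {
    weekly
    maxsize 100M
    rotate 7
    missingok
    compress
    delaycompress
    postrotate
        # ask the (hypothetical) daemon to reopen its log file
        /usr/bin/pkill -HUP myapp-daemon || true
    endscript
}

With delaycompress the just-rotated file isn't compressed until the next run,
so nothing is lost if the daemon takes a moment to reopen its log.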

> A minor complication is that pacemaker would have to supply different
> logrotate configs depending on the version of libqb available.
> 

Can't you just intercept the signal anyway and not do anything if an old
libqb is linked in?

Chrissie

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 27/09/18 12:52, Ferenc Wágner wrote:
> Christine Caulfield  writes:
> 
>> I'm looking into new features for libqb and the option in
>> https://github.com/ClusterLabs/libqb/issues/142#issuecomment-76206425
>> looks like a good option to me.
> 
> It feels backwards to me: traditionally, increasing numbers signify
> older rotated logs, while this proposal does the opposite.  And what
> happens on application restart?  Do you overwrite from 0?  Do you ever
> jump back to 0?  It also leaves the problem of cleaning up old log files
> unsolved...

The idea I had was to look for logs with 'old' numbers at startup and
then start a new log with the next number, starting at 0 or 1. Good
point about the numbers going the other way with logrotate though, I
hadn't considered that

> 
>> Though adding an API call to re-open the log file could be done too -
>> I'm not averse to having both,
> 
> Not addig log rotation policy (and implementation!) to each application
> is a win in my opinion, and also unifies local administration.  Syslog
> is pretty good in this regard, my only gripe with it is that its time
> stamps can't be quite as precise as the ones from the (realtime)
> application (even nowadays, under systemd).  And that it can block the
> log stream... on the other hand, disk latencies can block log writes
> just as well.
> 

TBH I would be quite happy to leave this to logrotate but the message I
was getting here is that we need additional help from libqb. I'm willing
to go with a consensus on this though

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 26/09/18 09:21, Ferenc Wágner wrote:
> Jan Friesse  writes:
> 
>> wagner.fer...@kifu.gov.hu writes:
>>
>>> triggered by your favourite IPC mechanism (SIGHUP and SIGUSRx are common
>>> choices, but logging.* cmap keys probably fit Corosync better).  That
>>> would enable proper log rotation.
>>
>> What is the reason that you find "copytruncate" as non-proper log
>> rotation? I know there is a risk to loose some lines, but it should be
>> pretty small.
> 
> Yes, there's a chance of losing some messages.  It may be acceptable in
> some cases, but it's never desirable.  The copy operation also wastes
> I/O bandwidth.  Reopening the log files on some external trigger is a
> better solution on all accounts and also an industry standard.
> 
>> Anyway, this again one of the feature where support from libqb would
>> be nice to have (there is actually issue opened
>> https://github.com/ClusterLabs/libqb/issues/239).
> 
> That's a convoluted one for a simple reopen!  But yes, if libqb does not
> expose such functionality, you can't do much about it.  I'll stay with
> syslog for now. :)  In cluster environments centralised log management is
> a must anyway, and that's annoying to achieve with direct file logs.
> 

I'm looking into new features for libqb and the option in
https://github.com/ClusterLabs/libqb/issues/142#issuecomment-76206425
looks like a good option to me. Though adding an API call to re-open the
log file could be done too - I'm not averse to having both.

Chrissie

>>> Jan Friesse  writes:
>>>
 No matter how much I still believe totemsrp as a library would be
 super nice to have - but current state is far away from what I would
 call library (= something small, without non-related things like
 transports/ip/..., testable (ideally with unit tests testing corner
 cases)) and making one fat binary looks like a better way.

 I'll made a patch and send PR (it should be easy).
>>>
>>> Sounds sensible.  Somebody can still split it out later if needed.
>>
>> Yep (and PR send + merged already :) )
> 
> Great!  Did you mean to keep the totem.h, totemip.h, totempg.h and
> totemstats.h header files installed nevertheless?  And totem_pg.pc could
> go as well, I guess.
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-24 Thread Christine Caulfield
On 24/09/18 13:12, Ferenc Wágner wrote:
> Jan Friesse  writes:
> 
>> Have you had a time to play with packaging current alpha to find out
>> if there are no issues? I had no problems with Fedora, but Debian has
>> a lot of patches, and I would be really grateful if we could reduce
>> them a lot - so please let me know if there is patch which you've sent
>> PR for and it's not merged yet.
> 
> Hi Honza,
> 
> Sorry for the delay.  You've already merged my PR for two simple typos,
> thanks!  Beyond that, there really isn't much in our patch queue
> anymore.  As far as I can see, current master even has a patch for error
> propagation in notifyd, which will let us drop one more!  And we arrive
> at the example configs.  We prefer syslog for several reasons
> (copytruncate rotation isn't pretty, decoupling possible I/O stalls) and
> we haven't got the /var/log/cluster legacy.  But more importantly, the
> knet default transport requires a nodelist instead of interfaces, unlike
> mcast udp.  The "ring" terminology might need a change as well,
> especially ring0_addr.  So I'd welcome an overhaul of the (now knet)
> example config, but I'm not personally qualified for doing that. :)
> 
> Finally, something totally unrelated: the libtotem_pg shared object
> isn't standalone anymore, it has several undefined symbols (icmap_get_*,
> stats_knet_add_member, etc) which are defined in the corosync binary.
> Why is it still a separate object then?
> 

I argued a while back for making it a static library or just building it
straight into corosync. It makes little sense to me having it as a
shared library any more - if it ever did. All it really achieves (IMHO)
is to make debugging more complicated than it ought to be.

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] short circuiting the corosync token timeout

2018-08-13 Thread Christine Caulfield
On 13/08/18 09:00, Jan Friesse wrote:
Chris Walker wrote:
>> Hello,
>>
>> Before Pacemaker can declare a node as 'offline', the Corosync layer
>> must first declare that the node is no longer part of the cluster
>> after waiting a full token timeout.  For example, if I manually
>> STONITH a node with 'crm -F node fence node2', even if the fence
>> operation happens immediately, Corosync will still wait the full token
>> timeout before communicating to Pacemaker that node2 is offline.
>>
>> There are scenarios where it would be advantageous to short circuit
>> the Corosync token timeout since we know that a node is offline. For
>> example, if a node crashes and dumps a vmcore, it sends out packets
>> indicating that it's safely offline.  Or if a node is physically
>> removed from a chassis and an event is sent indicating that the node
>> is physically gone.  In these cases, there's no need to wait the full
>> token timeout; it would be best to declare the node unclean, STONITH
>> it, and move resources.
>>
>> Has anyone dealt with a scenario like this?  I have a version of
>> Corosync with a parameter that effectively expires the token and
>> forces the cluster to reconfigure, but this seems a bit heavy handed
>> and I'm wondering if there's a better way of going about this.
> 
> I'm not aware of such functionality. Closest you can get right now is to
> shutdown (cleanly) one of the nodes, this will force corosync to create
> new membership.
> 
> Anyway, I've filled GH issue
> https://github.com/corosync/corosync/issues/366


I'm intrigued as to why the token timeout is so long that it's quicker
to do a manual intervention than simply wait for it to expire?

Some of the earlier implementations of qdiskd and multipath required
long timeouts (though still only in the realm of 30 to 60 seconds) but I
thought even those had been fixed.

Bear in mind that this is a potentially dangerous operation so that any
'official' implementation will require the user to confirm their
intention - thus making it an even more time-consuming process.

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Christine Caulfield
On 06/07/18 13:24, Salvatore D'angelo wrote:
> Hi All,
> 
> The option --ulimit memlock=536870912 worked fine.
> 
> I have now another strange issue. The upgrade without updating libqb
> (leaving the 0.16.0) worked fine.
> If after the upgrade I stop pacemaker and corosync, I download the
> latest libqb version:
> https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
> build and install it everything works fine.
> 
> If I try to install in sequence (after the installation of old code):
> 
> libqb 1.0.3
> corosync 2.4.4
> pacemaker 1.1.18
> crmsh 3.0.1
> resource agents 4.1.1
> 
> when I try to start corosync I got the following error:
> *Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line
> 99:  8470 Aborted                 $prog $COROSYNC_OPTIONS > /dev/null 2>&1*
> *[FAILED]*


Yes, you can't randomly swap in and out hand-compiled libqb versions.
Find one that works and stick to it. It's an annoying 'feature' of newer
linkers that we had to work around in libqb. So if you rebuild libqb
1.0.3 then you will, in all likelihood, need to rebuild corosync to
match it.
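
In other words, something along these lines (the prefix is just an example,
adjust to wherever you normally install):

# rebuild and install libqb first
cd libqb-1.0.3
./configure --prefix=/usr && make && make install
ldconfig

# then rebuild corosync against the libqb you just installed
cd ../corosync-2.4.4
./configure --prefix=/usr && make && make install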

Chrissie


> 
> if I launch corosync -f I got:
> *corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite
> section is populated, otherwise target's build is at fault, preventing
> reliable logging" && __start___verbose != __stop___verbose' failed.*
> 
> anything is logged (even in debug mode).
> 
> I do not understand why installing libqb during the normal upgrade
> process fails while if I upgrade it after the
> crmsh/pacemaker/corosync/resourceagents upgrade it works fine. 
> 
On 3 Jul 2018, at 11:42, Christine Caulfield <ccaul...@redhat.com> wrote:
>>
>> On 03/07/18 07:53, Jan Pokorný wrote:
>>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>>>> Today I tested the two suggestions you gave me. Here what I did. 
>>>> In the script where I create my 5 machines cluster (I use three
>>>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>>>> that we use for database backup and WAL files).
>>>>
>>>> FIRST TEST
>>>> ——
>>>> I added the —shm-size=512m to the “docker create” command. I noticed
>>>> that as soon as I start it the shm size is 512m and I didn’t need to
>>>> add the entry in /etc/fstab. However, I did it anyway:
>>>>
>>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>>>
>>>> and then
>>>> mount -o remount /dev/shm
>>>>
>>>> Then I uninstalled all pieces of software (crmsh, resource agents,
>>>> corosync and pacemaker) and installed the new one.
>>>> Started corosync and pacemaker but same problem occurred.
>>>>
>>>> SECOND TEST
>>>> ———
>>>> stopped corosync and pacemaker
>>>> uninstalled corosync
>>>> build corosync with --enable-small-memory-footprint and installed it
>>>> starte corosync and pacemaker
>>>>
>>>> IT WORKED.
>>>>
>>>> I would like to understand now why it didn’t worked in first test
>>>> and why it worked in second. Which kind of memory is used too much
>>>> here? /dev/shm seems not the problem, I allocated 512m on all three
>>>> docker images (obviously on my single Mac) and enabled the container
>>>> option as you suggested. Am I missing something here?
>>>
>>> My suspicion then fully shifts towards "maximum number of bytes of
>>> memory that may be locked into RAM" per-process resource limit as
>>> raised in one of the most recent message ...
>>>
>>>> Now I want to use Docker for the moment only for test purpose so it
>>>> could be ok to use the --enable-small-memory-footprint, but there is
>>>> something I can do to have corosync working even without this
>>>> option?
>>>
>>> ... so try running the container the already suggested way:
>>>
>>>  docker run ... --ulimit memlock=33554432 ...
>>>
>>> or possibly higher (as a rule of thumb, keep doubling the accumulated
>>> value until some unreasonable amount is reached, like the equivalent
>>> of already used 512 MiB).
>>>
>>> Hope this helps.
>>
>> This makes a lot of sense to me. As Poki pointed out earlier, in
>> corosync 2.4.3 (I think) we fixed a regression in that caused corosync
>> NOT to be locked in RAM after it forked 

Re: [ClusterLabs] Found libqb issue that affects pacemaker 1.1.18

2018-07-06 Thread Christine Caulfield
On 06/07/18 10:09, Salvatore D'angelo wrote:
> I closed the issue.
> Libqb uses tagging and people should not download the Source code (zip)
>  or Source
> code (tar.gz) .
> The following should be downloaded.
> libqb-1.0.3.tar.gz
> 
> 
> I thought it contained the binary files. I wasn’t aware of the tagging
> system and that it was required to download that version of the tar.gz file.
> 

It does say so at the bottom of the releases page. Maybe it should be at
the top :)

Chrissie

>> On 5 Jul 2018, at 17:35, Salvatore D'angelo > > wrote:
>>
>> Hi,
>>
>> I tried to build libqb 1.0.3 on a fresh machine and then corosync
>> 2.4.4 and pacemaker 1.1.18.
>> I found the following bug and filed against libqb GitHub:
>> https://github.com/ClusterLabs/libqb/issues/312
>>
>> for the moment I fixed it manually on my env. Anyone experienced this
>> issue?
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-07-03 Thread Christine Caulfield
On 03/07/18 07:53, Jan Pokorný wrote:
> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>> Today I tested the two suggestions you gave me. Here what I did. 
>> In the script where I create my 5 machines cluster (I use three
>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>> that we use for database backup and WAL files).
>>
>> FIRST TEST
>> ——
>> I added the —shm-size=512m to the “docker create” command. I noticed
>> that as soon as I start it the shm size is 512m and I didn’t need to
>> add the entry in /etc/fstab. However, I did it anyway:
>>
>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>
>> and then
>> mount -o remount /dev/shm
>>
>> Then I uninstalled all pieces of software (crmsh, resource agents,
>> corosync and pacemaker) and installed the new one.
>> Started corosync and pacemaker but same problem occurred.
>>
>> SECOND TEST
>> ———
>> stopped corosync and pacemaker
>> uninstalled corosync
>> build corosync with --enable-small-memory-footprint and installed it
>> started corosync and pacemaker
>>
>> IT WORKED.
>>
>> I would like to understand now why it didn’t work in the first test
>> and why it worked in the second. Which kind of memory is used too much
>> here? /dev/shm seems not the problem, I allocated 512m on all three
>> docker images (obviously on my single Mac) and enabled the container
>> option as you suggested. Am I missing something here?
> 
> My suspicion then fully shifts towards "maximum number of bytes of
> memory that may be locked into RAM" per-process resource limit as
> raised in one of the most recent message ...
> 
>> Now I want to use Docker for the moment only for test purpose so it
>> could be ok to use the --enable-small-memory-footprint, but is there
>> something I can do to have corosync working even without this
>> option?
> 
> ... so try running the container the already suggested way:
> 
>   docker run ... --ulimit memlock=33554432 ...
> 
> or possibly higher (as a rule of thumb, keep doubling the accumulated
> value until some unreasonable amount is reached, like the equivalent
> of already used 512 MiB).
> 
> Hope this helps.

This makes a lot of sense to me. As Poki pointed out earlier, in
corosync 2.4.3 (I think) we fixed a regression in that caused corosync
NOT to be locked in RAM after it forked - which was causing potential
performance issues. So if you replace an earlier corosync with 2.4.3 or
later then it will use more locked memory than before.
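
For anyone hitting this in Docker, a sketch of creating the container with
both knobs raised; the image name is a placeholder and the values are just
the starting points discussed in this thread:

# --shm-size sets the tmpfs size of /dev/shm inside the container
# --ulimit memlock sets how many bytes the corosync process may lock in RAM
docker create -it --shm-size=512m --ulimit memlock=536870912 \
  --name pg1 --hostname pg1 my-cluster-image /bin/bash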

Chrissie


> 
>> The reason I am asking this is that, in the future, it could be
>> possible we deploy in production our cluster in containerised way
>> (for the moment is just an idea). This will save a lot of time in
>> developing, maintaining and deploying our patch system. All
>> prerequisites and dependencies will be enclosed in container and if
>> IT team will do some maintenance on bare metal (i.e. install new
>> dependencies) it will not affect our containers. I do not see a lot
>> of performance drawbacks in using container. The point is to
>> understand if a containerised approach could save us lot of headache
>> about maintenance of this cluster without affect performance too
>> much. I notice this approach in Cloud environments in a lot of
>> contexts.
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-07-01 Thread Christine Caulfield
On 29/06/18 17:20, Jan Pokorný wrote:
> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>> One thing that I do not understand is that I tried to compare corosync
>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>> differences but I haven’t found anything related to the piece of code
>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>> Probably the issue is somewhere else.
>>>
>>
>> This might be asking a bit much, but would it be possible to try this
>> using Virtual Machines rather than Docker images? That would at least
>> eliminate a lot of complex variables.
> 
> Salvatore, you can ignore the part below, try following the "--shm"
> advice in other part of this thread.  Also the previous suggestion
> to compile corosync with --small-memory-footprint may be of help,
> but comes with other costs (expect lower throughput).
> 
> 
> Chrissie, I have a plausible explanation and if it's true, then the
> same will be reproduced wherever /dev/shm is small enough.
> 
> If I am right, then the offending commit is
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
> (present since 2.4.3), and while it arranges things for the better
> in the context of prioritized, low jitter process, it all of
> a sudden prevents as-you-need memory acquisition from the system,
> meaning that the memory consumption constraints are checked immediately
> when the memory is claimed (as it must fit into dedicated physical
> memory in full).  Hence this impact we likely never realized may
> be perceived as a sort of a regression.
> 
> Since we can calculate the approximate requirements statically, might
> be worthy to add something like README.requirements, detailing how much
> space will be occupied for typical configurations at minimum, e.g.:
> 
> - standard + --small-memory-footprint configuration
> - 2 + 3 + X nodes (5?)
> - without any service on top + teamed with qnetd + teamed with
>   pacemaker atop (including just IPC channels between pacemaker
>   daemons and corosync's CPG service, indeed)
> 

That is a possible explanation I suppose, yes. It's not something we can
sensibly revert because it was already fixing another regression!


I like the idea of documenting the /dev/shm requirements - that would
certainly help with other people using containers - Salvatore mentioned
earlier that there was nothing to guide him about the size needed. I'll
raise an issue in GitHub to cover it. Your input on how to do it for
containers would also be helpful.
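
A rough way to measure what a given configuration actually needs, assuming
the libqb ring buffers live under /dev/shm/qb-* as they do in the logs above:

df -h /dev/shm             # total tmpfs size and current usage
du -shc /dev/shm/qb-*      # space taken right now by the libqb ring buffers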

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Christine Caulfield
On 27/06/18 08:35, Salvatore D'angelo wrote:
> Hi,
> 
> Thanks for the reply and detailed explanation. I am not using the
> —network=host option.
> I have a docker image based on Ubuntu 14.04 where I only deploy this
> additional software:
> 
> *RUN apt-get update && apt-get install -y wget git xz-utils
> openssh-server \*
> *systemd-services make gcc pkg-config psmisc fuse libpython2.7
> libopenipmi0 \*
> *libdbus-glib-1-2 libsnmp30 libtimedate-perl libpcap0.8*
> 
> configure ssh with key pairs to communicate easily. The containers are
> created with these simple commands:
> 
> *docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop0 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG1_SSH_PORT}:22 --ip ${PG1_PUBLIC_IP} --name ${PG1_PRIVATE_NAME}
> --hostname ${PG1_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash*
> 
> *docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop1 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG2_SSH_PORT}:22 --ip ${PG2_PUBLIC_IP} --name ${PG2_PRIVATE_NAME}
> --hostname ${PG2_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash*
> 
> *docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop2 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG3_SSH_PORT}:22 --ip ${PG3_PUBLIC_IP} --name ${PG3_PRIVATE_NAME}
> --hostname ${PG3_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash*
> 
> /dev/fuse is used to configure glusterfs on two others nodes and
> /dev/loopX just to simulate better my bare metal env.
> 
> One thing that I do not understand is that I tried to compare corosync
> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
> differences but I haven’t found anything related to the piece of code
> that affects the issue. The quorum tool.c and cfg.c are almost the same.
> Probably the issue is somewhere else.
> 

This might be asking a bit much, but would it be possible to try this
using Virtual Machines rather than Docker images? That would at least
eliminate a lot of complex variables.

Chrissie


> 
>> On 27 Jun 2018, at 08:34, Jan Pokorný > > wrote:
>>
>> On 26/06/18 17:56 +0200, Salvatore D'angelo wrote:
>>> I did another test. I modified docker container in order to be able
>>> to run strace.
>>> Running strace corosync-quorumtool -ps I got the following:
>>
>>> [snipped]
>>> connect(5, {sa_family=AF_LOCAL, sun_path=@"cfg"}, 110) = 0
>>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
>>> sendto(5,
>>> "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24,
>>> MSG_NOSIGNAL, NULL, 0) = 24
>>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
>>> recvfrom(5, 0x7ffd73bd7ac0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5,
>>> revents=POLLIN}])
>>> recvfrom(5,
>>> "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\365\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0"...,
>>> 12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
>>> shutdown(5, SHUT_RDWR)  = 0
>>> close(5)    = 0
>>> write(2, "Cannot initialise CFG service\n", 30Cannot initialise CFG
>>> service) = 30
>>> [snipped]
>>
>> This just demonstrated the effect of already detailed server-side
>> error in the client, which communicates with the server just fine,
>> but as soon as the server hits the mmap-based problem, it bails
>> out the observed way, leaving client unsatisfied.
>>
>> Note one thing, abstract Unix sockets are being used for the
>> communication like this (observe the first line in the strace
>> output excerpt above), and if you happen to run container via
>> a docker command with --network=host, you may also be affected with
>> issues arising from abstract sockets not being isolated but rather
>> sharing the same namespace.  At least that was the case some years
>> back and what asked for a switch in underlying libqb library to
>> use strictly the file-backed sockets, where the isolation
>> semantics matches the intuition:
>>
>> https://lists.clusterlabs.org/pipermail/users/2017-May/013003.html
>>
>> + way to enable (presumably only for container environments, note
>> that there's no per process straightforward granularity):
>>
>> https://clusterlabs.github.io/libqb/1.0.2/doxygen/qb_ipc_overview.html
>> (scroll down to "IPC sockets (Linux only)")
>>
>> You may test that if you are using said --network=host switch.
>>
>>> I tried to understand what happen behind the scene but it is not easy
>>> for me.
>>> Hoping someone on this list can help.
>>
>> Containers are tricky, just as Ansible (as shown earlier on the list)
>> can be, when encumbered with false beliefs and/or misunderstandings.
>> Virtual machines may serve better wrt. insights for the later bare
>> metal deployments.
>>
>> -- 
>> Jan (Poki)
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 12:16, Salvatore D'angelo wrote:
> libqb update to 1.0.3 but same issue.
> 
> I know corosync has also these dependencies nspr and nss3. I updated
> them using apt-get install, here the version installed:
> 
>    libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>    libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
> 
> but same problem.
> 
> I am working on Ubuntu 14.04 image and I know that packages could be
> quite old here. Are there new versions for these libraries?
> Where I can download them? I tried to search on google but results where
> quite confusing.
> 

It's pretty unlikely to be the crypto libraries. It's almost certainly
in libqb, with a small possibility that it's corosync. Which versions did
you have that worked (libqb and corosync)?

Chrissie


> 
>> On 26 Jun 2018, at 12:27, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>>
>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I have tried with:
>>> 0.16.0.real-1ubuntu4
>>> 0.16.0.real-1ubuntu5
>>>
>>> which version should I try?
>>
>>
>> Hmm, both of those are actually quite old! Maybe a newer one?
>>
>> Chrissie
>>
>>>
>>>> On 26 Jun 2018, at 12:03, Christine Caulfield >>> <mailto:ccaul...@redhat.com>
>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>
>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>> Consider that the container is the same when corosync 2.3.5 run.
>>>>> If it is something related to the container probably the 2.4.4
>>>>> introduced a feature that has an impact on container.
>>>>> Should be something related to libqb according to the code.
>>>>> Anyone can help?
>>>>>
>>>>
>>>>
>>>> Have you tried downgrading libqb to the previous version to see if it
>>>> still happens?
>>>>
>>>> Chrissie
>>>>
>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield >>>>> <mailto:ccaul...@redhat.com>
>>>>>> <mailto:ccaul...@redhat.com>
>>>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>>>
>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>> Sorry after the command:
>>>>>>>
>>>>>>> corosync-quorumtool -ps
>>>>>>>
>>>>>>> the error in log are still visible. Looking at the source code it
>>>>>>> seems
>>>>>>> problem is at this line:
>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>
>>>>>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) !=
>>>>>>> CS_OK) {
>>>>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>> q_handle = 0;
>>>>>>> goto out;
>>>>>>> }
>>>>>>>
>>>>>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>> c_handle = 0;
>>>>>>> goto out;
>>>>>>> }
>>>>>>>
>>>>>>> The quorum_initialize function is defined here:
>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>>>
>>>>>>> It seems interacts with libqb to allocate space on /dev/shm but
>>>>>>> something fails. I tried to update the libqb with apt-get install
>>>>>>> but no
>>>>>>> success.
>>>>>>>
>>>>>>> The same for second function:
>>>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>>>>>
>>>>>>> Now I am not an expert of libqb. I have the
>>>>>>> version 0.16.0.real-1ubuntu5.
>>>>>>>
>>>>>>> The folder /dev/shm has 777 permission like other nodes with older
>>>>>>> corosync and pacemaker that work fine. The only difference is that I
>>>>>>> only see files created by root, no one created by hacluster like
>>>>>>> other
>>>>>>> two nodes (probably because pacemaker didn’t start correctly).

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:24, Salvatore D'angelo wrote:
> Hi,
> 
> I have tried with:
> 0.16.0.real-1ubuntu4
> 0.16.0.real-1ubuntu5
> 
> which version should I try?


Hmm, both of those are actually quite old! Maybe a newer one?

Chrissie

> 
>> On 26 Jun 2018, at 12:03, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>>
>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>> Consider that the container is the same when corosync 2.3.5 run.
>>> If it is something related to the container probably the 2.4.4
>>> introduced a feature that has an impact on container.
>>> Should be something related to libqb according to the code.
>>> Anyone can help?
>>>
>>
>>
>> Have you tried downgrading libqb to the previous version to see if it
>> still happens?
>>
>> Chrissie
>>
>>>> On 26 Jun 2018, at 11:56, Christine Caulfield >>> <mailto:ccaul...@redhat.com>
>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>
>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>> Sorry after the command:
>>>>>
>>>>> corosync-quorumtool -ps
>>>>>
>>>>> the error in log are still visible. Looking at the source code it seems
>>>>> problem is at this line:
>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>
>>>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>> q_handle = 0;
>>>>> goto out;
>>>>> }
>>>>>
>>>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>>> c_handle = 0;
>>>>> goto out;
>>>>> }
>>>>>
>>>>> The quorum_initialize function is defined here:
>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>
>>>>> It seems interacts with libqb to allocate space on /dev/shm but
>>>>> something fails. I tried to update the libqb with apt-get install
>>>>> but no
>>>>> success.
>>>>>
>>>>> The same for second function:
>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>>>
>>>>> Now I am not an expert of libqb. I have the
>>>>> version 0.16.0.real-1ubuntu5.
>>>>>
>>>>> The folder /dev/shm has 777 permission like other nodes with older
>>>>> corosync and pacemaker that work fine. The only difference is that I
>>>>> only see files created by root, no one created by hacluster like other
>>>>> two nodes (probably because pacemaker didn’t start correctly).
>>>>>
>>>>> This is the analysis I have done so far.
>>>>> Any suggestion?
>>>>>
>>>>>
>>>>
>>>> Hmm. It seems very likely something to do with the way the container is
>>>> set up then - and I know nothing about containers. Sorry :/
>>>>
>>>> Can anyone else help here?
>>>>
>>>> Chrissie
>>>>
>>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>>>>>> mailto:sasadang...@gmail.com>
>>>>>> <mailto:sasadang...@gmail.com>
>>>>>> <mailto:sasadang...@gmail.com>> wrote:
>>>>>>
>>>>>> Yes, sorry you’re right I could find it by myself.
>>>>>> However, I did the following:
>>>>>>
>>>>>> 1. Added the line you suggested to /etc/fstab
>>>>>> 2. mount -o remount /dev/shm
>>>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> overlay          63G   11G   49G  19% /
>>>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>>>> osxfs           466G  158G  305G  35% /Users
>>>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>>>> *shm             512M   15M  498M   3% /dev/shm*
>>>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>>>> tmpfs           128M     0  128M   0% /tmp

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same when corosync 2.3.5 run.
> If it is something related to the container probably the 2.4.4
> introduced a feature that has an impact on container.
> Should be something related to libqb according to the code.
> Anyone can help?
> 


Have you tried downgrading libqb to the previous version to see if it
still happens?

Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the error in log are still visible. Looking at the source code it seems
>>> problem is at this line:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>> q_handle = 0;
>>> goto out;
>>> }
>>>
>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>> c_handle = 0;
>>> goto out;
>>> }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems interacts with libqb to allocate space on /dev/shm but
>>> something fails. I tried to update the libqb with apt-get install but no
>>> success.
>>>
>>> The same for second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permission like other nodes with older
>>> corosync and pacemaker that work fine. The only difference is that I
>>> only see files created by root, no one created by hacluster like other
>>> two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. It seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo >>> <mailto:sasadang...@gmail.com>
>>>> <mailto:sasadang...@gmail.com>> wrote:
>>>>
>>>> Yes, sorry you’re right I could find it by myself.
>>>> However, I did the following:
>>>>
>>>> 1. Added the line you suggested to /etc/fstab
>>>> 2. mount -o remount /dev/shm
>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> overlay          63G   11G   49G  19% /
>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>> osxfs           466G  158G  305G  35% /Users
>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>> *shm             512M   15M  498M   3% /dev/shm*
>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>> tmpfs           128M     0  128M   0% /tmp
>>>>
>>>> The errors in log went away. Consider that I remove the log file
>>>> before start corosync so it does not contains lines of previous
>>>> executions.
>>>> 
>>>>
>>>> But the command:
>>>> corosync-quorumtool -ps
>>>>
>>>> still give:
>>>> Cannot initialize QUORUM service
>>>>
>>>> Consider that few minutes before it gave me the message:
>>>> Cannot initialize CFG service
>>>>
>>>> I do not know the differences between CFG and QUORUM in this case.
>>>>
>>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>>> and the Transport does not work if I try to run a crm command.
>>>> Any suggestion?
>>>>
>>>>
>>>>> On 26 Jun 2018, at 10:49, Christine Caulfield >>>> <mailto:ccaul...@redhat.com>
>>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>>
>>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>>> Hi,

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 10:35, Salvatore D'angelo wrote:
> Sorry after the command:
> 
> corosync-quorumtool -ps
> 
> the error in log are still visible. Looking at the source code it seems
> problem is at this line:
> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
> 
>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
> fprintf(stderr, "Cannot initialize QUORUM service\n");
> q_handle = 0;
> goto out;
> }
> 
> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
> fprintf(stderr, "Cannot initialise CFG service\n");
> c_handle = 0;
> goto out;
> }
> 
> The quorum_initialize function is defined here:
> https://github.com/corosync/corosync/blob/master/lib/quorum.c
> 
> It seems interacts with libqb to allocate space on /dev/shm but
> something fails. I tried to update the libqb with apt-get install but no
> success.
> 
> The same for second function:
> https://github.com/corosync/corosync/blob/master/lib/cfg.c
> 
> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
> 
> The folder /dev/shm has 777 permission like other nodes with older
> corosync and pacemaker that work fine. The only difference is that I
> only see files created by root, no one created by hacluster like other
> two nodes (probably because pacemaker didn’t start correctly).
> 
> This is the analysis I have done so far.
> Any suggestion?
> 
> 

Hmm. It seems very likely something to do with the way the container is
set up then - and I know nothing about containers. Sorry :/

Can anyone else help here?

Chrissie

>> On 26 Jun 2018, at 11:03, Salvatore D'angelo > <mailto:sasadang...@gmail.com>> wrote:
>>
>> Yes, sorry you’re right I could find it by myself.
>> However, I did the following:
>>
>> 1. Added the line you suggested to /etc/fstab
>> 2. mount -o remount /dev/shm
>> 3. Now I correctly see /dev/shm of 512M with df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> *shm             512M   15M  498M   3% /dev/shm*
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>>
>> The errors in log went away. Consider that I remove the log file
>> before start corosync so it does not contains lines of previous
>> executions.
>> 
>>
>> But the command:
>> corosync-quorumtool -ps
>>
>> still give:
>> Cannot initialize QUORUM service
>>
>> Consider that few minutes before it gave me the message:
>> Cannot initialize CFG service
>>
>> I do not know the differences between CFG and QUORUM in this case.
>>
>> If I try to start pacemaker the service is OK but I see only pacemaker
>> and the Transport does not work if I try to run a crm command.
>> Any suggestion?
>>
>>
>>> On 26 Jun 2018, at 10:49, Christine Caulfield >> <mailto:ccaul...@redhat.com>> wrote:
>>>
>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>> Hi,
>>>>
>>>> Yes,
>>>>
>>>> I am reproducing only the required part for test. I think the original
>>>> system has a larger shm. The problem is that I do not know exactly how
>>>> to change it.
>>>> I tried the following steps, but I have the impression I didn’t
>>>> perform the right one:
>>>>
>>>> 1. remove everything under /tmp
>>>> 2. Added the following line to /etc/fstab
>>>> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>>>>         0  0
>>>> 3. mount /tmp
>>>> 4. df -h
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> overlay          63G   11G   49G  19% /
>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>> osxfs           466G  158G  305G  35% /Users
>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>> shm              64M   11M   54M  16% /dev/shm
>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>> *tmpfs           128M     0  128M   0% /tmp*
>>>>
>>>> The errors are exactly the same.
>>>> I have the impression that I changed the wrong parameter. Probably I
>>>> have to change /dev/shm itself, but I do not know how to do that.

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 09:40, Salvatore D'angelo wrote:
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original
> system has a larger shm. The problem is that I do not know exactly how
> to change it.
> I tried the following steps, but I have the impression I didn’t
> perform the right one:
> 
> 1. remove everything under /tmp
> 2. Added the following line to /etc/fstab
> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>         0  0
> 3. mount /tmp
> 4. df -h
> Filesystem      Size  Used Avail Use% Mounted on
> overlay          63G   11G   49G  19% /
> tmpfs            64M  4.0K   64M   1% /dev
> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
> osxfs           466G  158G  305G  35% /Users
> /dev/sda1        63G   11G   49G  19% /etc/hosts
> shm              64M   11M   54M  16% /dev/shm
> tmpfs          1000M     0 1000M   0% /sys/firmware
> *tmpfs           128M     0  128M   0% /tmp*
> 
> The errors are exactly the same.
> I have the impression that I changed the wrong parameter. Probably I
> have to change:
> shm              64M   11M   54M  16% /dev/shm
> 
> but I do not know how to do that. Any suggestion?
> 

According to Google, you just add a new line to /etc/fstab for /dev/shm

tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
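
Then apply it without a reboot and check the result (the same commands used
later in this thread):

mount -o remount /dev/shm
df -h /dev/shm             # should now report the new size, e.g. 512M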

Chrissie

>> On 26 Jun 2018, at 09:48, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>>
>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Let me add here one important detail. I use Docker for my test with 5
>>> containers deployed on my Mac.
>>> Basically the team that worked on this project installed the cluster
>>> on soft layer bare metal.
>>> The PostgreSQL cluster was hard to test and if a misconfiguration
>>> occurred, recreating the cluster from scratch was not easy.
>>> Testing it was cumbersome if you consider that we access the
>>> machines with a complex system that is hard to describe here.
>>> For this reason I ported the cluster on Docker for test purpose. I am
>>> not interested to have it working for months, I just need a proof of
>>> concept. 
>>>
>>> When the migration works I’ll port everything on bare metal where
>>> resources are abundant.
>>>
>>> Now I have enough RAM and disk space on my Mac so if you tell me what
>>> should be an acceptable size for several days of running it is ok for me.
>>> It is ok also to have commands to clean the shm when required.
>>> I know I can find them on Google but if you can suggest this info
>>> I’ll appreciate it. I have the OS knowledge to do that but I would like
>>> to avoid days of guesswork and trial and error if possible.
>>
>>
>> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
>> spare it. My 'standard' system uses 75MB under normal running allowing
>> for one command-line query to run.
>>
>> If I read this right then you're reproducing a bare-metal system in
>> containers now? So the original systems will have a default /dev/shm
>> size which is probably much larger than your containers?
>>
>> I'm just checking here that we don't have a regression in memory usage
>> as Poki suggested.
>>
>> Chrissie
>>
>>>> On 25 Jun 2018, at 21:18, Jan Pokorný >>> <mailto:jpoko...@redhat.com>> wrote:
>>>>
>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>> Thanks for reply. I scratched my cluster and created it again and
>>>>> then migrated as before. This time I uninstalled pacemaker,
>>>>> corosync, crmsh and resource agents with make uninstall
>>>>>
>>>>> then I installed new packages. The problem is the same, when
>>>>> I launch:
>>>>> corosync-quorumtool -ps
>>>>>
>>>>> I got: Cannot initialize QUORUM service
>>>>>
>>>>> Here the log with debug enabled:
>>>>>
>>>>>
>>>>> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap
>>>>> on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>> [18019] pg3 corosyncerror   [QB    ]
>>>>> qb_rb_open:cfg-event-18020-18028-23: Resource temporarily
>>>>> unavailable (11)
>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
>>>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>> [18019] pg3 corosyncdebug

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 25/06/18 20:41, Salvatore D'angelo wrote:
> Hi,
> 
> Let me add here one important detail. I use Docker for my test with 5 
> containers deployed on my Mac.
> Basically the team that worked on this project installed the cluster on soft 
> layer bare metal.
> The PostgreSQL cluster was hard to test and if a misconfiguration occurred,
> recreating the cluster from scratch was not easy.
> Testing it was cumbersome if you consider that we access the machines with
> a complex system that is hard to describe here.
> For this reason I ported the cluster on Docker for test purpose. I am not 
> interested to have it working for months, I just need a proof of concept. 
> 
> When the migration works I’ll port everything on bare metal where
> resources are abundant.
> 
> Now I have enough RAM and disk space on my Mac so if you tell me what should 
> be an acceptable size for several days of running it is ok for me.
> It is ok also to have commands to clean the shm when required.
> I know I can find them on Google but if you can suggest this info I’ll
> appreciate it. I have the OS knowledge to do that but I would like to avoid
> days of guesswork and trial and error if possible.


I would recommend at least 128MB of space on /dev/shm, 256MB if you can
spare it. My 'standard' system uses 75MB under normal running allowing
for one command-line query to run.

If I read this right then you're reproducing a bare-metal system in
containers now? So the original systems will have a default /dev/shm
size which is probably much larger than your containers?

I'm just checking here that we don't have a regression in memory usage
as Poki suggested.
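
A quick way to compare the two, assuming the container is named pg1 as
elsewhere in this thread:

df -h /dev/shm                     # on the bare-metal host
docker exec pg1 df -h /dev/shm     # inside the container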

Chrissie

>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>
>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>> Thanks for reply. I scratched my cluster and created it again and
>>> then migrated as before. This time I uninstalled pacemaker,
>>> corosync, crmsh and resource agents with make uninstall
>>>
>>> then I installed new packages. The problem is the same, when
>>> I launch:
>>> corosync-quorumtool -ps
>>>
>>> I got: Cannot initialize QUORUM service
>>>
>>> Here the log with debug enabled:
>>>
>>>
>>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>>> /dev/shm/qb-cfg-event-18020-18028-23-data
>>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>>> Resource temporarily unavailable (11)
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-response-18020-18028-23-header
>>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>>> temporarily unavailable (11)
>>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>>> (18020-18028-23): Resource temporarily unavailable (11)
>>>
>>> I tried to check /dev/shm and I am not sure these are the right
>>> commands, however:
>>>
>>> df -h /dev/shm
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> shm  64M   16M   49M  24% /dev/shm
>>>
>>> ls /dev/shm
>>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>>> qb-quorum-request-18020-18095-32-data
>>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>>> qb-quorum-request-18020-18095-32-header
>>>
>>> Is 64 MB enough for /dev/shm? If not, why did it work with the previous
>>> corosync release?
>>
>> For a start, can you try configuring corosync with
>> --enable-small-memory-footprint switch?
>>
>> Hard to say why the space provisioned to /dev/shm is the direct
>> opposite of generous (per today's standards), but may be the result
>> of automatic HW adaptation, and if RAM is so scarce in your case,
>> the above build-time toggle might help.
>>
>> If not, then exponentially increasing size of /dev/shm space is
>> likely your best bet (I don't recommended fiddling with mlockall()
>> and similar measures in corosync).
>>
>> Of course, feel free to raise a regression if you have a reproducible
>> comparison between two corosync (plus possibly different libraries
>> like libqb) versions, one that works and one that won't, in
>> reproducible conditions (like this small /dev/shm, VM image, etc.).
>>
>> -- 
>> Jan (Poki)
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org

Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Christine Caulfield
On 22/06/18 11:23, Salvatore D'angelo wrote:
> Hi,
> Here the log:
> 
> 
> 
[17323] pg1 corosyncerror   [QB] couldn't create circular mmap on
/dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB]
qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB] shm connection FAILED: Resource
temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB] Error in connection setup
(17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] qb_ipcs_disconnect(17324-17334-23)
state:0



Is /dev/shm full?


Chrissie


> 
> 
>> On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:
>>
>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Can you tell me exactly which log you need. I’ll provide you as soon as 
>>> possible.
>>>
>>> Regarding some settings, I am not the original author of this cluster.
>>> The people who created it left the company I am working with and I
>>> inherited the code, and sometimes I do not know why some settings are used.
>>> The old versions of pacemaker, corosync, crmsh and resource agents were 
>>> compiled and installed.
>>> I simply downloaded the new versions compiled and installed them. I didn’t 
>>> get any complaint during ./configure that usually checks for library 
>>> compatibility.
>>>
>>> To be honest I do not know if this is the right approach. Should I “make 
>>> uninstall" old versions before installing the new one?
>>> Which is the suggested approach?
>>> Thank in advance for your help.
>>>
>>
>> OK fair enough!
>>
>> To be honest the best approach is almost always to get the latest
>> packages from the distributor rather than compile from source. That way
>> you can be more sure that upgrades will be more smoothly. Though, to be
>> honest, I'm not sure how good the Ubuntu packages are (they might be
>> great, they might not, I genuinely don't know)
>>
>> When building from source and if you don't know the provenance of the
>> previous version then I would recommend a 'make uninstall' first - or
>> removal of the packages if that's where they came from.
>>
>> One thing you should do is make sure that all the cluster nodes are
>> running the same version. If some are running older versions then nodes
>> could drop out for obscure reasons. We try and keep minor versions
>> on-wire compatible but it's always best to be cautious.
>>
>> The tidying of your corosync.conf can wait for the moment, let's get
>> things mostly working first. If you enable debug logging in corosync.conf:
>>
>> logging {
>>to_syslog: yes
>>  debug: on
>> }
>>
>> Then see what happens and post the syslog file that has all of the
>> corosync messages in it, we'll take it from there.
>>
>> Chrissie
>>
>>>> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
>>>>
>>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>>> Hi Christine,
>>>>>
>>>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>>>> service I see the corosync process running. If I stop it and run:
>>>>>
>>>>> corosync -f 
>>>>>
>>>>> I see three warnings:
>>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>>> nodelist. Nodelist one is going to be used.
>>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>>> permitted (1)
>>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied 
>>>>> (13)
>>>>>
>>>>> but I see node joined.
>>>>>
>>>>
>>>> Those certainly need fixing but are probably not the cause. Also why do
>>>> you have these values below set?
>>>>
>>>> max_network_delay: 100
>>>> retransmits_before_loss_const: 25
>>>> window_size: 150
>>>>
>>>> I'm not saying they are causing the trouble, but they aren't going to
>>>> help keep a stable cluster.

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 22/06/18 10:39, Salvatore D'angelo wrote:
> Hi,
> 
> Can you tell me exactly which log you need. I’ll provide you as soon as 
> possible.
> 
> Regarding some settings, I am not the original author of this cluster. The
> people who created it left the company I am working with and I inherited the
> code, and sometimes I do not know why some settings are used.
> The old versions of pacemaker, corosync, crmsh and resource agents were 
> compiled and installed.
> I simply downloaded the new versions compiled and installed them. I didn’t 
> get any complaint during ./configure that usually checks for library 
> compatibility.
> 
> To be honest I do not know if this is the right approach. Should I “make 
> uninstall" old versions before installing the new one?
> Which is the suggested approach?
> Thank in advance for your help.
> 

OK fair enough!

To be honest the best approach is almost always to get the latest
packages from the distributor rather than compile from source. That way
you can be more sure that upgrades will go more smoothly. Though, to be
honest, I'm not sure how good the Ubuntu packages are (they might be
great, they might not, I genuinely don't know)

When building from source and if you don't know the provenance of the
previous version then I would recommend a 'make uninstall' first - or
removal of the packages if that's where they came from.

One thing you should do is make sure that all the cluster nodes are
running the same version. If some are running older versions then nodes
could drop out for obscure reasons. We try and keep minor versions
on-wire compatible but it's always best to be cautious.
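
A quick version check on each node might look something like this (the exact
option spellings can vary slightly between releases):

corosync -v            # corosync build and version
pacemakerd --version   # pacemaker version and feature set
crm --version          # crmsh version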

The tidying of your corosync.conf can wait for the moment, let's get
things mostly working first. If you enable debug logging in corosync.conf:

logging {
to_syslog: yes
debug: on
}

Then see what happens and post the syslog file that has all of the
corosync messages in it, we'll take it from there.
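
For example, something along these lines will pull out the recent corosync
entries on an Ubuntu system:

grep -i corosync /var/log/syslog | tail -n 200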

Chrissie

>> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
>>
>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>> Hi Christine,
>>>
>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>> service I see the corosync process running. If I stop it and run:
>>>
>>> corosync -f 
>>>
>>> I see three warnings:
>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>> nodelist. Nodelist one is going to be used.
>>> warning [MAIN  ] Please migrate config file to nodelist.
>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>> permitted (1)
>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>
>>> but I see node joined.
>>>
>>
>> Those certainly need fixing but are probably not the cause. Also why do
>> you have these values below set?
>>
>> max_network_delay: 100
>> retransmits_before_loss_const: 25
>> window_size: 150
>>
>> I'm not saying they are causing the trouble, but they aren't going to
>> help keep a stable cluster.
>>
>> Without more logs (full logs are always better than just the bits you
>> think are meaningful) I still can't be sure. It could easily be just
>> that you've overwritten a packaged version of corosync with your own
>> compiled one and they have different configure options or that the
>> libraries now don't match.
>>
>> Chrissie
>>
>>
>>> My corosync.conf file is below.
>>>
>>> With service corosync up and running I have the following output:
>>> *corosync-cfgtool -s*
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>> id= 10.0.0.11
>>> status= ring 0 active with no faults
>>> RING ID 1
>>> id= 192.168.0.11
>>> status= ring 1 active with no faults
>>>
>>> *corosync-cmapctl  | grep members*
>>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>>> ip(192.168.0.11) 
>>> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
>>> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>>> ip(192.168.0.12) 
>>> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
>>>
>>> For the moment I have two nodes in my cluster (the third node has some
>>> issues and at the moment I did crm node standby on it).

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
> 
> Thanks for the reply. Let me add a few details. When I run the corosync
> service I see the corosync process running. If I stop it and run:
> 
> corosync -f 
> 
> I see three warnings:
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
> 
> but I see node joined.
> 

Those certainly need fixing but are probably not the cause. Also why do
you have these values below set?

max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150

I'm not saying they are causing the trouble, but they aren't going to
help keep a stable cluster.

Without more logs (full logs are always better than just the bits you
think are meaningful) I still can't be sure. It could easily be just
that you've overwritten a packaged version of corosync with your own
compiled one and they have different configure options or that the
libraries now don't match.
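
One quick way to see which libqb a corosync binary is actually picking up;
the second line's paths are just where source builds and the Ubuntu package
typically land:

ldd $(which corosync) | grep -i libqb
ls -l /usr/local/lib/libqb.so* /usr/lib/x86_64-linux-gnu/libqb.so* 2>/dev/null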

Chrissie


> My corosync.conf file is below.
> 
> With service corosync up and running I have the following output:
> *corosync-cfgtool -s*
> Printing ring status.
> Local node ID 1
> RING ID 0
> id= 10.0.0.11
> status= ring 0 active with no faults
> RING ID 1
> id= 192.168.0.11
> status= ring 1 active with no faults
> 
> *corosync-cmapctl  | grep members*
> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
> ip(192.168.0.11) 
> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
> ip(192.168.0.12) 
> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
> 
> For the moment I have two nodes in my cluster (the third node has some
> issues and at the moment I did crm node standby on it).
> 
> Here the dependency I have installed for corosync (that works fine with
> pacemaker 1.1.14 and corosync 2.3.5):
>      libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>      libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>      libqb0_0.16.0.real-1ubuntu4_amd64.deb
> 
> *corosync.conf*
> -
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 3
> }
> totem {
>         version: 2
>         crypto_cipher: none
>         crypto_hash: none
>         rrp_mode: passive
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
>         max_network_delay: 100
>         retransmits_before_loss_const: 25
>         window_size: 150
> }
> nodelist {
>         node {
>                 ring0_addr: pg1
>                 ring1_addr: pg1p
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: pg2
>                 ring1_addr: pg2p
>                 nodeid: 2
>         }
>         node {
>                 ring0_addr: pg3
>                 ring1_addr: pg3p
>                 nodeid: 3
>         }
> }
> logging {
>         to_syslog: yes
> }
> 
> 
> 
> 
>> On 22 Jun 2018, at 09:24, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>>
>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>> Pacemaker 1.1.14 -> 1.1.18
>>> Corosync 2.3.5 -> 2.4.4
>>> Crmsh 2.2.0 -> 3.0.1
>>> Resource agents 3.9.7 -> 4.1.1
>>>
>>> I started on a first node  (I am trying one node at a time upgrade).
>>> On a PostgreSQL slave node  I did:
>>>
>>> *crm node standby *
>>> *service pacemaker stop*
>>> *service corosync stop*
>>>

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 21/06/18 16:16, Salvatore D'angelo wrote:
> Hi,
> 
> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
> Pacemaker 1.1.14 -> 1.1.18
> Corosync 2.3.5 -> 2.4.4
> Crmsh 2.2.0 -> 3.0.1
> Resource agents 3.9.7 -> 4.1.1
> 
> I started on a first node  (I am trying one node at a time upgrade).
> On a PostgreSQL slave node  I did:
> 
> *crm node standby *
> *service pacemaker stop*
> *service corosync stop*
> 
> Then I built the tools above as described on their GitHub.com
> pages.
> 
> *./autogen.sh (where required)*
> *./configure*
> *make (where required)*
> *make install*
> 
> Everything went ok. I expected the new files to overwrite the old ones. I left
> the dependencies I had from the old software because I noticed that ./configure
> didn’t complain.
> I started corosync.
> 
> *service corosync start*
> 
> To verify corosync work properly I used the following commands:
> *corosync-cfg-tool -s*
> *corosync-cmapctl | grep members*
> 
> Everything seemed ok and I verified my node joined the cluster (at least
> this is my impression).
> 
> Here I verified a problem. Doing the command:
> corosync-quorumtool -ps
> 
> I got the following problem:
> Cannot initialise CFG service
> 
That says that corosync is not running. Have a look in the log files to
see why it stopped. The pacemaker logs below are showing the same thing,
but we can't make any more guesses until we see what corosync itself is
doing. Enabling debug in corosync.conf will also help if more detail is
needed.

Also starting corosync with 'corosync -pf' on the command-line is often
a quick way of checking things are starting OK.
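
That is, roughly (see corosync(8) for the exact option meanings):

service corosync stop     # make sure the packaged service isn't already running
corosync -pf              # run in the foreground; startup errors print to the terminal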

Chrissie


> If I try to start pacemaker, I only see pacemaker process running and
> pacemaker.log containing the following lines:
> 
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed
> active directory to /var/lib/pacemaker/cores/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> get_cluster_type:Detected an active 'corosync' cluster/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> mcp_read_config:Reading configure for stack: corosync/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting
> Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc
> lha-fencing nagios  corosync-native atomic-attrd acls/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core
> file size is: 18446744073709551615/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> qb_ipcs_us_publish:server name: pacemakerd/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning:
> corosync_node_name:Could not connect to Cluster Configuration Database
> API, error CS_ERR_TRY_AGAIN/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> corosync_node_name:Unable to get node name for nodeid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could
> not obtain a node name for corosync nodeid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created
> entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node
> (null)/1 (1 total)/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1
> has uuid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg
> is now online/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:    error:
> cluster_connect_quorum:Could not connect to the Quorum API: 2/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> qb_ipcs_us_withdraw:withdrawing server sockets/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting pacemakerd/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> crm_xml_cleanup:Cleaning up memory from libxml2/
> 
> *What is wrong in my procedure?*
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-06-21 Thread Christine Caulfield
On 21/06/18 14:27, Christine Caulfield wrote:
> On 21/06/18 12:05, Jason Gauthier wrote:
>> On Thu, Jun 21, 2018 at 5:11 AM Christine Caulfield  
>> wrote:
>>>
>>> On 19/06/18 18:47, Jason Gauthier wrote:
>>>> On Tue, Jun 19, 2018 at 6:58 AM Christine Caulfield  
>>>> wrote:
>>>>>
>>>>> On 19/06/18 11:44, Jason Gauthier wrote:
>>>>>> On Tue, Jun 19, 2018 at 3:25 AM Christine Caulfield 
>>>>>>  wrote:
>>>>>>>
>>>>>>> On 19/06/18 02:46, Jason Gauthier wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>>I've just discovered corosync-qdevice and corosync-qnet.
>>>>>>>> (Thanks Ken Gaillot) . Set up was pretty quick.
>>>>>>>>
>>>>>>>> I enabled qnet off cluster.  I followed the steps presented by
>>>>>>>> corosync-qdevice-net-certutil.However, when running
>>>>>>>> corosync-qdevice it exits.  Even with -f -d there isn't a single
>>>>>>>> output presented.
>>>>>>>>
>>>>>>>
>>>>>>> It sounds like the first time you ran it (without -d -f)
>>>>>>> corosync-qdevice started up and daemonised itself. The second time you
>>>>>>> tried (with -d -f) it couldn't run because there was already one
>>>>>>> running. There's a good argument for it printing an error if it's
>>>>>>> already running I think!
>>>>>>>
>>>>>>
>>>>>> The process doesn't stay running.  I've showed in output of qnet below
>>>>>> that it launches, connected, and disconnects. I've rebooted several
>>>>>> times since then (testing stonith). I can provide strace output if
>>>>>> it's helpful.
>>>>>>
>>>>>
>>>>> yes please
>>>>
>>>> Attached!
>>>>
>>>
>>> That's very odd. I can see communication with the server and corosync in
>>> there (so it's doing something) but no logging at all. When I start
>>> qdevice on my systems it logs loads of messages even if it doesn't
>>> manage to contact the server. Do you have any logging entries in
>>> corosync.conf that might be stopping it?
>>
>> I haven't checked the corosync logs for any entries before, but I just
>> did.  There isn't anything logged.
>>
>>> Where did the binary come from? did you build it yourself or is it from
>>> a package? I wonder if it's got corrupted or is a bad version. Possibly
>>> linked against a 'dodgy' libqb - there have been some things going on
>>> there that could cause logging to go missing in some circumstances.
>>>
>>> Honza (the qdevice expert) is away at the moment, so I'm guessing a bit
>>> here anyway!
>>
>> Hmm. Interesting.  I installed the debian package.  When it didn't
>> work, I grabbed the source from github.  They both act the same way,
>> but if there is an underlying library issue then that will continue to
>> be a problem.
>>
>> It doesn't say much:
>> /usr/lib/x86_64-linux-gnu/libqb.so.0.18.1
>>
>>
> 
> I just tried this on my Debian VM and it does exactly the same as yours.
> So I think you should report it to the Debian maintainer as it doesn't
> happen on my Fedora or RHEL systems
> 

Ah, more light here. I still don't understand why Debian doesn't log
to stderr, but I'm getting messages in /var/log/syslog (Fedora is
different, which is why I missed them) about the security keys on my
system. Are you getting any system log errors on yours?

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-06-21 Thread Christine Caulfield
On 21/06/18 12:05, Jason Gauthier wrote:
> On Thu, Jun 21, 2018 at 5:11 AM Christine Caulfield  
> wrote:
>>
>> On 19/06/18 18:47, Jason Gauthier wrote:
>>> On Tue, Jun 19, 2018 at 6:58 AM Christine Caulfield  
>>> wrote:
>>>>
>>>> On 19/06/18 11:44, Jason Gauthier wrote:
>>>>> On Tue, Jun 19, 2018 at 3:25 AM Christine Caulfield  
>>>>> wrote:
>>>>>>
>>>>>> On 19/06/18 02:46, Jason Gauthier wrote:
>>>>>>> Greetings,
>>>>>>>
>>>>>>>I've just discovered corosync-qdevice and corosync-qnet.
>>>>>>> (Thanks Ken Gaillot) . Set up was pretty quick.
>>>>>>>
>>>>>>> I enabled qnet off cluster.  I followed the steps presented by
>>>>>>> corosync-qdevice-net-certutil.However, when running
>>>>>>> corosync-qdevice it exits.  Even with -f -d there isn't a single
>>>>>>> output presented.
>>>>>>>
>>>>>>
>>>>>> It sounds like the first time you ran it (without -d -f)
>>>>>> corosync-qdevice started up and daemonised itself. The second time you
>>>>>> tried (with -d -f) it couldn't run because there was already one
>>>>>> running. There's a good argument for it printing an error if it's
>>>>>> already running I think!
>>>>>>
>>>>>
>>>>> The process doesn't stay running.  I've showed in output of qnet below
>>>>> that it launches, connected, and disconnects. I've rebooted several
>>>>> times since then (testing stonith). I can provide strace output if
>>>>> it's helpful.
>>>>>
>>>>
>>>> yes please
>>>
>>> Attached!
>>>
>>
>> That's very odd. I can see communication with the server and corosync in
>> there (so it's doing something) but no logging at all. When I start
>> qdevice on my systems it logs loads of messages even if it doesn't
>> manage to contact the server. Do you have any logging entries in
>> corosync.conf that might be stopping it?
> 
> I haven't checked the corosync logs for any entries before, but I just
> did.  There isn't anything logged.
> 
>> Where did the binary come from? did you build it yourself or is it from
>> a package? I wonder if it's got corrupted or is a bad version. Possibly
>> linked against a 'dodgy' libqb - there have been some things going on
>> there that could cause logging to go missing in some circumstances.
>>
>> Honza (the qdevice expert) is away at the moment, so I'm guessing a bit
>> here anyway!
> 
> Hmm. Interesting.  I installed the debian package.  When it didn't
> work, I grabbed the source from github.  They both act the same way,
> but if there is an underlying library issue then that will continue to
> be a problem.
> 
> It doesn't say much:
> /usr/lib/x86_64-linux-gnu/libqb.so.0.18.1
> 
> 

I just tried this on my Debian VM and it does exactly the same as yours.
So I think you should report it to the Debian maintainer as it doesn't
happen on my Fedora or RHEL systems.

Chrissie


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-06-21 Thread Christine Caulfield
On 19/06/18 18:47, Jason Gauthier wrote:
> On Tue, Jun 19, 2018 at 6:58 AM Christine Caulfield  
> wrote:
>>
>> On 19/06/18 11:44, Jason Gauthier wrote:
>>> On Tue, Jun 19, 2018 at 3:25 AM Christine Caulfield  
>>> wrote:
>>>>
>>>> On 19/06/18 02:46, Jason Gauthier wrote:
>>>>> Greetings,
>>>>>
>>>>>I've just discovered corosync-qdevice and corosync-qnet.
>>>>> (Thanks Ken Gaillot) . Set up was pretty quick.
>>>>>
>>>>> I enabled qnet off cluster.  I followed the steps presented by
>>>>> corosync-qdevice-net-certutil.However, when running
>>>>> corosync-qdevice it exits.  Even with -f -d there isn't a single
>>>>> output presented.
>>>>>
>>>>
>>>> It sounds like the first time you ran it (without -d -f)
>>>> corosync-qdevice started up and daemonised itself. The second time you
>>>> tried (with -d -f) it couldn't run because there was already one
>>>> running. There's a good argument for it printing an error if it's
>>>> already running I think!
>>>>
>>>
>>> The process doesn't stay running.  I've showed in output of qnet below
>>> that it launches, connected, and disconnects. I've rebooted several
>>> times since then (testing stonith). I can provide strace output if
>>> it's helpful.
>>>
>>
>> yes please
> 
> Attached!
> 

That's very odd. I can see communication with the server and corosync in
there (so it's doing something) but no logging at all. When I start
qdevice on my systems it logs loads of messages even if it doesn't
manage to contact the server. Do you have any logging entries in
corosync.conf that might be stopping it?

Where did the binary come from? did you build it yourself or is it from
a package? I wonder if it's got corrupted or is a bad version. Possibly
linked against a 'dodgy' libqb - there have been some things going on
there that could cause logging to go missing in some circumstances.

Honza (the qdevice expert) is away at the moment, so I'm guessing a bit
here anyway!

Chrissie

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-06-19 Thread Christine Caulfield
On 19/06/18 11:44, Jason Gauthier wrote:
> On Tue, Jun 19, 2018 at 3:25 AM Christine Caulfield  
> wrote:
>>
>> On 19/06/18 02:46, Jason Gauthier wrote:
>>> Greetings,
>>>
>>>I've just discovered corosync-qdevice and corosync-qnet.
>>> (Thanks Ken Gaillot) . Set up was pretty quick.
>>>
>>> I enabled qnet off cluster.  I followed the steps presented by
>>> corosync-qdevice-net-certutil.However, when running
>>> corosync-qdevice it exits.  Even with -f -d there isn't a single
>>> output presented.
>>>
>>
>> It sounds like the first time you ran it (without -d -f)
>> corosync-qdevice started up and daemonised itself. The second time you
>> tried (with -d -f) it couldn't run because there was already one
>> running. There's a good argument for it printing an error if it's
>> already running I think!
>>
> 
> The process doesn't stay running.  I've showed in output of qnet below
> that it launches, connected, and disconnects. I've rebooted several
> times since then (testing stonith). I can provide strace output if
> it's helpful.
> 

yes please

Chrissie


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)

2018-06-19 Thread Christine Caulfield
On 19/06/18 02:46, Jason Gauthier wrote:
> Greetings,
> 
>I've just discovered corosync-qdevice and corosync-qnet.
> (Thanks Ken Gaillot) . Set up was pretty quick.
> 
> I enabled qnet off cluster.  I followed the steps presented by
> corosync-qdevice-net-certutil.However, when running
> corosync-qdevice it exits.  Even with -f -d there isn't a single
> output presented.
> 

It sounds like the first time you ran it (without -d -f)
corosync-qdevice started up and daemonised itself. The second time you
tried (with -d -f) it couldn't run because there was already one
running. There's a good argument for it printing an error if it's
already running I think!

Chrissie

> But, if I run qnet with -f -d I can see the qdevices are connecting.
> 
> Jun 18 21:19:32 debug   Initializing nss
> Jun 18 21:19:32 debug   Initializing local socket
> Jun 18 21:19:32 debug   Creating listening socket
> Jun 18 21:19:32 debug   Registering algorithms
> Jun 18 21:19:32 debug   QNetd ready to provide service
> Jun 18 21:19:36 debug   New client connected
> Jun 18 21:19:36 debug cluster name = zeta
> Jun 18 21:19:36 debug tls started = 1
> Jun 18 21:19:36 debug tls peer certificate verified = 1
> Jun 18 21:19:36 debug node_id = 1084772368
> Jun 18 21:19:36 debug pointer = 0x55b1b0416d70
> Jun 18 21:19:36 debug addr_str = :::192.168.80.16:51024
> Jun 18 21:19:36 debug ring id = (40a85010.88ac)
> Jun 18 21:19:36 debug cluster dump:
> Jun 18 21:19:36 debug   client = :::192.168.80.16:51024,
> node_id = 1084772368
> Jun 18 21:19:36 debug   Client :::192.168.80.16:51024 (cluster
> zeta, node_id 1084772368) sent initial node list.
> Jun 18 21:19:36 debug msg seq num 4
> Jun 18 21:19:36 debug node list:
> Jun 18 21:19:36 error   ffsplit: Received empty config node list for
> client :::192.168.80.16:51024
> Jun 18 21:19:36 error   Algorithm returned error code. Sending error reply.
> Jun 18 21:19:36 debug   Client :::192.168.80.16:51024 (cluster
> zeta, node_id 1084772368) sent membership node list.
> Jun 18 21:19:36 debug msg seq num 5
> Jun 18 21:19:36 debug ring id = (40a85010.88ac)
> Jun 18 21:19:36 debug node list:
> Jun 18 21:19:36 debug   node_id = 1084772368, data_center_id = 0,
> node_state = not set
> Jun 18 21:19:36 debug   node_id = 1084772369, data_center_id = 0,
> node_state = not set
> Jun 18 21:19:36 debug   Algorithm result vote is Ask later
> Jun 18 21:19:36 debug   Client :::192.168.80.16:51024 (cluster
> zeta, node_id 1084772368) sent quorum node list.
> Jun 18 21:19:36 debug msg seq num 6
> Jun 18 21:19:36 debug quorate = 1
> Jun 18 21:19:36 debug node list:
> Jun 18 21:19:36 debug   node_id = 1084772368, data_center_id = 0,
> node_state = member
> Jun 18 21:19:36 debug   node_id = 1084772369, data_center_id = 0,
> node_state = member
> Jun 18 21:19:36 debug   Algorithm result vote is No change
> Jun 18 21:19:36 debug   Client closed connection
> Jun 18 21:19:36 debug   Client :::192.168.80.16:51024
> (init_received 1, cluster zeta, node_id 1084772368) disconnect
> Jun 18 21:19:36 debug   ffsplit: Membership for cluster zeta is now stable
> Jun 18 21:19:36 debug   ffsplit: No quorate partition was selected
> Jun 18 21:19:36 debug   ffsplit: No client gets NACK
> Jun 18 21:19:36 debug   ffsplit: No client gets ACK
> 
> Since it's categorized as a daemon, I thought this would stay running,
> and keep a constant connection.
> 
> corosyn.conf quorum look like
> quorum {
> # Enable and configure quorum subsystem (default: off)
> # see also corosync.conf.5 and votequorum.5
> #   two_node: 1
> provider: corosync_votequorum
> expected_votes: 3
> device {
> votes: 1
> model: net
> net {
>   host: delta
>   }
> }
> }
> 
> Thanks!
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync not able to form cluster

2018-06-08 Thread Christine Caulfield
On 07/06/18 18:32, Prasad Nagaraj wrote:
> Hi Christine - Got it:)
> 
> I have collected few seconds of debug logs from all nodes after startup.
> Please find them attached.
> Please let me know if this will help us to identify rootcause.
> 

The problem is on node coro.4 - it never gets out of the JOIN process:

"Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11."

so something is wrong on that node: either a rogue routing table entry,
a dangling iptables rule, or even a broken NIC.
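
A few quick things to check on that node might be (just a sketch,
assuming the usual iproute2/iptables tools are installed):

   # ip route show        # look for a rogue route covering the cluster subnet
   # iptables -L -n -v    # look for a stray rule blocking corosync traffic
   # ip -s link show      # error/drop counters can point at a broken NIC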

Chrissie

> Thanks!
> 
> On Thu, Jun 7, 2018 at 8:43 PM, Christine Caulfield  <mailto:ccaul...@redhat.com>> wrote:
> 
> On 07/06/18 15:53, Prasad Nagaraj wrote:
> > Hi - As you can see in the corosync.conf details - i have already kept
> > debug: on
> > 
> 
> But only in the (disabled) AMF subsystem, not for corosync as a whole :)
> 
>     logger_subsys {
>     subsys: AMF
>     debug: on
>         }
> 
> 
> Chrissie
> 
> 
> > 
> > On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield,  <mailto:ccaul...@redhat.com>
> > <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
> >
> >     On 07/06/18 15:24, Prasad Nagaraj wrote:
> >     >
> >     > No iptables or otherwise firewalls are setup on these nodes.
> >     >
> >     > One observation is that each node sends messages on with its
> own ring
> >     > sequence number which is not converging.. I have seen that
> in a good
> >     > cluster, when nodes respond with same sequence number, the
> >     membership is
> >     > automatically formed. But in our case, that is not the case.
> >     >
> >
> >     That's just a side-effect of the cluster not forming. It's not
> causing
> >     it. Can you enable full corosync debugging (just add debug:on
> to the end
> >     of the logging {} stanza) and see if that has any more useful
> >     information (I only need the corosync bits, not the pcmk ones)
> >
> >     Chrissie
> >
> >     > Example: we can see that one node sends
> >     > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update:
> >     Transitional
> >     > membership event on ring 71084: memb=1, new=0, lost=0
> >     > .
> >     > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update:
> >     Transitional
> >     > membership event on ring 71096: memb=1, new=0, lost=0
> >     > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update:
> Stable
> >     > membership event on ring 71096: memb=1, new=0, lost=0
> >     >
> >     > other node sends messages with its own numbers
> >     > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update:
> >     Transitional
> >     > membership event on ring 71088: memb=1, new=0, lost=0
> >     > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update:
> Stable
> >     > membership event on ring 71088: memb=1, new=0, lost=0
> >     > ...
> >     > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update:
> >     Transitional
> >     > membership event on ring 71100: memb=1, new=0, lost=0
> >     > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update:
> Stable
> >     > membership event on ring 71100: memb=1, new=0, lost=0
> >     >
> >     > Any idea why this happens, and why the seq. numbers from
> different
> >     nodes
> >     > are not converging ?
> >     >
> >     > Thanks!
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > ___
> >     > Users mailing list: Users@clusterlabs.org
> <mailto:Users@clusterlabs.org>
> >     <mailto:Users@clusterlabs.org <mailto:Users@clusterlabs.org>>
> >     > https://lists.clusterlabs.org/mailman/listinfo/users
> <https://lists.clusterlabs.org/mailman/listinfo/users>
> >     >
> >     > Project Home: http://www.clusterlabs.org
> >     > Getting started:
> >     http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> >     > Bugs: http://bugs.clusterlabs.org
> >     >

Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Christine Caulfield
On 07/06/18 15:53, Prasad Nagaraj wrote:
> Hi - As you can see in the corosync.conf details - i have already kept
> debug: on
> 

But only in the (disabled) AMF subsystem, not for corosync as a whole :)

    logger_subsys {
        subsys: AMF
        debug: on
    }
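
For corosync as a whole, debug goes at the top level of the logging {}
stanza - something like this (a sketch; keep whatever other logging
options you already have, to_syslog is shown here only as an example):

    logging {
        to_syslog: yes
        debug: on
    }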


Chrissie


> 
> On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield,  <mailto:ccaul...@redhat.com>> wrote:
> 
> On 07/06/18 15:24, Prasad Nagaraj wrote:
> >
> > No iptables or otherwise firewalls are setup on these nodes.
> >
> > One observation is that each node sends messages on with its own ring
> > sequence number which is not converging.. I have seen that in a good
> > cluster, when nodes respond with same sequence number, the
> membership is
> > automatically formed. But in our case, that is not the case.
> >
> 
> That's just a side-effect of the cluster not forming. It's not causing
> it. Can you enable full corosync debugging (just add debug:on to the end
> of the logging {} stanza) and see if that has any more useful
> information (I only need the corosync bits, not the pcmk ones)
> 
> Chrissie
> 
> > Example: we can see that one node sends
> > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional
> > membership event on ring 71084: memb=1, new=0, lost=0
> > .
> > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional
> > membership event on ring 71096: memb=1, new=0, lost=0
> > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71096: memb=1, new=0, lost=0
> >
> > other node sends messages with its own numbers
> > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional
> > membership event on ring 71088: memb=1, new=0, lost=0
> > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71088: memb=1, new=0, lost=0
> > ...
> > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional
> > membership event on ring 71100: memb=1, new=0, lost=0
> > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71100: memb=1, new=0, lost=0
> >
> > Any idea why this happens, and why the seq. numbers from different
> nodes
> > are not converging ?
> >
> > Thanks!
> >
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> <mailto:Users@clusterlabs.org>
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Christine Caulfield
On 07/06/18 15:24, Prasad Nagaraj wrote:
> 
> No iptables or otherwise firewalls are setup on these nodes.
> 
> One observation is that each node sends messages on with its own ring
> sequence number which is not converging.. I have seen that in a good
> cluster, when nodes respond with same sequence number, the membership is
> automatically formed. But in our case, that is not the case.
> 

That's just a side-effect of the cluster not forming. It's not causing
it. Can you enable full corosync debugging (just add "debug: on" to the
end of the logging {} stanza) and see if that has any more useful
information? (I only need the corosync bits, not the pcmk ones.)

Chrissie

> Example: we can see that one node sends
> Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71084: memb=1, new=0, lost=0
> .
> Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71096: memb=1, new=0, lost=0
> Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71096: memb=1, new=0, lost=0
> 
> other node sends messages with its own numbers
> Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71088: memb=1, new=0, lost=0
> Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71088: memb=1, new=0, lost=0
> ...
> Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71100: memb=1, new=0, lost=0
> Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71100: memb=1, new=0, lost=0
> 
> Any idea why this happens, and why the seq. numbers from different nodes
> are not converging ?
> 
> Thanks!
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Christine Caulfield
> 10:25:30.647968 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.672207 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.684604 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:30.707733 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:30.707760 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.731354 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.744345 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:30.767456 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:30.767483 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.791532 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.804432 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:30.827539 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:30.827563 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.850832 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.863531 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:30.886664 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:30.886691 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.910820 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.923403 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:30.946507 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:30.946531 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:30.970931 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 10:25:30.983055 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP,
> length 332
> 10:25:31.006306 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP,
> length 332
> 10:25:31.006339 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP,
> length 332
> 10:25:31.030207 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP,
> length 376
> 
> 
> 
> And here is the lsof output for each node.
> lsof -i | grep corosync
> corosync  47873      root   10u  IPv4 1193147      0t0  UDP
> 172.22.0.4:netsupport
> corosync  47873      root   13u  IPv4 1193151      0t0  UDP
> 172.22.0.4:45846 <http://172.22.0.4:45846>
> corosync  47873      root   14u  IPv4 1193152      0t0  UDP
> 172.22.0.4:34060 <http://172.22.0.4:34060>
> corosync  47873      root   15u  IPv4 1193153      0t0  UDP
> 172.22.0.4:40755 <http://172.22.0.4:40755>
> 
> lsof -i | grep corosync
> corosync  11039      root   10u  IPv4   54862      0t0  UDP
> 172.22.0.13:netsupport
> corosync  11039      root   13u  IPv4   54869      0t0  UDP
> 172.22.0.13:50468 <http://172.22.0.13:50468>
> corosync  11039      root   14u  IPv4   54870      0t0  UDP
> 172.22.0.13:57332 <http://172.22.0.13:57332>
> corosync  11039      root   15u  IPv4   54871      0t0  UDP
> 172.22.0.13:46460 <http://172.22.0.13:46460>
> 
>  lsof -i | grep corosync
> corosync  75188      root   10u  IPv4 1582737      0t0  UDP
> 172.22.0.11:netsupport
> corosync  75188      root   13u  IPv4 1582741      0t0  UDP
> 172.22.0.11:54545 <http://172.22.0.11:54545>
> corosync  75188      root   14u  IPv4 1582742      0t0  UDP
> 172.22.0.11:53213 <http://172.22.0.11:53213>
> corosync  75188      root   15u  IPv4 1582743      0t0  UDP
> 172.22.0.11:44864 <http://172.22.0.11:44864>
> 
> 
> Thanks!
> 
> 
> 
> On Thu, Jun 7, 2018 at 3:33 PM, Christine Caulfield
> mailto:ccaul...@redhat.com>> wrote:
> 
> On 07/06/18 09:21, Prasad Nagaraj wrote:
> > Hi - I am running corosync on  3 nodes of CentOS release 6.9 
> (Final).
> > Corosync version is  corosync-1.4.7.
> > The nodes are not seeing each other and not able to form 
> memberships.
> > What I see is continuous message about " A processor joined or left 
> the
> > membership and a new membership was formed."
> > For example:on node:  vm2883711991 
> > 
> 
> I can'

Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Christine Caulfield
On 07/06/18 09:21, Prasad Nagaraj wrote:
> Hi - I am running corosync on  3 nodes of CentOS release 6.9 (Final).
> Corosync version is  corosync-1.4.7.
> The nodes are not seeing each other and not able to form memberships.
> What I see is continuous message about " A processor joined or left the
> membership and a new membership was formed."
> For example:on node:  vm2883711991 
> 

I can't draw any conclusions from the logs; we'd need to see what
corosync thought it was binding to and the IP addresses of the hosts.

Have a look at the start of the logs and see if they match what you'd
expect (ie are similar to the ones on the working clusters). Also check,
using lsof, what addresses corosync is bound to. tcpdump on port 5405
will show you if traffic is leaving the nodes and being received.
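
For example (a sketch - eth0 is just a guess, use whichever interface
carries the cluster traffic):

   # lsof -i | grep corosync        # which addresses/ports corosync is bound to
   # tcpdump -n -i eth0 port 5405   # is totem traffic leaving and arriving?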

Also check firewall settings and make sure the nodes can ping each other.

If you're still stumped then feel free to post more info here for us to
look at, though if you have that configuration working on other nodes it
might be something in your environment.

Chrissie


> 
> Jun 07 07:54:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jun 07 07:54:52 corosync [CPG   ] chosen downlist: sender r(0)
> ip(172.22.0.11) ; members(old:1 left:0)
> Jun 07 07:54:52 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71084: memb=1, new=0, lost=0
> Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: memb:
> vm2883711991 184555180
> Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71084: memb=1, new=0, lost=0
> Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jun 07 07:55:04 corosync [CPG   ] chosen downlist: sender r(0)
> ip(172.22.0.11) ; members(old:1 left:0)
> Jun 07 07:55:04 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71096: memb=1, new=0, lost=0
> Jun 07 07:55:16 corosync [pcmk  ] info: pcmk_peer_update: memb:
> vm2883711991 184555180
> Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71096: memb=1, new=0, lost=0
> Jun 07 07:55:16 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:55:16 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jun 07 07:55:16 corosync [CPG   ] chosen downlist: sender r(0)
> ip(172.22.0.11) ; members(old:1 left:0)
> Jun 07 07:55:16 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jun 07 07:55:28 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71108: memb=1, new=0, lost=0
> Jun 07 07:55:28 corosync [pcmk  ] info: pcmk_peer_update: memb:
> vm2883711991 184555180
> Jun 07 07:55:28 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71108: memb=1, new=0, lost=0
> Jun 07 07:55:28 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:55:28 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jun 07 07:55:28 corosync [CPG   ] chosen downlist: sender r(0)
> ip(172.22.0.11) ; members(old:1 left:0)
> Jun 07 07:55:28 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jun 07 07:55:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71120: memb=1, new=0, lost=0
> Jun 07 07:55:40 corosync [pcmk  ] info: pcmk_peer_update: memb:
> vm2883711991 184555180
> Jun 07 07:55:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71120: memb=1, new=0, lost=0
> Jun 07 07:55:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:55:40 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jun 07 07:55:40 corosync [CPG   ] chosen downlist: sender r(0)
> ip(172.22.0.11) ; members(old:1 left:0)
> Jun 07 07:55:40 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jun 07 07:55:52 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> membership event on ring 71132: memb=1, new=0, lost=0
> Jun 07 07:55:52 corosync [pcmk  ] info: pcmk_peer_update: memb:
> vm2883711991 184555180
> Jun 07 07:55:52 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 71132: memb=1, new=0, lost=0
> Jun 07 07:55:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> vm2883711991 184555180
> Jun 07 07:55:52 corosync [TOTEM ] A pr

Re: [ClusterLabs] Failure of preferred node in a 2 node cluster

2018-04-29 Thread Christine Caulfield
On 29/04/18 13:22, Andrei Borzenkov wrote:
> 29.04.2018 04:19, Wei Shan wrote:
>> Hi,
>>
>> I'm using Redhat Cluster Suite 7with watchdog timer based fence agent. I
>> understand this is a really bad setup but this is what the end-user wants.
>>
>> ATB => auto_tie_breaker
>>
>> "When the auto_tie_breaker is used in even-number member clusters, then the
>> failure of the partition containing the auto_tie_breaker_node (by default
>> the node with lowest ID) will cause other partition to become inquorate and
>> it will self-fence. In 2-node clusters with auto_tie_breaker this means
>> that failure of node favoured by auto_tie_breaker_node (typically nodeid 1)
>> will result in reboot of other node (typically nodeid 2) that detects the
>> inquorate state. If this is undesirable then corosync-qdevice can be used
>> instead of the auto_tie_breaker to provide additional vote to quorum making
>> behaviour closer to odd-number member clusters."
>>
> 
> That's not what upstream corosync manual pages says. Corosync itself
> won't initiate self-fencing, it just marks node as being out of quorum.
> What happens later depends on higher layers like pacemaker. Pacemaker
> can be configured to commit suicide, but can also be configured to
> ignore quorum completely. I am not familiar with details how RHCS
> behaves by default.
> 
> I just tested on vanilla corosync+pacemaker (openSUSE Tumbleweed) and
> nothing happens when I kill lowest node in two-node configuration.
> 

That is the expected behaviour for a 2-node ATB cluster. If the
preferred node is not available then the remaining node will stall until
it comes back again. It sounds odd, but that's what happens. A preferred
node is a preferred node. If the preference can move from one node to
the other when it fails, then it's not a preferred node ... it's just a
node :)
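
In corosync.conf terms the preference is the auto_tie_breaker_node
setting (the lowest node id by default) - a sketch, not taken from the
poster's configuration:

    quorum {
        provider: corosync_votequorum
        auto_tie_breaker: 1
        auto_tie_breaker_node: 1
    }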

If you need fully resilient failover for 2 nodes then qdevice is more
likely what you need.

Chrissie


> If your cluster nodes are configured to commit suicide, what happens
> after reboot depends on at least wait_for_all corosync setting. With
> wait_for_all=1 (default in two_node) and without a) ignoring quorum
> state and b) having fencing resource pacemaker on your node will wait
> indefinitely after reboot because partner is not available.
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Announcing the first ClusterLabs video karaoke contest!

2018-04-03 Thread Christine Caulfield
On 03/04/18 07:14, Klaus Wenninger wrote:
> On 04/02/2018 02:57 AM, Digimer wrote:
>> On 2018-04-01 05:30 PM, Ken Gaillot wrote:
>>> In honor of the recent 10th anniversary of the first public release of
>>> Pacemaker, ClusterLabs is proud to announce its first video karaoke
>>> contest!
>>>
>>> To participate, simply record video of yourself singing karaoke to this
>>> tune:
>>>
>>>   https://www.youtube.com/watch?v=r7TADGV2fLI
>>>
>>> using these lyrics:
>>>
>>>   Sometimes it's hard to be a sysop
>>>   Running all your jobs on just one host.
>>>   You'll have bad times
>>>   And it'll have downtimes,
>>>   Doin' things that you don't understand.
>>>   But if you need it, you'll cluster it,
>>>   even though logs are hard to understand.
>>>   And if you built it, Oh be proud of it,
>>>   'Cause after all it's just a node.
>>>   Standby your node,
>>>   Because you want to upgrade
>>>   And watch your resource migrate.
>>>   Five nines are bold and lovely.
>>>   Standby your node,
>>>   And serve the world your resource.
>>>   Keep serving all the things you can.
>>>   Standby your node.
>>>   Standby your node,
>>>   And show the world your uptime.
>>>   Keep serving all the things you can.
>>>   Standby your node.
>>>
>>> Users list members will vote on all submissions, and the winner will
>>> receive a COMPLETE SET of all available ClusterLabs swag!*
>> Ah crap, I'm already sobering up... Probably for the best; No known HA
>> cluster would survive my singing.
>>
> Let me chime in on that.
> My best way of honoring the anniversary is probably not to sing ;-)

Likewise - Can't sing, won't sing. (it would be nice if more people
obeyed this mantra TBH)

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-13 Thread Christine Caulfield
On 09/03/18 16:26, Jan Friesse wrote:
> Thomas,
> 
>> Hi,
>>
>> On 3/7/18 1:41 PM, Jan Friesse wrote:
>>> Thomas,
>>>
 First thanks for your answer!

 On 3/7/18 11:16 AM, Jan Friesse wrote:
> 
> ...
> 
>> TotemConfchgCallback: ringid (1.1436)
>> active processors 3: 1 2 3
>> EXIT
>> Finalize  result is 1 (should be 1)
>>
>>
>> Hope I did both test right, but as it reproduces multiple times
>> with testcpg, our cpg usage in our filesystem, this seems like
>> valid tested, not just an single occurrence.
> 
> I've tested it too and yes, you are 100% right. Bug is there and it's
> pretty easy to reproduce when node with lowest nodeid is paused. It's
> slightly harder when node with higher nodeid is paused.
> 
> Most of the clusters are using power fencing, so they simply never sees
> this problem. That may be also the reason why it wasn't reported long
> time ago (this bug exists virtually at least since OpenAIS Whitetank).
> So really nice work with finding this bug.
> 
> What I'm not entirely sure is what may be best way to solve this
> problem. What I'm sure is, that it's going to be "fun" :(
> 
> Lets start with very high level of possible solutions:
> - "Ignore the problem". CPG behaves more or less correctly. "Current"
> membership really didn't changed so it doesn't make too much sense to
> inform about change. It's possible to use cpg_totem_confchg_fn_t to find
> out when ringid changes. I'm adding this solution just for completeness,
> because I don't prefer it at all.
> - cpg_confchg_fn_t adds all left and back joined into left/join list
> - cpg will sends extra cpg_confchg_fn_t call about left and joined
> nodes. I would prefer this solution simply because it makes cpg behavior
> equal in all situations.
> 
> Which of the options you would prefer? Same question also for @Ken (->
> what would you prefer for PCMK) and @Chrissie.
> 


The last option makes most sense to me too - it's more consistent and
'what you would expect' I think.

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [corosync] Document on configuring corosync3 with knet

2018-03-02 Thread Christine Caulfield
On 16/01/18 13:46, Christine Caulfield wrote:
> Hi All,
> 
> To get people started with the new things going on with kronosnet and
> corosync3, I've written a document which explains what you can do with
> the new configuration options, how to set up multiple links and much,
> much more.
> 
> It might be helpful for people who want to write configuration tools for
> the new software or even proper documentation as well as users.
> 
> warning: contains humour.
> http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
> 

I've updated this document to v1.1 with some new information about the
requirement to give nodes a 'name' in the nodelist to remove ambiguities
caused by having fully dynamic ringX_addrs.


Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [corosync] Document on configuring corosync3 with knet

2018-01-16 Thread Christine Caulfield
Hi All,

To get people started with the new things going on with kronosnet and
corosync3, I've written a document which explains what you can do with
the new configuration options, how to set up multiple links and much,
much more.

It might be helpful for people who want to write configuration tools for
the new software or even proper documentation as well as users.

warning: contains humour.
http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Announce] libqb 1.0.3 release

2017-12-21 Thread Christine Caulfield
We are pleased to announce the release of libqb 1.0.3


Source code is available at:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.xz


This is mainly a bug-fix release to 1.0.2

Christine Caulfield (6):
tests: Fix signal handling in check_ipc.c
test: Disable test_max_dgram_size() test as it often breaks on CI
sign tarballs
config: Fix check for fdatasync
tests: make qb logging under check always dispose the memory
warnings cleanup: fix initialiser warning on RHEL7

Jan Pokorný (35):
Fix typos: in{ -> s}tance, d{e -> i}stinguished
Low: loop: don't bring runtime down for a trivial API misuse
Fix typo: repeat{ivi -> ed}ly
build: release.mk: simplify/generalize GPG signing rule
build: release.mk: fix no-release conflict (implied-required version)
build: release.mk: move soft guard for no GPG key up the supply chain
build: release.mk: simplify the default goal, declare .PHONY targets
build: release.mk: reflect current release publishing practice
Doc tweaking (#261)
Low hanging bits (#264)
Typo fix + qb blackbox(8) tweaks and extension + gitignore follow-up (#262)
log: use fdatasync instead of fsync where possible (#263)
maint: make -devel package dependency on the main package arch-qualified
Med: qblog.h: better explanation + behaviour of QB_LOG_INIT_DATA
build: configure: check section boundary symbols present in the test
tests: new sort of tests dubbed "functional", cover linker vs. logging
tests: add a script to generate callsite-heavy logging client...
Med: add extra run-time (client, libqb) checks that logging will work
High: bare fix for libqb logging not working with ld.bfd/binutils 2.29+
Low: fix internal object symbol's leak & expose run-time lib version
doc: qblog.h: syslog rarely appropriate for ordinary programs
doc: qblog.h: further logging setup related tweaks
build: release.mk: deal with trailing whitespace-to-comment-delimiter
warnings cleanup: log: Wextra -> Wimplicit-fallthrough (GCC7+)
warnings cleanup: Wshift-overflow: trigger arithmetic conv. to unsigned
maint: replace 0x constants with UNIT32_MAX
warnings cleanup: hdb+loop_timerlist: Wsign-compare: (canary?) variables
maint: array: avoid magic constants, expose some in the API
warnings cleanup: Wsign-compare: array: int32_t -> size_t
warnings cleanup: Wsign-compare: hdb: uint32_t <-> int32_t
warnings cleanup: Wsign-compare: log_format: int32_t -> size_t
warnings cleanup: Wformat: sign-correct PRIu32 specifiers as appropriate
warnings cleanup: Wunused-function: leave the test commented out
warnings cleanup: give up on some warning classes for now
maint: fix "make maintainer-clean" not working in tests/functional

Kazunori INOUE (1):
configure: define AS_VAR_COPY (#267)

Michael Jones (2):
Adds additional warnings
Adds no-format-nonliteral

jonesmz (1):
Point the link to the Linux kernel coding style document to the right
place (#256)

wferi (2):
Fix spelling: optvat -> optval (#270)
configure: bail out early if POSIX threads support is not detected (#272)

yann-morin-1998 (1):
configure: fix CLOCK_MONOTONIC check for cross-compilation (#269)

Please use the signed .tar.gz or .tar.xz files with the version number
in the name, rather than the github-generated "Source Code" ones.

The documentation at github.io still shows 1.0.2. I'll get this fixed in
the new year as that build is still broken.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Christine Caulfield
On 12/10/17 11:54, Jan Friesse wrote:
> Jonathan,
> 
>>
>>
>> On 12/10/17 07:48, Jan Friesse wrote:
>>> Jonathan,
>>> I believe main "problem" is votequorum ability to work during sync
>>> phase (votequorum is only one service with this ability, see
>>> votequorum_overview.8 section VIRTUAL SYNCHRONY)...
>>>
 Hi ClusterLabs,

 I'm seeing a race condition in corosync where votequorum can have
 incorrect membership info when a node joins the cluster then leaves
 very
 soon after.

 I'm on corosync-2.3.4 plus my patch
> 
> Finally noticed ^^^ 2.3.4 is really old and as long as it is not some
> patched version, I wouldn't recommend to use it. Can you give a try to
> current needle?
> 
 https://github.com/corosync/corosync/pull/248. That patch makes the
 problem readily reproducible but the bug was already present.

 Here's the scenario. I have two hosts, cluster1 and cluster2. The
 corosync.conf on cluster2 is:

  totem {
    version: 2
    cluster_name: test
    config_version: 2
    transport: udpu
  }
  nodelist {
    node {
  nodeid: 1
  ring0_addr: cluster1
    }
    node {
  nodeid: 2
  ring0_addr: cluster2
    }
  }
  quorum {
    provider: corosync_votequorum
    auto_tie_breaker: 1
  }
  logging {
    to_syslog: yes
  }

 The corosync.conf on cluster1 is the same except with
 "config_version: 1".

 I start corosync on cluster2. When I start corosync on cluster1, it
 joins and then immediately leaves due to the lower config_version.
 (Previously corosync on cluster2 would also exit but with
 https://github.com/corosync/corosync/pull/248 it remains alive.)

 But often at this point, cluster1's disappearance is not reflected in
 the votequorum info on cluster2:
>>>
>>> ... Is this permanent (= until new node join/leave it , or it will fix
>>> itself over (short) time? If this is permanent, it's a bug. If it
>>> fixes itself it's result of votequorum not being virtual synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
> 
> 
> That's bad. I've tried following setup:
> 
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running following command:
>   while true;do corosync -f; ssh node2 'corosync-quorumtool | grep Total
> | grep 1' || exit 1;done
> 
> Running it for quite a while and I'm unable to reproduce the bug. Sadly
> I'm unable to reproduce the bug even with 2.3.4. Do you think that
> reproducer is correct?
> 

I can't reproduce it either.

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform

2017-07-06 Thread Christine Caulfield
On 05/07/17 14:55, Ken Gaillot wrote:
> Wow! I'm looking forward to the September summit talk.
> 



Me too! Congratulations on the release :)

Chrissie



> On 07/05/2017 01:52 AM, Digimer wrote:
>> Hi all,
>>
>>   I suspect by now, many of you here have heard me talk about the Anvil!
>> intelligent availability platform. Today, I am proud to announce that it
>> is ready for general use!
>>
>> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
>>
>>   I started five years ago with an idea of building an "Availability
>> Appliance". A single machine where any part could be failed, removed and
>> replaced without needing a maintenance window. A system with no single
>> point of failure anywhere wrapped behind a very simple interface.
>>
>>   The underlying architecture that provides this redundancy was laid
>> down years ago as an early tutorial and has been field tested all over
>> North America and around the world in the years since. In that time, the
>> Anvil! platform has demonstrated over 99.% availability!
>>
>>   Starting back then, the goal was to write the web interface that made
>> it easy to use the Anvil! platform. Then, about two years ago, I decided
>> that an Anvil! could be much, much more than just an appliance.
>>
>>   It could think for itself.
>>
>>   Today, I would like to announce version 2.0.0. This releases
>> introduces the ScanCore "decision engine". ScanCore can be thought of as
>> a sort of "Layer 3" availability platform. Where Corosync provides
>> membership and communications, with Pacemaker (and rgmanager) sitting on
>> top monitoring applications and handling fault detection and recovery,
>> ScanCore sits on top of both, gathering disparate data, analyzing it and
>> making "big picture" decisions on how to best protect the hosted servers.
>>
>>   Examples;
>>
>> 1. All servers are on node 1, and node 1 suffers a cooling fan failure.
>> ScanCore compares against node 2's health, waits a period of time in
>> case it is a transient fault and the autonomously live-migrates the
>> servers to node 2. Later, node 2 suffers a drive failure, degrading the
>> underlying RAID array. ScanCore can then compare the relative risks of a
>> failed fan versus a degraded RAID array, determine that the failed fan
>> is less risky and automatically migrate the servers back to node 1. If a
>> hot-spare kicks in and the array returns to an Optimal state, ScanCore
>> will again migrate the servers back to node 2. When node 1's fan failure
>> is finally repaired, the servers stay on node 2 as there is no benefit
>> to migrating as now both nodes are equally healthy.
>>
>> 2. Input power is lost to one UPS, but not the second UPS. ScanCore
>> knows that good power is available and, so, doesn't react in any way. If
>> input power is lost to both UPSes, however, then ScanCore will decide
>> that the greatest risk the server availability is no longer unexpected
>> component failure, but instead depleting the batteries. Given this, it
>> will decide that the best option to protect the hosted servers is to
>> shed load and maximize run time. if the power stays out for too long,
>> then ScanCore will determine hard off is imminent, and decide to
>> gracefully shut down all hosted servers, withdraw and power off. Later,
>> when power returns, the Striker dashboards will monitor the charge rate
>> of the UPSes and as soon as it is safe to do so, restart the nodes and
>> restore full redundancy.
>>
>> 3. Similar to case 2, ScanCore can gather temperature data from multiple
>> sources and use this data to distinguish localized cooling failures from
>> environmental cooling failures, like the loss of an HVAC or AC system.
>> If the former case, ScanCore will migrate servers off and, if critical
>> temperatures are reached, shut down systems before hardware damage can
>> occur. In the later case, ScanCore will decide that minimizing thermal
>> output is the best way to protect hosted servers and, so, will shed load
>> to accomplish this. If necessary to avoid damage, ScanCore will perform
>> a full shut down. Once ScanCore (on the low-powered Striker dashboards)
>> determines thermal levels are safe again, it will restart the nodes and
>> restore full redundancy.
>>
>>   All of this intelligence is of little use, of course, if it is hard to
>> build and maintain an Anvil! system. Perhaps the greatest lesson learned
>> from our old tutorial was that the barrier to entry had to be reduced
>> dramatically.
>>
>> https://www.alteeve.com/w/Build_an_m2_Anvil!
>>
>>   So, this release also dramatically simplifies how easy it is to go
>> from bare iron to provisioned, protected servers. Even with no
>> experience in availability at all, a tech should be able to go from iron
>> in boxes to provision servers in one or two days. Almost all steps have
>> been automated, which serves the core goal of maximum reliability by
>> minimizing the chances for human error.
>>
>>   This version also introduces the abilit

Re: [ClusterLabs] how to sync data using cmap between cluster

2017-05-25 Thread Christine Caulfield
On 25/05/17 15:48, Rui Feng wrote:
> Hi,
> 
>   I have a test based on corosync 2.3.4, and find the data stored by
> cmap( corosync-cmapctl -s test i8 1) which can't be sync to other
> node.
>   Could somebody give some comment or solution for it, thanks!
> 
>

cmap isn't replicated across the cluster. If you need data replication
then you'll have to use some other method.

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [announce] libqb 1.0.2 release

2017-05-19 Thread Christine Caulfield
I am pleased to announce the 1.0.2 release of libqb


This is mainly a bug-fix release to 1.0.1. There is one new feature
added and that is the option to use filesystem sockets (as opposed to
the more usual abstract sockets) on Linux.

CI: make travis watch for the issue
CI: travis: fix dh -> du + add "lsblk -f" diagnostics
tests: better diagnose test_max_dgram_size test failures
Fix typos: synchonization -> synchronization, paramaters -> parameters
configure: help string cleanup
configure: LTLIBOBJS is also a Make variable
configure: restrict -ldl to where it's actually needed
configure: restrict pthreads to where it's actually needed
configure: restrict socket lib to where it's actually needed
configure: restrict -nsl lib to where it's actually needed
build: drop allegedly no longer intrusive syslog-tests opt-in switch
CI: travis: fix du -> df and capture it also directly from test
ringbuffer: Return error from peek if RB is corrupted.
tests: Fix qb_rb_chunk_peek test so it's consistent with qb_rb_read
loop: don't override external signal handlers
loop: Also set signals changed in qb_loop_signal_mod() back to SIG_DFL
doc: clarify thread-safety (or not) in IPC doc
test: Fix random number generation in IPC tests
Allow Linux to use filesystem sockets
memleak: ipc_socket: properly dispose local-scoped strndup values
memleak: ipc_socket: properly dispose inter-function strdup values
build: follow-up on introducing custom m4 macros
build: Require c99 language support or newer

Huge thanks to all of the people who have contributed to this release.

Chrissie

The current release tarball is here:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.2/libqb-1.0.2.tar.gz

The github repository is here:
https://github.com/ClusterLabs/libqb

Please report bugs and issues in bugzilla:
https://bugzilla.redhat.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: 2-Node Cluster Pointless?

2017-04-18 Thread Christine Caulfield
On 18/04/17 15:02, Digimer wrote:
> On 18/04/17 10:00 AM, Digimer wrote:
>> On 18/04/17 03:47 AM, Ulrich Windl wrote:
>> Digimer  wrote on 16.04.2017 at 20:17 in message
>>> <12cde13f-8bad-a2f1-6834-960ff3afc...@alteeve.ca>:
 On 16/04/17 01:53 PM, Eric Robinson wrote:
> I was reading in "Clusters from Scratch" where Beekhof states, "Some would
>>>
 argue that two-node clusters are always pointless, but that is an argument 
 for another time." Is there a page or thread where this argument has been 
 fleshed out? Most of my dozen clusters are 2 nodes. I hate to think they're
>>>
 pointless.  
>
> --
> Eric Robinson

 There is a belief that you can't build a reliable cluster without
 quorum. I am of the mind that you *can* build a very reliable 2-node
 cluster. In fact, every cluster our company has deployed, going back
 over five years, has been 2-node and has had exceptional uptimes.

 The confusion comes from the belief that quorum is required and stonith
 is optional. The reality is the opposite. I'll come back to this in a minute.

 In a two-node cluster, you have two concerns;

 1. If communication between the nodes fail, but both nodes are alive,
 how do you avoid a split brain?
>>>
>>> By killing one of the two parties.
>>>

 2. If you have a two node cluster and enable cluster startup on boot,
 how do you avoid a fence loop?
>>>
>>> I think the problem in the question is using "you" instead of "it" ;-)
>>> Pacemaker assumes all problems that cause STONITH will be solved by STONITH.
>>> That's not always true (e.g. configuration errors). Maybe a node's failcount
>>> should not be reset if the node was fenced.
>>> So you'll avoid a fencing loop, but might end in a state where no resources
>>> are running. IMHO I'd prefer that over a fencing loop.
>>>

 Many answer #1 by saying "you need a quorum node to break the tie". In
 some cases, this works, but only when all nodes are behaving in a
 predictable manner.
>>>
>>> All software relies on the fact that it behaves in a predictable manner, 
>>> BTW.
>>> The problem is not "the predictable manner for all nodes", but the 
>>> predictable
>>> manner for the cluster.
>>>

 Many answer #2 by saying "well, with three nodes, if a node boots and
 can't talk to either other node, it is inquorate and won't do anything".
>>>
>>> "wan't do anything" is also wrong: I must go offline without killing others,
>>> preferrably.
>>>
 This is a valid mechanism, but it is not the only one.

 So let me answer these from a 2-node perspective;

 1. You use stonith and the faster node lives, the slower node dies. From
>>>
>>> Isn't there a possibility that both nodes shoot each other? Is there a
>>> guarantee that there will always be one faster node?
>>>
 the moment of comms failure, the cluster blocks (needed with quorum,
 too) and doesn't restore operation until the (slower) peer is in a known
 state; Off. You can bias this by setting a fence delay against your
 preferred node. So say node 1 is the node that normally hosts your
 services, then you add 'delay="15"' to node 1's fence method. This tells
 node 2 to wait 15 seconds before fencing node 1. If both nodes are
 alive, node 2 will be fenced before the timer expires.
>>>
>>> Can only the DC issue fencing?
>>>

 2. In Corosync v2+, there is a 'wait_for_all' option that tells a node
 to not do anything until it is able to talk to the peer node. So in the
 case of a fence after a comms break, the node that reboots will come up,
 fail to reach the survivor node and do nothing more. Perfect.
>>>
>>> Does "do nothing more" mean continuously polling for other nodes?
>>>

 Now let me come back to quorum vs. stonith;

 Said simply; Quorum is a tool for when everything is working. Fencing is
 a tool for when things go wrong.
>>>
>>> I'd say: Quorum is the tool to decide who'll be alive and who's going to 
>>> die,
>>> and STONITH is the tool to make nodes die. If everything is working you need
>>> neither quorum nor STONITH.
>>>

 Let's assume that your cluster is working fine, then for whatever reason,
 node 1 hangs hard. At the time of the freeze, it was hosting a virtual
 IP and an NFS service. Node 2 declares node 1 lost after a period of
 time and decides it needs to take over;
>>>
>>> In case node 1 is DC, isn't a selection for a new DC coming first, and the 
>>> new
>>> DC doing the STONITH?
>>>
>>>

 In the 3-node scenario, without stonith, node 2 reforms a cluster with
 node 3 (quorum node), decides that it is quorate, starts its NFS server
 and takes over the virtual IP. So far, so good... Until node 1 comes out
>>>
>>> Again if node 1 was DC, it's not that simple.
>>>
 of its hang. At that moment, node 1 has no idea time has passed. It has
>>>
>>> You assume no fen
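
The wait_for_all and fence-delay behaviour discussed above translates into
configuration roughly as follows. This is only a sketch: the quorum block
shows the stock two-node votequorum settings, and the pcs/fence_ipmilan line
uses a hypothetical stonith resource name with placeholder device parameters.

# corosync.conf: quorum section for a two-node cluster
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implies wait_for_all; shown explicitly for clarity
    wait_for_all: 1
}

# Pacemaker side: bias fencing so the preferred node survives a split
pcs stonith create fence_node1 fence_ipmilan \
    pcmk_host_list=node1 ipaddr=10.0.0.101 login=admin passwd=secret \
    delay=15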

Re: [ClusterLabs] 2-Node Cluster Pointless?

2017-04-18 Thread Christine Caulfield

> 
> This isn't the first time this has come up, so I decided to elaborate on
> this email by writing an article on the topic.
> 
> It's a first-draft so there are likely spelling/grammar mistakes.
> However, the body is done.
> 
> https://www.alteeve.com/w/The_2-Node_Myth
> 

An excellent article. One small point I noticed, though: "Fabric Fencing"
more usually refers to SCSI reservation fencing (so it does not
isolate the node from the cluster, just from shared storage) - often over
Fibre Channel (hence "fabric"), though I believe iSCSI supports it too.

While this doesn't take the node out of the cluster it does prevent
damage to shared storage and allows non-clustered applications to
continue working if it's just the cluster network interconnect that has
failed.

IMO SCSI fencing should never be used on a 2 node cluster for reasons
you have already described very clearly.

Chrissie
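
For readers who have not met SCSI reservation fencing before, here is a
minimal sketch of what it looks like on the Pacemaker side. The device path
and node names are placeholders, and, as noted above, this approach is best
avoided on two-node clusters:

# A fenced node loses access to the listed shared disks but is not powered off
pcs stonith create fence-scsi fence_scsi \
    devices=/dev/mapper/shared-lun pcmk_host_list="node1 node2 node3" \
    meta provides=unfencing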

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-29 Thread Christine Caulfield
On 24/03/17 20:44, Seth Reid wrote:
> I have a three-node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I
> disable the network interface of any one machine,


If you mean by using ifdown or similar then ... don't do that. A proper
test would be to either physically pull the cable or to set up some
iptables rules to block traffic.

Just taking the interface down causes corosync to do odd things.

Chrissie
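
A rough sketch of the iptables approach, assuming the default corosync UDP
ports 5404-5405 and a dedicated cluster interface:

# On the node being "failed": drop corosync traffic instead of using ifdown
iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP

# Undo after the test
iptables -D INPUT  -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP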


 the disabled machine
> is properly fenced, leaving me, briefly, with a two-node cluster. A
> second node is then fenced off immediately, and the remaining node
> appears to try to fence itself off. This leave two nodes with
> corosync/pacemaker stopped, and the remaining machine still in the
> cluster but showing an offline node and an UNCLEAN node. What can be
> causing this behavior?
> 
> Each machine has a dedicated network interface for the cluster, and
> there is a vlan on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 2 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> failed, forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> the leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> id=2 and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
> node 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> node 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> node 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG
> API failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered failed

Re: [ClusterLabs] corosync dead loop in segfault handler

2017-03-14 Thread Christine Caulfield
On 11/03/17 01:32, cys wrote:
> At 2017-03-09 18:25:59, "Christine Caulfield"  wrote:
>> Thanks. Oddly that looks like a totally different incident to the core
>> file we had last time. That seemed to be in a node state transition
>> whereas this is in stable running. The last thing to happen was an IPC
>> connection which indicates that libqb might be possibly involved. I
>> recently identified a bug in libqb that's triggered by using it for
>> multithreaded IPC access, but the only Red Hat software that does that
>> is clvmd and the use pattern in the black box output is not clvmd. So
>> unless you have some custom-written multi-threaded software that uses
>> libcmap extensively (do you?) then I'm none-the-wiser I'm afraid :/
>>
> 
> Sorry. I made a mistake. It's not an infinite loop. Corosync was just consuming
> a lot of CPU.
> And we don't have custom software that uses libcmap.
> 

OK thanks. Did corosync recover from the problem or did you have to
restart it?

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync cannot acquire quorum

2017-03-13 Thread Christine Caulfield
On 11/03/17 02:50, cys wrote:
> We have a cluster containing 3 nodes (nodeA, nodeB, nodeC).
> After nodeA is taken offline (by ifdown, which may not be right?),

ifdown isn't right, no. You need to do a physical cable pull or use
iptables to simulate loss of traffic; ifdown does odd things to corosync!

Chrissie

 nodeC
> cannot acquire quorum while nodeB can.
> NodeC: corosync-quorumtool -s
> Quorum information
> --
> Date: Sat Mar 11 10:42:22 2017
> Quorum provider:  corosync_votequorum
> Nodes: 2
> Node ID:  2
> Ring ID:  2/24
> Quorate:  No
> 
> Votequorum information
> --
> Expected votes:   3
> Highest expected: 3
> Total votes:  2
> Quorum:  2 Activity blocked
> Flags:WaitForAll 
> 
> Membership information
> --
> Nodeid  Votes Name
> 
>  2  1 68.68.68.15 (local)
>  3  1 68.68.68.16
> 
> NodeB: corosync-quorumtools -s
> Quorum information
> --
> Date: Sat Mar 11 10:45:44 2017
> Quorum provider:  corosync_votequorum
> Nodes:2
> Node ID:  3
> Ring ID:  2/24
> Quorate:  Yes
> 
> Votequorum information
> --
> Expected votes:   3
> Highest expected: 3
> Total votes:  2
> Quorum:  2  
> Flags:Quorate 
> 
> Membership information
> --
> Nodeid  Votes Name
>  2  1 68.68.68.15
>  3  1 68.68.68.16 (local)
> 
> So what's the problem?
> Thanks.
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync dead loop in segfault handler

2017-03-09 Thread Christine Caulfield
On 08/03/17 11:04, cys wrote:
> At 2017-02-21 00:24:33, "Christine Caulfield"  wrote:
>> Thanks, I can read that core now. It's something odd happening in the
>> sync() code that I can't quite diagnose without the blackbox. We've only
>> ever seen crashes like that when there's been network corruption or
>> on-wire incompatibilities. Has it happened before?
>>
>> Chrissie
>>
> 
> We caught another infloop today. Here is the blackbox in attachment.
> 

Thanks. Oddly that looks like a totally different incident to the core
file we had last time. That seemed to be in a node state transition
whereas this is in stable running. The last thing to happen was an IPC
connection which indicates that libqb might be possibly involved. I
recently identified a bug in libqb that's triggered by using it for
multithreaded IPC access, but the only Red Hat software that does that
is clvmd and the use pattern in the black box output is not clvmd. So
unless you have some custom-written multi-threaded software that uses
libcmap extensively (do you?) then I'm none-the-wiser I'm afraid :/

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Q: cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error retrying

2017-03-03 Thread Christine Caulfield
On 03/03/17 12:59, Ulrich Windl wrote:
> Hello!
> 
> After update and reboot of the 2nd of three nodes (SLES11 SP4) I see a 
> "cluster-dlm[4494]: setup_cpg_daemon: daemon cpg_join error retrying" message 
> when I expected the node to join the cluster. What can be the reasons for 
> this?
> In fact this seems to have killed cluster communication, because I saw that 
> "DLM start" timed out. The other nodes were unable to use DLM during that 
> time (while the node could not join).
> 
> I saw that corosync starts before the firewall in SLES11 SP4; maybe that's a 
> problem.
> 

Could be. It sounds like something hasn't started properly and that's
most usually caused by either the network being down or ports
unavailable. This can cause corosync to not know its local node name (or
have one that doesn't match what's in the config file), or DLM to fail to start.

> I tried an "rcopenais stop" of the problem node, which in turn caused a node 
> fence (DLM stop timed out, too), and then the other nodes were able to 
> communicate again. During boot the problem node was able to join the cluster 
> as before. In the meantime I had also updated the third node without a 
> problem, so it looks like a rare race condition to me.
> Any insights?
> 
> Could the problem be related to one of these messages?
> crmd[3656]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512321
> corosync[3646]:  [pcmk  ] info: update_member: 0x64bc90 Node 739512325 
> ((null)) born on: 3352
> stonith-ng[3652]:   notice: get_node_name: Could not obtain a node name for 
> classic openais (with plugin) nodeid 739512321
> crmd[3656]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512330
> cib[3651]:   notice: get_node_name: Could not obtain a node name for classic 
> openais (with plugin) nodeid 739512321
> cib[3651]:   notice: crm_update_peer_state: plugin_handle_membership: Node 
> (null)[739512321] - state is now member (was (null))
> 
> crmd: info: crm_get_peer: Created entry 
> 8a7d6859-5ab1-404b-95a0-ba28064763fb/0x7a81f0 for node (null)/739512321 (2 
> total)
> crmd: info: crm_get_peer: Cannot obtain a UUID for node 
> 739512321/(null)
> crmd: info: crm_update_peer:  plugin_handle_membership: Node (null): 
> id=739512321 state=member addr=r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)  (new) 
> votes=0 born=0 seen=3352 proc=
> 


Those messages are all effect rather than cause so it's hard to say.

If the cluster starts up when you attempt it manually after the system
is booted, then it's probably a startup race with something. Network
Manager is often a culprit here, though I don't know SLES.


Chrissie
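
If the firewall ordering does turn out to be the culprit, the ports involved
are corosync's UDP ports (5404-5405 per ring by default). A sketch for
firewalld-based systems; SLES11's SuSEfirewall2 needs the equivalent entry in
its own configuration:

# Allow corosync totem traffic (default mcastport 5405, plus 5404)
firewall-cmd --permanent --add-port=5404-5405/udp
firewall-cmd --reload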

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync dead loop in segfault handler

2017-02-20 Thread Christine Caulfield
On 16/02/17 12:18, cys wrote:
> If you need other packages, let me know.
> 



Thanks, I can read that core now. It's something odd happening in the
sync() code that I can't quite diagnose without the blackbox. We've only
ever seen crashes like that when there's been network corruption or
on-wire incompatibilities. Has it happened before?

Chrissie

> At 2017-02-16 19:38:03, "Christine Caulfield"  wrote:
>> On 16/02/17 09:31, cys wrote:
>>> The attachment includes coredump and logs just before corosync went wrong.
>>>
>>> The packages we use:
>>> corosync-2.3.4-7.el7_2.1.x86_64
>>> corosynclib-2.3.4-7.el7_2.1.x86_64
>>> libqb-0.17.1-2.el7.1.x86_64
>>>
>>> But they are not available any more at mirror.centos.org. If you can't find 
>>> them anywhere, I can send you the RPMs.
>>> The debuginfo packages can be downloaded from 
>>> http://debuginfo.centos.org/7/x86_64/.
>>>
>>
>> Can you send me the RPMs please? I tried the RHEL ones with the same
>> version number but they don't work (it was worth a try!)
>>
>> Thanks
>> Chrissie
>>
>>
>>> Unfortunately corosync was restarted yesterday, and I can't get  the 
>>> blackbox dump covering the day the incident occurred.
>>>
>>> At 2017-02-16 16:00:05, "Christine Caulfield"  wrote:
>>>> On 16/02/17 03:51, cys wrote:
>>>>> At 2017-02-15 23:13:08, "Christine Caulfield"  wrote:
>>>>>>
>>>>>> Yes, it seems that some corosync SEGVs trigger this obscure bug in
>>>>>> libqb. I've chased a few possible causes and none have been fruitful.
>>>>>>
>>>>>> If you get this then corosync has crashed, and this other bug is masking
>>>>>> the actual diagnostics - I know, helpful :/
>>>>>>
>>>>>> It's on my list
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> Thanks.
>>>>> I think you have noticed that my_service_list[3] is invalid.
>>>>> About the SEGV, do you need additional information? coredump or logs?
>>>>>
>>>>
>>>> A blackbox dump and (if possible) coredump would be very useful if you
>>>> can get them. thank you.
>>>>
>>>> Chrissie
>>


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync dead loop in segfault handler

2017-02-16 Thread Christine Caulfield
On 16/02/17 09:31, cys wrote:
> The attachment includes coredump and logs just before corosync went wrong.
> 
> The packages we use:
> corosync-2.3.4-7.el7_2.1.x86_64
> corosynclib-2.3.4-7.el7_2.1.x86_64
> libqb-0.17.1-2.el7.1.x86_64
> 
> But they are not available any more at mirror.centos.org. If you can't find 
> them anywhere, I can send you the RPMs.
> The debuginfo packages can be downloaded from 
> http://debuginfo.centos.org/7/x86_64/.
> 

Can you send me the RPMs please? I tried the RHEL ones with the same
version number but they don't work (it was worth a try!)

Thanks
Chrissie


> Unfortunately corosync was restarted yesterday, and I can't get  the blackbox 
> dump covering the day the incident occurred.
> 
> At 2017-02-16 16:00:05, "Christine Caulfield"  wrote:
>> On 16/02/17 03:51, cys wrote:
>>> At 2017-02-15 23:13:08, "Christine Caulfield"  wrote:
>>>>
>>>> Yes, it seems that some corosync SEGVs trigger this obscure bug in
>>>> libqb. I've chased a few possible causes and none have been fruitful.
>>>>
>>>> If you get this then corosync has crashed, and this other bug is masking
>>>> the actual diagnostics - I know, helpful :/
>>>>
>>>> It's on my list
>>>>
>>>> Chrissie
>>>>
>>>
>>> Thanks.
>>> I think you have noticed that my_service_list[3] is invalid.
>>> About the SEGV, do you need additional information? coredump or logs?
>>>
>>
>> A blackbox dump and (if possible) coredump would be very useful if you
>> can get them. thank you.
>>
>> Chrissie


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync dead loop in segfault handler

2017-02-16 Thread Christine Caulfield
On 16/02/17 03:51, cys wrote:
> At 2017-02-15 23:13:08, "Christine Caulfield"  wrote:
>>
>> Yes, it seems that some corosync SEGVs trigger this obscure bug in
>> libqb. I've chased a few possible causes and none have been fruitful.
>>
>> If you get this then corosync has crashed, and this other bug is masking
>> the actual diagnostics - I know, helpful :/
>>
>> It's on my list
>>
>> Chrissie
>>
> 
> Thanks.
> I think you have noticed that my_service_list[3] is invalid.
> About the SEGV, do you need additional information? coredump or logs?
> 

A blackbox dump and (if possible) coredump would be very useful if you
can get them. thank you.

Chrissie
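
For anyone else asked for the same data, a sketch of how to collect it,
assuming the default file locations (qb-blackbox ships with libqb):

# Flush corosync's flight recorder to disk and decode it
corosync-blackbox

# If a dump file is already on disk (e.g. written during a crash):
qb-blackbox /var/lib/corosync/fdata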

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync dead loop in segfault handler

2017-02-15 Thread Christine Caulfield
On 15/02/17 14:50, Jan Friesse wrote:
>> Hi all,
>>
>> Corosync Cluster Engine, version '2.3.4'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> Today I found corosync consuming 100% cpu. Strace showed following:
>>
>> write(7, "\v\0\0\0", 4) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> write(7, "\v\0\0\0", 4) = -1 EAGAIN (Resource
>> temporarily unavailable)
>>
>> Then I used gcore to get the coredump.
>>
>> (gdb) bt
>> #0  0x7f038b74b1cd in write () from /lib64/libpthread.so.0
>> #1  0x7f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>> #2  <signal handler called>
>> #3  0x in ?? ()
>> #4  0x7f038c220a3d in schedwrk_processor (context=<optimized out>)
>> at sync.c:551
>> #5  0x7f038c23042b in schedwrk_do (type=<optimized out>,
>> context=0x6a12d5630001) at schedwrk.c:77
>> #6  0x7f038bdd49f7 in token_callbacks_execute
>> (type=TOTEM_CALLBACK_TOKEN_SENT, instance=<optimized out>) at
>> totemsrp.c:3493
>> #7  message_handler_orf_token (instance=<optimized out>,
>> msg=<optimized out>, endian_conversion_needed=<optimized out>,
>> msg_len=<optimized out>) at totemsrp.c:3894
>> #8  0x7f038bdd65a5 in message_handler_orf_token
>> (instance=<optimized out>, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3609
>> #9  0x7f038bdcdfb9 in rrp_deliver_fn (context=0x7f038d541840,
>> msg=0x7f038d541af8, msg_len=70) at totemrrp.c:1941
>> #10 0x7f038bdca01e in net_deliver_fn (fd=<optimized out>,
>> revents=<optimized out>, data=0x7f038d541a90) at totemudpu.c:499
>> #11 0x7f038b96576f in _poll_dispatch_and_take_back_
>> (item=0x7f038d4fe168, p=<optimized out>) at loop_poll.c:108
>> #12 0x7f038b965300 in qb_loop_run_level (level=0x7f038d4fde08) at
>> loop.c:43
>> #13 qb_loop_run (lp=<optimized out>) at loop.c:210
>> #14 0x7f038c21b6d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1383
>>
>> (gdb) f 1
>> #1  0x7f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
>> 474 res = write(pipe_fds[1], &sig, sizeof(int32_t));
>> (gdb) info locals
>> sig = 11
>> res = <optimized out>
>> __func__ = "_handle_real_signal_"
>> (gdb) f 4
>> #4  0x7f038c220a3d in schedwrk_processor (context=<optimized out>)
>> at sync.c:551
>> 551
>> my_service_list[my_processing_idx].sync_init (my_trans_list,
>> (gdb) p my_processing_idx
>> $31 = 3
>> (gdb) p my_service_list[3]
>> $32 = {service_id = 0, sync_init = 0x0, sync_abort = 0x0, sync_process
>> = 0x0, sync_activate = 0x0, state = PROCESS, name = '\000' <repeats ... times>}
>>
>> So it seems  corosync dead looping in segfault handler.
>> I have not found any related changelog in the release notes after 2.3.4.
>>
>> Can anyone help please?
> 
> Yep. It looks like (for some reason) the signal pipe was not processed and
> libqb's _handle_real_signal_ is looping. Corosync really cannot do
> anything about it. It looks like a regular libqb bug, so there isn't much you
> can do about it either. CCing Chrissie so she is aware.
> 

Yes, it seems that some corosync SEGVs trigger this obscure bug in
libqb. I've chased a few possible causes and none have been fruitful.

If you get this then corosync has crashed, and this other bug is masking
the actual diagnostics - I know, helpful :/

It's on my list

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync maximum nodes

2017-01-30 Thread Christine Caulfield
On 27/01/17 09:43, Гюльнара Невежина wrote:
> Hello!
> I'm very sorry to disturb you with such a question, but I can't find
> information on whether there is a maximum node limit in corosync. I've found a
> bug report https://bugzilla.redhat.com/show_bug.cgi?id=905296#c5 with
> "Corosync has hardcoded maximum number of nodes to 64" but it was posted
> 4 years ago.
> Does anybody know how many nodes I can add to a future HA cluster?
> 


Even at 64 nodes, corosync needs some tuning to make it reliable. If you
want to go above around 32 nodes then pacemaker-remote is probably the
least stressful (and recommended) way of doing it.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/

Chrissie
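
As a rough illustration of the pacemaker-remote route (node name and address
are placeholders; the Pacemaker Remote document linked above covers the full
setup, including the authkey):

# Integrate an extra node as a remote resource instead of a corosync member
pcs resource create remote1 ocf:pacemaker:remote \
    server=192.168.122.110 reconnect_interval=60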

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] libqb 1.0.1 release

2016-11-24 Thread Christine Caulfield
I am very pleased to announce the 1.0.1 release of libqb

This is a bugfix release with mainly lots of small amendments.

Low: ipc_shm: fix superfluous NULL check
log: Don't overwrite valid tags
Low: further avoid magic in qblog.h by using named constants
Low: log: check for appropriate space when serializing a char
Low: sanitize import of  symbols
Low: sanitize import of  symbols
Low: further sanitize qbipc[cs].h public headers wrt. includes
Med: log_thread: logt_wthread_lock is vital for logging thread
Low: unix: new qb_sys_unlink_or_truncate{,_at} helpers
log: Add missing z,j, & t types to the logger
Med: rb: use new qb_rb_close_helper able to resort to file truncating
Low: log: check for appropriate space when serializing a char
API: introduce alternative, header-based versioning
API: header-based versioning: s/PATCH/MICRO
Low: explain mysterious lines in a public header (qblog.h)
tests: refactor test case defs using versatile add_tcase macro
tests: SIGSTOP cannot be caught, blocked, or ignored
defs: add wrappers over preprocessor operators
build: be more restrictive about QB_HAVE_ATTRIBUTE_SECTION
Add some Hurd support
build: use latest git-version-gen from gnulib (rev. 6118065)
build: persuade git-version-gen vMAJOR.MINOR tags just miss .0
tests: ensure verbose output on failure w/ more recent automake
tests: make clang-friendly (avoid using run-time VLAs)
CI: make travis use also clang compiler (for good measure)
low:fixed:Spelling error of failure in qbhdb.h
Fix typo: qblog.h: q{g -> b}_log_filter_ctl
docs: qbdefs.h: description must directly follow @file
maint: qb-blackbox man page should accompany the binary
Build: configure: do not check for unused "sched" functions
Maint: typo + unused functions checked in configure
tests: resources: check for proper names of leftover processes
doc: elaborate more on thread safety as it's not so pure
log: Remove check for HAVE_SCHED_GET_PRIORITY_MAX
tests: start stdlib failures injection effort with unlink{,at} + test
build: ensure check_SCRIPTS are distributed
build: ensure debug make flags are not derived when unsuitable
build: allow for git -> automatic COPR builds integration
doc: README: add a status badge+link for the COPR builds


Huge thanks to all of the people who have contributed to this release.

Chrissie

The current release tarball is here:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.1/libqb-1.0.1.tar.gz

The github repository is here:
https://github.com/ClusterLabs/libqb

Please report bugs and issues in bugzilla:
https://bugzilla.redhat.com

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


  1   2   >