Re: [ClusterLabs] pacemaker-controld getting respawned

2020-01-06 Thread Ken Gaillot
On Fri, 2020-01-03 at 13:23 +, S Sathish S wrote:
> Hi Team,
>  
> The pacemaker-controld process is being restarted frequently. The
> reason for the failure appears to be a disconnect from the CIB, an
> internal error, or high CPU on the system; this has been recorded in
> our system logs. Please find below the pacemaker and corosync
> versions installed on the system.
>  
> Kindly let us know why we are getting the error below on the system.
>  
> corosync-2.4.4 -> https://github.com/corosync/corosync/tree/v2.4.4
> pacemaker-2.0.2 -> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
>  
> [root@vmc0621 ~]# ps -eo pid,lstart,cmd | grep -iE 'corosync|pacemaker' | grep -v grep
> 2039 Wed Dec 25 15:56:15 2019 corosync
> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld
>  
>  
> In the system message logs:
>  
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update
> 4419 failed: Timer expired (-62)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update
> 4420 failed: Timer expired (-62)

This means that the controller is not getting a response back from the
CIB manager (pacemaker-based) within a reasonable time. If the DC can't
record the status of nodes, it can't make correct decisions, so it has
no choice but to exit (which should lead another node to fence it).

The default timeout is the number of active nodes in the cluster times
10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
would be concerned if the CIB isn't responsive for that long.
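
As a rough illustration of that rule (a sketch of the arithmetic only,
not Pacemaker's actual code; the function name is made up for this
example):

```python
def default_cib_timeout(active_nodes: int) -> int:
    """Default timeout in seconds as described above.

    Assumption: this simply restates the rule from the text
    (10 seconds per active node, with a 30-second minimum);
    it is not taken from the Pacemaker source.
    """
    return max(30, 10 * active_nodes)
```

So a two-node cluster would wait 30 seconds under this rule, and a
five-node cluster 50 seconds.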

The logs from pacemaker-based before this point might be helpful,
although if it's not getting scheduled any CPU time there wouldn't be
any indication of that.
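
One way to pull those lines out of a syslog-style file is sketched
below. It assumes the "Mon DD HH:MM:SS host daemon[pid]: message"
layout seen in the logs quoted in this thread; the function name and
the year parameter are assumptions for illustration, since that
timestamp format carries no year.

```python
from datetime import datetime


def daemon_lines_before(lines, daemon, cutoff, year=2019):
    """Return lines logged by `daemon` strictly before `cutoff`.

    `cutoff` uses the same "Dec 30 10:02:37" form as the log lines.
    `year` is an assumption supplied by the caller, because syslog
    timestamps of this form do not include one.
    """
    fmt = "%Y %b %d %H:%M:%S"
    limit = datetime.strptime(f"{year} {cutoff}", fmt)
    selected = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # too short to be a syslog-style entry
        try:
            stamp = datetime.strptime(f"{year} {' '.join(parts[:3])}", fmt)
        except ValueError:
            continue  # continuation line or unexpected format
        # parts[3] is the host name; parts[4] looks like "pacemaker-based[3101]:"
        if parts[4].startswith(daemon + "[") and stamp < limit:
            selected.append(line)
    return selected
```

For example, passing the lines of the system log file together with
"pacemaker-based" and "Dec 30 10:02:37" would return what the CIB
manager logged before the first "Timer expired" error.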

It is possible to set the timeout explicitly using the PCMK_cib_timeout
environment variable, but the underlying problem would be likely to
cause other issues.

> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_ERROR received in state S_IDLE from crmd_node_update_complete
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State
> transition S_IDLE -> S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-tracking
> shutdown in response to errors
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting
> in election, we're in state S_RECOVERY
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_ERROR received in state S_RECOVERY from node_list_update_callback
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input
> I_TERMINATE received in state S_RECOVERY from do_recover
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0
> recurring operations at shutdown (12 remaining)
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring
> action XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at
> shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources
> were active at shutdown
> Dec 30 10:02:37 vmc0621 pacemaker-controld[75

[ClusterLabs] Primary/secondary setup with MariaDB

2020-01-06 Thread Kees de Jong
Hi,




A month ago I created an issue about the MariaDB promotable resources
in Pacemaker [1]. I haven't received a reply yet.

What's the recommended method for MariaDB failover with Pacemaker?
I've also seen solutions using DRBD, but if a primary/secondary setup
should work as well, I'd prefer that. Furthermore, if I'm doing
anything wrong with the `pcs` command, please let me know.

Replication works fine; however, Pacemaker doesn't promote the
secondary when the primary goes down.


[1] https://github.com/ClusterLabs/resource-agents/issues/1441





-- 
Kind regards,
Kees de Jong

OpenPGP fingerprint: 0x0E45C98AB51428E6



Re: [ClusterLabs] [Announce] libqb 1.9.0 released

2020-01-06 Thread Christine Caulfield
Hi Yan,

I'm just back from the break; I'll look into the issues you've raised.
That's why we do release candidates :)

Chrissie

On 13/12/2019 15:00, Yan Gao wrote:
> Hi Christine,
> 
> Congratulations and thanks for the release!
> 
> As previously brought up in:
> https://github.com/ClusterLabs/libqb/issues/338#issuecomment-503155816
> 
> , the master branch has this too:
> 
> https://github.com/ClusterLabs/libqb/commit/6a4067c1d1764d93d255eccecfd8bf9f43cb0b4d
> 
> , but doesn't seem to have:
> 
> https://github.com/ClusterLabs/libqb/pull/349
> 
> Does it mean the master branch is somehow not impacted by the issues, or 
> some other solutions are being sought there? Thanks.
> 
> Regards,
>Yan
> 
> 
> 
> On 12/12/19 5:37 PM, christine caulfield wrote:
>> We are pleased to announce the release of libqb 1.9.0 - this is a 
>> release candidate for a future 2.0 release
>>
>>
>> Source code is available at:
>> https://github.com/ClusterLabs/libqb/releases/download/1.9.0/libqb-1.9.0.tar.xz
>>  
>>
>>
>> Please use the signed .tar.gz or .tar.xz files that include the version
>> number, rather than the GitHub-generated "Source Code" ones.
>>
>> There are a small number of new features:
>>
>>      high resolution logging (millisecond timestamps)
>>      systemd journal logging
>>      re-opening of log files under program control
>>
>> and many bug fixes.
>>
>> I've also removed the linker shenanigans that caused so much trouble 
>> with compatibility in the past, which is the main reason for making this 
>> 2.0.0 rather than 1.0.6
>>
>> Thanks to all the many people that made this possible.
>>
>> Chrissie
>>
>> shortlog:
>>
>> Chrissie Caulfield (27):
>> tests: Improve test isolation (#298)
>> test: Fix 'make distcheck' (#303)
>> ipc_shm: Don't truncate SHM files of an active server (#307)
>> Allow customisable log line length (#292)
>> log: Use RTLD_NOOPEN when checking symbols (#310)
>> UPDATED: doc (ABI comparison) and various other fixes (#324)
>> logging: Remove linker 'magic' and just use statics for logging 
>> callsites (#322)
>> log: Add option to re-open a log file (#326)
>> log: Add configure-time option to use systemd journal instead of syslog 
>> (#327)
>> Add the option of hi-res (millisecond) timestamps (#329)
>> log: Remove more dead code from linker callsites (#331)
>> tests: Shorted deadlock test names (#372)
>> make: Remove splint tests (#374)
>> skiplist: fix use-after-free in the skiplist traversal
>> skiplist: Fix previous skiplist fix
>> tests: allow blackbox-segfault.sh to run out-of-tree
>> ipc: use O_EXCL on SHM files, and randomize the names
>> ipc: fixes
>> ipc: use O_EXCL when opening IPC files
>> ipc: Use mkdtemp for more secure IPC files
>> ipc: Use mkdtemp for more secure IPC files
>> version: update version-info for 1.0.4 release
>> version: bump soname for 1.0.5 release
>> ipc: fix force-filesystem-sockets
>> tests: Speed up IPC tests, especially on FreeBSD
>> ipc: Remove kqueue EOF log message
>> lib: Fix some minor warnings from newer compilers
>>
>> Daniel Black (4):
>> tests: blackbox-segfault test - remove residual core files
>> CI: travis: show logs of test failures
>> build: split hack for splint to work on non-x86 architectures
>> build: dpkg-architecture on trusty (cf. Travis CI) uses -q{NAME}
>>
>> Fabio M. Di Nitto (8):
>> tests: use RUNPATH instead of RPATH consistently (#309)
>> [build] fix supported compiler warning detection (#330)
>> [test-rpm] build test binaries by default
>> [tests] export SOCKETDIR from tests/Makefile.am
>> [tests] allow installation of test suite
>> [tests] enable building / shipping of libqb-tests.rpm
>> [tests] first pass at fixing test execution
>> [build] add --with-sanitizers= option for sanitizer builds (#366)
>>
>> Ferenc Wágner (8):
>> Fix spelling: plaform -> platform
>> Fix garbled Doxygen markup
>> Errors are represented as negative values
>> Allow group access to the IPC directory
>> Make it impossible to truncate or overflow the connection description
>> Let remote_tempdir() assume a NUL-terminated name
>> doc: qbarray.h: remove stray asterisk and parentheses
>> doc: qbarray: reword comment about index partitioning
>>
>> Jan Friesse (2):
>> ipc: Fix named socket unlink on FreeBSD
>> ipc: Always initialize response struct
>>
>> Jan Pokorný (15):
>> build: fix configure script neglecting, re-enable out-of-tree builds
>> build: configure: fix non-portable '\s' and '//{q}' in sed expression
>> build: allow for being consumed in a (non-endorsed) form of snapshots
>> build: configure: fix "snapshot consumption" feature on FreeBSD
>> tests: ipc: avoid problems when UNIX_PATH_MAX (108) limits is hit
>> tests: ipc: speed the suite up with avoiding expendable sleep(3)s
>> tests: ipc: allow for easier tests debugging by discerning PIDs/roles
>> tests: ipc: refactor/split test_ipc_dispatch part into client_dispatch
>> tests: ipc: check deadlock-like situation due to mixing priorities
>> IPC: server: avoid temporary channel priority loss, up to