[Pacemaker] solaris problem
Hi folks,

I'm trying to build a test HA cluster on Solaris 5.11 using libqb 0.14.4, corosync 2.3.0 and pacemaker 1.1.8, and I'm facing a strange problem while starting pacemaker. The log shows the following errors:

Mar 25 09:21:26 [33720] lrmd: error: mainloop_add_ipc_server: Could not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720] lrmd: error: try_server_create: New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
Mar 25 09:21:26 [33720] lrmd: error: mainloop_add_ipc_server: Could not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720] lrmd: error: try_server_create: New IPC server could not be created because another lrmd process exists, sending shutdown command to old lrmd process.
[...]
Mar 25 09:21:26 [33720] lrmd: error: main: Failed to allocate lrmd server.
shutting down
Mar 25 09:21:26 [33722] pengine: error: mainloop_add_ipc_server: Could not start pengine IPC server: Unknown error (-48)
Mar 25 09:21:26 [33722] pengine: error: main: Couldn't start IPC server
Mar 25 09:21:26 [33717] pacemakerd: error: pcmk_child_exit: Child process lrmd exited (pid=33720, rc=255)
Mar 25 09:21:26 [33721] attrd: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/attrd): Permission denied (13)
Mar 25 09:21:26 [33721] attrd: error: mainloop_add_ipc_server: Could not start attrd IPC server: Unknown error (-13)
Mar 25 09:21:26 [33721] attrd: error: main: Could not start IPC server
Mar 25 09:21:26 [33721] attrd: error: main: Aborting startup
Mar 25 09:21:26 [33717] pacemakerd: error: pcmk_child_exit: Child process pengine exited (pid=33722, rc=1)
Mar 25 09:21:26 [33717] pacemakerd: error: pcmk_child_exit: Child process attrd exited (pid=33721, rc=100)
Mar 25 09:21:26 [33718] cib: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
Mar 25 09:21:26 [33718] cib: error: mainloop_add_ipc_server: Could not start cib_ro IPC server: Unknown error (-13)
Mar 25 09:21:26 [33718] cib: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
Mar 25 09:21:26 [33718] cib: error: mainloop_add_ipc_server: Could not start cib_rw IPC server: Unknown
[Pacemaker] Patrik Rapposch is out of the office
I will be out of the office from 25.03.2013. I will return on 27.03.2013.

Dear Sir or Madam, I am on a business trip until and including 27.03. Nevertheless, I will try to answer your request as quickly as possible. Please always put ksi.network in copy.

Please note that I am on a business trip till 27.03.13. Always use ksi.netw...@knapp.com, which ensures that one of our network administrators takes care of your interest. If you need operating system support please contact ksi...@knapp-systems.com.
Re: [Pacemaker] solaris problem
With Solaris/OpenIndiana you should use this setting:

export PCMK_ipc_type=socket

Andreas

-----Original Message-----
From: Andrei Belov [mailto:defana...@gmail.com]
Sent: Monday, 25 March 2013 10:43
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] solaris problem

> Hi folks, I'm trying to build a test HA cluster on Solaris 5.11 using libqb 0.14.4, corosync 2.3.0 and pacemaker 1.1.8, and I'm facing a strange problem while starting pacemaker. The log shows the following errors:
>
> Mar 25 09:21:26 [33720] lrmd: error: mainloop_add_ipc_server: Could not start lrmd IPC server: Unknown error (-48)
> [...]
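For clarity, a minimal sketch of applying the suggested setting before starting the stack by hand; the commands are the ones used elsewhere in this thread, and the -fV foreground/verbose flags are only for testing:

# start corosync and pacemakerd with socket-based IPC (binaries assumed in $PATH)
export PCMK_ipc_type=socket
corosync
pacemakerd -fV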
Re: [Pacemaker] solaris problem
Andreas,

just tried PCMK_ipc_type=socket pacemakerd -fV - a bunch of additional event_send errors appeared:

Mar 25 11:15:55 [33641] ha1 corosync error [MAIN ] event_send retuned -32, expected 256!
Mar 25 11:15:55 [33641] ha1 corosync error [SERV ] event_send retuned -32, expected 217!
Mar 25 11:15:55 [33641] ha1 corosync error [SERV ] event_send retuned -32, expected 219!
Mar 25 11:15:55 [33641] ha1 corosync error [SERV ] event_send retuned -32, expected 223!
[...]
Mar 25 11:15:55 [53980] pengine: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/pengine): Permission denied (13)
Mar 25 11:15:55 [53980] pengine: error: mainloop_add_ipc_server: Could not start pengine IPC server: Unknown error (-13)
Mar 25 11:15:55 [53980] pengine: error: main: Couldn't start IPC server
Mar 25 11:15:55 [53975] pacemakerd: error: pcmk_child_exit: Child process pengine exited (pid=53980, rc=1)
Mar 25 11:15:55 [53979] attrd: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/attrd): Permission denied (13)
Mar 25 11:15:55 [53979] attrd: error: mainloop_add_ipc_server: Could not start attrd IPC server: Unknown error (-13)
Mar 25 11:15:55 [53979] attrd: error: main: Could not start IPC server
Mar 25 11:15:55 [53979] attrd: error: main: Aborting startup
Mar 25 11:15:55 [53975] pacemakerd: error: pcmk_child_exit: Child process attrd exited (pid=53979, rc=100)
Mar 25 11:15:55 [53976] cib: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
Mar 25 11:15:55 [53976] cib: error: mainloop_add_ipc_server: Could not start cib_ro IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976] cib: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
Mar 25 11:15:55 [53976] cib: error: mainloop_add_ipc_server: Could not start cib_rw IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976] cib: error: qb_ipcs_us_publish: Could not bind AF_UNIX (/var/run/cib_shm): Permission denied (13)
Mar 25 11:15:55 [53976] cib: error: mainloop_add_ipc_server: Could not start cib_shm IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976] cib: error: cib_init: Couldnt start all IPC channels, exiting.
Mar 25 11:15:55 [53975] pacemakerd: error: pcmk_child_exit: Child process cib exited (pid=53976, rc=255)
Mar 25 11:15:55 [33641] ha1 corosync error [SERV ] event_send retuned -32, expected 223!
Mar 25 11:16:04 [53977] stonith-ng: error: setup_cib: Could not connect to the CIB service: -134 fd7fc421a0b0
Mar 25 11:16:04 [33641] ha1 corosync error [SERV ] event_send retuned -32, expected 217!
Mar 25 11:16:04 [53975] pacemakerd: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error

# fgrep 32 /usr/include/sys/errno.h
#define EPIPE 32 /* Broken pipe */

On Mar 25, 2013, at 13:55 , Grüninger, Andreas (LGL Extern) andreas.gruenin...@lgl.bwl.de wrote:

> With Solaris/OpenIndiana you should use this setting: export PCMK_ipc_type=socket
> [...]
Re: [Pacemaker] CMAN, corosync pacemaker
On 2013-03-21T15:28:17, Leon Fauster leonfaus...@googlemail.com wrote:

> > I believe the preferred pacemaker-based HA configuration in RHEL 6.4 uses all three packages, and the preferred configuration in SLES 11 SP2 is just corosync/pacemaker (I do not believe CMAN is even available in SLE-HAE). Why the different approaches, and what is the advantage of each configuration?
>
> cman is part of the RH cluster suite - the motivations behind the two vendors' different HA approaches are surely based more on strategic/marketing needs.

There is also a technical point: cman/pacemaker was not available when SLE HA 11 was created, and it is not possible to do a rolling upgrade to that configuration. Hence, cman will not appear in SLE HA 11. If there is a technical benefit to the approach, we will reevaluate this for SLE HA 12.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Linking lib/cib and lib/pengine to each other?
23.03.2013 08:27, Viacheslav Dubrovskyi wrote:

> Hi.
>
> I'm building a package for my distribution. Everything builds, but the package does not pass our internal tests. I get errors like this:
>
> verify-elf: ERROR: ./usr/lib/libpe_status.so.4.1.0: undefined symbol: get_object_root
>
> This means that libpe_status.so is not linked with libcib.so, where get_object_root is defined. I can easily fix it by adding
>
> libpe_status_la_LIBADD = $(top_builddir)/lib/cib/libcib.la
>
> in lib/pengine/Makefile.am. But for that I need to build libcib before lib/pengine, which is impossible too, because libcib uses symbols from lib/pengine. So we have a situation where two libraries must be linked to each other. This is very bad, because in fact they should then be one library, or the shared symbols should be put into a third library, such as common. Can anyone comment on this situation?

Patch to fix this error:

--
WBR,
Viacheslav Dubrovskyi

diff --git a/Makefile.am b/Makefile.am
index 4f742e4..fdf19eb 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -23,7 +23,7 @@ EXTRA_DIST = autogen.sh ConfigureMe README.in libltdl.tar m4/gnulib
 MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure DRF/config-h.in \
 	DRF/stamp-h.in libtool.m4 ltdl.m4 libltdl.tar
-CORE = $(LIBLTDL_DIR) replace include lib mcp pengine cib crmd fencing lrmd tools xml
+CORE = $(LIBLTDL_DIR) replace include lib mcp cib pengine crmd fencing lrmd tools xml
 SUBDIRS = $(CORE) cts extra doc
 doc_DATA = AUTHORS COPYING COPYING.LIB
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 5563819..4ebd91b 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -39,7 +39,7 @@ clean-local:
 	rm -f *.pc
 ## Subdirectories...
-SUBDIRS = gnu common pengine transition cib fencing services lrmd cluster
+SUBDIRS = gnu common cib pengine transition fencing services lrmd cluster
 DIST_SUBDIRS = $(SUBDIRS) ais
 if BUILD_CS_PLUGIN
diff --git a/lib/cib/Makefile.am b/lib/cib/Makefile.am
index 6ab02fc..c73a329 100644
--- a/lib/cib/Makefile.am
+++ b/lib/cib/Makefile.am
@@ -32,11 +32,14 @@ if ENABLE_ACL
 libcib_la_SOURCES += cib_acl.c
 endif
-libcib_la_LDFLAGS = -version-info 3:0:0 $(top_builddir)/lib/common/libcrmcommon.la $(CRYPTOLIB) \
-	$(top_builddir)/lib/pengine/libpe_rules.la
-
+libcib_la_LDFLAGS = -version-info 3:0:0 -L$(top_builddir)/lib/pengine/.libs
+libcib_la_LIBADD = $(CRYPTOLIB) $(top_builddir)/lib/pengine/libpe_rules.la $(top_builddir)/lib/common/libcrmcommon.la
 libcib_la_CFLAGS = -I$(top_srcdir)
+libcib_la_DEPENDENCIES = libpe_rules
+libpe_rules:
+	make -C ../../lib/pengine libpe_rules.la
+
 clean-generic:
 	rm -f *.log *.debug *.xml *~
diff --git a/lib/pengine/Makefile.am b/lib/pengine/Makefile.am
index 9cb2392..a173522 100644
--- a/lib/pengine/Makefile.am
+++ b/lib/pengine/Makefile.am
@@ -28,10 +28,11 @@ noinst_HEADERS = unpack.h variant.h
 libpe_rules_la_LDFLAGS = -version-info 2:2:0
 libpe_rules_la_SOURCES = rules.c common.c
+libpe_rules_la_LIBADD = $(top_builddir)/lib/common/libcrmcommon.la
 libpe_status_la_LDFLAGS = -version-info 5:0:1
 libpe_status_la_SOURCES = status.c unpack.c utils.c complex.c native.c group.c clone.c rules.c common.c
-libpe_status_la_LIBADD = @CURSESLIBS@
+libpe_status_la_LIBADD = @CURSESLIBS@ $(top_builddir)/lib/common/libcrmcommon.la $(top_builddir)/lib/cib/libcib.la
 clean-generic:
 	rm -f *.log *.debug *~
diff --git a/lib/services/Makefile.am b/lib/services/Makefile.am
index 3ee3347..ef8fbc3 100644
--- a/lib/services/Makefile.am
+++ b/lib/services/Makefile.am
@@ -26,7 +26,7 @@ noinst_HEADERS = upstart.h systemd.h services_private.h
 libcrmservice_la_SOURCES = services.c services_linux.c
 libcrmservice_la_LDFLAGS = -version-info 1:0:0
 libcrmservice_la_CFLAGS = $(GIO_CFLAGS)
-libcrmservice_la_LIBADD = $(GIO_LIBS)
+libcrmservice_la_LIBADD = $(GIO_LIBS) $(top_builddir)/lib/common/libcrmcommon.la
 if BUILD_UPSTART
 libcrmservice_la_SOURCES += upstart.c
diff --git a/lib/transition/Makefile.am b/lib/transition/Makefile.am
index 49c7113..7279c59 100644
--- a/lib/transition/Makefile.am
+++ b/lib/transition/Makefile.am
@@ -29,6 +29,7 @@ libtransitioner_la_SOURCES = unpack.c graph.c utils.c
 libtransitioner_la_LDFLAGS = -version-info 2:0:0
 libtransitioner_la_CFLAGS = -I$(top_builddir)
+libtransitioner_la_LIBADD = $(top_builddir)/lib/common/libcrmcommon.la
 clean-generic:
 	rm -f *~
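A quick way to confirm the undefined-symbol report before and after applying such a patch, as a sketch using standard binutils tools; the library path is taken from the verify-elf error above:

# "U" in the nm output means the symbol is undefined in this object
nm -D ./usr/lib/libpe_status.so.4.1.0 | grep get_object_root
# ldd -r performs function relocations and reports any unresolved symbols
ldd -r ./usr/lib/libpe_status.so.4.1.0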
Re: [Pacemaker] solaris problem
I've rebuilt libqb using a separate SOCKETDIR (/var/run/qb), and set hacluster:haclient ownership on this dir. After that, pacemakerd started successfully with all its children:

[root@ha1 /var/run/qb]# pacemakerd -fV
Could not establish pacemakerd connection: Connection refused (146)
info: crm_ipc_connect: Could not establish pacemakerd connection: Connection refused (146)
info: get_cluster_type: Detected an active 'corosync' cluster
info: read_config: Reading configure for stack: corosync
notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
notice: main: Starting Pacemaker 1.1.8 (Build: 1f8858c): ncurses libqb-logging libqb-ipc upstart systemd corosync-native
info: main: Maximum core file size is: 18446744073709551613
info: qb_ipcs_us_publish: server name: pacemakerd
notice: update_node_processes: 48de70 Node 182452614 now known as ha1, was:
info: start_child: Forked child 60719 for process cib
info: start_child: Forked child 60720 for process stonith-ng
info: start_child: Forked child 60721 for process lrmd
info: start_child: Forked child 60722 for process attrd
info: start_child: Forked child 60723 for process pengine
info: start_child: Forked child 60724 for process crmd
info: main: Starting mainloop

[root@ha1 /var/run/qb]# ls -l
total 0
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 attrd
srwxrwxrwx 1 root      root 0 Mar 25 11:43 cfg
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_ro
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_rw
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_shm
srwxrwxrwx 1 root      root 0 Mar 25 11:43 cmap
srwxrwxrwx 1 root      root 0 Mar 25 11:43 cpg
srwxrwxrwx 1 root      root 0 Mar 25 11:50 lrmd
srwxrwxrwx 1 root      root 0 Mar 25 11:50 pacemakerd
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 pengine
srwxrwxrwx 1 root      root 0 Mar 25 11:43 quorum
srwxrwxrwx 1 root      root 0 Mar 25 11:50 stonith-ng

However, libqb still can not create some files in /var/run due to insufficient permissions:

Mar 25 11:50:45 [60719] cib: info: init_cs_connection_once: Connection to 'corosync': established
Mar 25 11:50:45 [60719] cib: info: crm_get_peer: Node 182452614 is now known as ha1
Mar 25 11:50:45 [60719] cib: info: crm_get_peer: Node 182452614 has uuid 182452614
Mar 25 11:50:45 [60719] cib: info: qb_ipcs_us_publish: server name: cib_ro
Mar 25 11:50:45 [60719] cib: info: qb_ipcs_us_publish: server name: cib_rw
Mar 25 11:50:45 [60719] cib: info: qb_ipcs_us_publish: server name: cib_shm
Mar 25 11:50:45 [60719] cib: info: cib_init: Starting cib mainloop
Mar 25 11:50:45 [60719] cib: info: pcmk_cpg_membership: Joined[0.0] cib.182452614
Mar 25 11:50:45 [60719] cib: info: pcmk_cpg_membership: Member[0.0] cib.182452614
Mar 25 11:50:45 [60719] cib: info: pcmk_cpg_membership: Member[0.1] cib.182452614
Mar 25 11:50:46 [60719] cib: error: qb_sys_mmap_file_open: couldn't open file /var/run/qb-cib_rw-control-60719-60720-15: Permission denied (13)
Mar 25 11:50:46 [60719] cib: error: qb_ipcs_us_connect: couldn't create file for mmap (60719-60720-15): Permission denied (13)
Mar 25 11:50:46 [60719] cib: error: handle_new_connection: Invalid IPC credentials (60719-60720-15).
Mar 25 11:50:46 [60720] stonith-ng: info: crm_ipc_connect: Could not establish cib_rw connection: Permission denied (13)
Mar 25 11:50:46 [60719] cib: error: qb_sys_mmap_file_open: couldn't open file /var/run/qb-cib_shm-control-60719-60724-16: Permission denied (13)
Mar 25 11:50:46 [60719] cib: error: qb_ipcs_us_connect: couldn't create file for mmap (60719-60724-16): Permission denied (13)
Mar 25 11:50:46 [60719] cib: error: handle_new_connection: Invalid IPC credentials (60719-60724-16).
Mar 25 11:50:46 [60724] crmd: info: crm_ipc_connect: Could not establish cib_shm connection: Permission denied (13)
Mar 25 11:50:46 [60724] crmd: info: do_cib_control: Could not connect to the CIB service: Transport endpoint is not connected
Mar 25 11:50:46 [60724] crmd: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry

If someone has a working setup on Linux with corosync 2.x, libqb and pacemaker 1.1.x, I'd very much appreciate some information about the places libqb uses for its special socket files. Thanks in advance!

(Can we say now that this problem is libqb-related, not pacemaker?)

On Mar 25, 2013, at 15:30 , Andrei Belov defana...@gmail.com wrote:

> Andreas, just tried PCMK_ipc_type=socket pacemakerd -fV - a bunch of additional event_send errors appeared:
> [...]
[Pacemaker] DRBD+LVM+NFS problems
Hi,

I'm currently trying to create a two-node redundant NFS setup on CentOS 6.4 using pacemaker and crmsh. I'm using this document as a starting point:

https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html

The first issue is that, using these instructions, I get the cluster up and running, but the moment I try to stop the pacemaker service on the current master node, several resources just fail and everything goes pear-shaped. Since the problem seemed to relate to the NFS bits of the configuration, I removed those to get to a minimal working setup, and am now adding things back piece by piece to find the source of the problem.

Now I am at a point where I basically have only DRBD+LVM+Filesystems+IPaddr2 configured, and now LVM seems to act up. I can start the cluster and everything is fine, but the moment I stop pacemaker on the master I end up with this status:

===
Node nfs2: standby
Online: [ nfs1 ]

Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
    Masters: [ nfs1 ]
    Stopped: [ p_drbd_nfs:1 ]

Failed actions:
    p_lvm_nfs_start_0 (node=nfs1, call=505, rc=1, status=complete): unknown error
===

and in the log on nfs1 I see:

LVM(p_lvm_nfs)[7515]: 2013/03/25_12:34:21 ERROR: device-mapper: reload ioctl on failed: Invalid argument device-mapper: reload ioctl on failed: Invalid argument 2 logical volume(s) in volume group "nfs" now active

However, lvs in this state shows:

[root@nfs1 ~]# lvs
  LV      VG            Attr      LSize
  web1    nfs           -wi------   2,00g
  web2    nfs           -wi------   2,00g
  lv_root vg_nfs1.local -wi-ao---   2,45g
  lv_swap vg_nfs1.local -wi-ao--- 256,00m

So the volume group is present. My current configuration looks like this:

node nfs1 \
	attributes standby=off
node nfs2 \
	attributes standby=on
primitive p_drbd_nfs ocf:linbit:drbd \
	params drbd_resource=nfs \
	op monitor interval=15 role=Master \
	op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
	params device=/dev/nfs/web1 directory=/srv/nfs/web1 fstype=ext4 \
	op monitor interval=10s
primitive p_fs_web2 ocf:heartbeat:Filesystem \
	params device=/dev/nfs/web2 directory=/srv/nfs/web2 fstype=ext4 \
	op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
	params ip=10.99.0.142 cidr_netmask=24 \
	op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
	params volgrpname=nfs \
	op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_web1 p_fs_web2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
	meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
property $id=cib-bootstrap-options \
	dc-version=1.1.8-7.el6-394e906 \
	cluster-infrastructure="classic openais (with plugin)" \
	expected-quorum-votes=2 \
	stonith-enabled=false \
	no-quorum-policy=ignore \
	last-lrm-refresh=1364212090 \
	maintenance-mode=false
rsc_defaults $id=rsc_defaults-options \
	resource-stickiness=100

Any ideas why this isn't working?

Regards,
  Dennis
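For reference, a few standard LVM/device-mapper diagnostics that could narrow down a failed activation like the one above; this is only a sketch, the VG name "nfs" is taken from the post, and everything else is stock tooling:

# is the VG visible, and which physical devices back its LVs?
vgs nfs
lvs -o +devices nfs
# inspect the device-mapper tables behind the failing dm-linear target
dmsetup table
# retry activation by hand to reproduce the reload ioctl error outside the RA
vgchange -a y nfs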
Re: [Pacemaker] DRBD+LVM+NFS problems
I just found the following in the dmesg output, which might or might not add to understanding the problem:

device-mapper: table: 253:2: linear: dm-linear: Device lookup failed
device-mapper: ioctl: error adding target to table

Regards,
  Dennis

On 25.03.2013 13:04, Dennis Jacobfeuerborn wrote:

> Hi,
> I'm currently trying to create a two-node redundant NFS setup on CentOS 6.4 using pacemaker and crmsh.
> [...]
Re: [Pacemaker] racing crm commands... last write wins?
On Wed, Mar 20, 2013 at 10:40:10AM -0700, Bob Haxo wrote:

> Regarding the "replace" triggering a DC election ... which is causing issues with scripted installs ... how do I determine which crm commands will NOT trigger this election?

It seems like every configure commit could possibly result in a new election. But I'm not sure what it depends on.

> I need a way of avoiding this election while installing. I'm finding that when repeating the scripted install with the same commands, sometimes the DC election gets triggered and sometimes it does not.

With the same configuration updates?

> With the DC election, these messages get logged, followed by the whole xml version of the configuration.
>
> Call cib_replace failed (-62): Timer expired

This is a problem connecting to the cib process, i.e. it's not related to a configuration update (as it cannot proceed anyway).

> ERROR: 55: could not replace cib
> INFO: 55: offending xml: configuration
>
> Any suggestions for avoiding "replacing" rather than incrementally modifying the configuration?

Not right now. The configuration update process in crmsh needs to be modified.

Thanks,
Dejan

> On Mon, 2013-03-04 at 17:25 +0100, Lars Marowsky-Bree wrote:
> > On 2013-03-04T17:14:28, Dejan Muhamedagic deja...@fastmail.fm wrote:
> >
> > > Thought so at the time, yes. And I do think it cleaned up a few things, we just need to improve it.
> > > The full CIB replace also seems to trigger an election ... I think that used to happen in Heartbeat clusters but got fixed in the meantime, the details are a bit foggy.
> >
> > No, if you look at the current logs on the DC, you'll also see this happening. I think it's the replace of the node section that triggers it.
> >
> > > Then most of the logic in crmsh would remain unchanged (i.e., it'd still operate on whole CIBs only), but the way it passes them on to Pacemaker would improve. I hope.
> > > crmsh currently doesn't keep the original copy of the CIB.
> >
> > Right, but that should be a simple thing to add and prototype quickly. (Says he who isn't going to be the one doing it ;-)
> >
> > > Anyway, this approach is worth investigating.
> >
> > Thanks, let me know how it goes!
> >
> > Regards,
> >     Lars
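To make the replace-versus-incremental distinction concrete, here is a sketch of both update styles; the file names are hypothetical, and whether the incremental paths avoid the election in a given Pacemaker version is exactly what the thread leaves open:

# incremental: merge only the changed objects into the live CIB
crm configure load update /tmp/delta.crm
cibadmin --modify --xml-file /tmp/delta.xml

# wholesale: push a complete CIB, the operation that triggers cib_replace
cibadmin --replace --xml-file /tmp/full-cib.xml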
[Pacemaker] stonith and avoiding split brain in two nodes cluster
Hello,

I am a newbie with pacemaker (and, generally, with HA clusters). I have configured a two-node cluster. Both nodes are virtual machines (VMware ESX) and use shared storage (provided by a SAN, although access to the SAN goes through the ESX infrastructure, and the VMs see it as a SCSI disk). I have configured clvm so logical volumes are only active on one of the nodes.

Now I need some help with the stonith configuration to avoid data corruption. Since I'm using ESX virtual machines, I think I won't have any problem using the external/vcenter stonith plugin to shut down virtual machines.

My problem is how to avoid a split brain situation with this configuration, without configuring a 3rd node. I have read about quorum disks, the external/sbd stonith plugin and other references, but I'm too confused by all this. For example, [1] mentions techniques to improve quorum with scsi reserve or a quorum daemon, but it doesn't explain how to do this with pacemaker. Or [2] talks about external/sbd. Any help?

PS: I have attached my corosync.conf and "crm configure show" outputs.

[1] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[2] http://www.gossamer-threads.com/lists/linuxha/pacemaker/78887

--
Angel L. Mateo Martínez
Sección de Telemática
Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA)
http://www.um.es/atica
Tfo: 868889150
Fax: 86337

# Please read the openais.conf.5 manual page
totem {
	version: 2

	# How long before declaring a token lost (ms)
	token: 3000

	# How many token retransmits before forming a new configuration
	token_retransmits_before_loss_const: 10

	# How long to wait for join messages in the membership protocol (ms)
	join: 60

	# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
	consensus: 3600

	# Turn off the virtual synchrony filter
	vsftype: none

	# Number of messages that may be sent by one processor on receipt of the token
	max_messages: 20

	# Limit generated nodeids to 31-bits (positive signed integers)
	clear_node_high_bit: yes

	# Disable encryption
	secauth: off

	# How many threads to use for encryption/decryption
	threads: 0

	# Optionally assign a fixed node id (integer)
	# nodeid: 1234

	# This specifies the mode of redundant ring, which may be none, active, or passive.
	rrp_mode: none

	interface {
		# The following values need to be set based on your environment
		ringnumber: 0
		bindnetaddr: 155.54.211.160
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}

amf {
	mode: disabled
}

service {
	# Load the Pacemaker Cluster Resource Manager
	ver: 1
	name: pacemaker
}

aisexec {
	user: root
	group: root
}

logging {
	fileline: off
	to_stderr: yes
	to_logfile: no
	to_syslog: yes
	syslog_facility: daemon
	debug: off
	timestamp: on
	logger_subsys {
		subsys: AMF
		debug: off
		tags: enter|leave|trace1|trace2|trace3|trace4|trace6
	}
}

node myotis51
node myotis52
primitive clvm ocf:lvm2:clvmd \
	params daemon_timeout=30 \
	meta target-role=Started
primitive dlm ocf:pacemaker:controld \
	meta target-role=Started
primitive vg_users1 ocf:heartbeat:LVM \
	params volgrpname=UsersDisk exclusive=yes \
	op monitor interval=60 timeout=60
group dlm-clvm dlm clvm
clone dlm-clvm-clone dlm-clvm \
	meta interleave=true ordered=true target-role=Started
location cli-prefer-vg_users1 vg_users1 \
	rule $id=cli-prefer-rule-vg_users1 inf: #uname eq myotis52
property $id=cib-bootstrap-options \
	dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
	cluster-infrastructure=openais \
	expected-quorum-votes=2 \
	stonith-enabled=false \
	no-quorum-policy=ignore \
	last-lrm-refresh=1364212376
rsc_defaults $id=rsc-options \
	resource-stickiness=100
Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster
I have a production cluster using two VMs on an ESX cluster; for stonith I'm using sbd, and everything works fine.

2013/3/25 emmanuel segura emi2f...@gmail.com

> I have a production cluster, using two VMs on an ESX cluster, for stonith I'm using sbd, everything works fine
> [...]

2013/3/25 Angel L. Mateo ama...@um.es

> Hello,
> I am a newbie with pacemaker (and, generally, with HA clusters). I have configured a two-node cluster.
> [...]

--
this is my life and I live it for as long as God wills
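For completeness, a sketch of what an sbd setup like the one recommended here can look like; the device path is hypothetical, the SBD_DEVICE variable is the SUSE-style sysconfig convention, and the crmsh lines use the external/sbd plugin mentioned earlier in the thread:

# initialize the sbd message slots on the shared LUN (device path assumed)
sbd -d /dev/disk/by-id/scsi-SHARED-LUN create
# then point the sbd daemon at it, e.g. in /etc/sysconfig/sbd:
#   SBD_DEVICE="/dev/disk/by-id/scsi-SHARED-LUN"

# register the fencing resource and enable stonith
crm configure primitive st-sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/scsi-SHARED-LUN"
crm configure property stonith-enabled=true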
Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster
I have a production cluster using two VMs on an ESX cluster; for stonith I'm using sbd, and everything works fine.

2013/3/25 Angel L. Mateo ama...@um.es

> Hello,
> I am a newbie with pacemaker (and, generally, with HA clusters). I have configured a two-node cluster.
> [...]

--
this is my life and I live it for as long as God wills
Re: [Pacemaker] solaris problem
Ok, I fixed this issue with the following patch against libqb 0.14.4:

--- lib/unix.c.orig	2013-03-25 12:30:50.445762231 +0000
+++ lib/unix.c	2013-03-25 12:49:59.322276376 +0000
@@ -83,7 +83,7 @@
 #if defined(QB_LINUX) || defined(QB_CYGWIN)
 		snprintf(path, PATH_MAX, "/dev/shm/%s", file);
 #else
-		snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+		snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
 		is_absolute = path;
 #endif
 	}
@@ -91,7 +91,7 @@
 	if (fd < 0 && !is_absolute) {
 		qb_util_perror(LOG_ERR, "couldn't open file %s", path);
-		snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+		snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
 		fd = open_mmap_file(path, file_flags);
 		if (fd < 0) {
 			res = -errno;

libqb was configured with --with-socket-dir=/var/run/qb, and /var/run/qb is owned by hacluster:haclient - this configuration works fine with both corosync 2.3.0 and pacemaker 1.1.8.

Though I'm not sure that libqb is the right place to touch - maybe it'd be better to add some enhancements to pacemaker's lib/common/mainloop.c, mainloop_add_ipc_server()?

Cheers.

On Mar 25, 2013, at 16:01 , Andrei Belov defana...@gmail.com wrote:

> I've rebuilt libqb using a separate SOCKETDIR (/var/run/qb), and set hacluster:haclient ownership on this dir.
> [...]
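Summarizing the build and permission steps described in this message as a sketch; the configure flag and ownership are exactly those stated above, while the mkdir step and running make install as root are assumptions:

# rebuild libqb with a dedicated socket directory, then hand it to the cluster user
./configure --with-socket-dir=/var/run/qb
make && make install
mkdir -p /var/run/qb
chown hacluster:haclient /var/run/qb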
Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster
On Mon, 25 Mar 2013 13:54:22 +0100, Angel L. Mateo wrote:

> My problem is how to avoid a split brain situation with this configuration, without configuring a 3rd node. I have read about quorum disks, the external/sbd stonith plugin and other references, but I'm too confused by all this. For example, [1] mentions techniques to improve quorum with scsi reserve or a quorum daemon, but it doesn't explain how to do this with pacemaker. Or [2] talks about external/sbd. Any help?

With corosync 2.2 (2.1 too, I guess) you can use, in corosync.conf:

quorum {
	provider: corosync_votequorum
	expected_votes: 2
	two_node: 1
}

Corosync will then manage quorum for the two-node cluster and Pacemaker can use that. You still need proper fencing to enforce the quorum (both for pacemaker and the storage layer – dlm in case you use clvmd), but no extra quorum node is needed.

There is one more thing, though: you need two nodes active to boot the cluster, but then when one fails (and is fenced) the other may continue, keeping quorum.

Greets,
	Jacek
Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster
Jacek Konieczny jaj...@jajcus.net wrote:

> With corosync 2.2 (2.1 too, I guess) you can use, in corosync.conf:
>
> quorum {
> 	provider: corosync_votequorum
> 	expected_votes: 2
> 	two_node: 1
> }
>
> Corosync will then manage quorum for the two-node cluster and Pacemaker can use that.

I'm using corosync 1.1, which is the one provided with my distribution (Ubuntu 12.04). I could also use cman.

> You still need proper fencing to enforce the quorum (both for pacemaker and the storage layer – dlm in case you use clvmd), but no extra quorum node is needed.

I have configured a dlm resource used with clvm. One doubt... with this configuration, how is the split brain problem handled?

> There is one more thing, though: you need two nodes active to boot the cluster, but then when one fails (and is fenced) the other may continue, keeping quorum.

--
Sent from my Android phone with K-9 Mail.
Re: [Pacemaker] solaris problem
Andreas,

thank you for sharing this link and your start script!

My goal is to make it possible to build those tools via the more convenient NetBSD pkgsrc system. Perhaps using something like --localstatedir=${VARBASE}/cluster for libqb, corosync and pacemaker, and setting the appropriate permissions on /var/cluster, will solve the problem.

Thanks again!

On Mar 25, 2013, at 20:35 , Grüninger, Andreas (LGL Extern) andreas.gruenin...@lgl.bwl.de wrote:

Andrei

There is no need to make this change. I described in http://grueni.github.com/libqb/ how I compiled libqb and the other programs. LOCALSTATEDIR should be defined with ./configure. Please look at "Compile Corosync" in my description. I guess your start scripts should be changed. We use this as the start script called by the smf instance:

##################################
#!/usr/bin/bash
# Start/stop HACluster service
#
. /lib/svc/share/smf_include.sh

## Tracing with the debug version
# PCMK_trace_files=1
# PCMK_trace_functions=1
# PCMK_trace_formats=1
# PCMK_trace_tags=1

export PCMK_ipc_type=socket

CLUSTER_USER=hacluster
COROSYNC=corosync
PACEMAKERD=pacemakerd
PACEMAKER_PROCESSES=pacemaker
APPPATH=/opt/ha/sbin/
SLEEPINTERVALL=10
SLEEPCOUNT=5
SLEPT=0

killapp() {
    pid=`pgrep -f $1`
    if [ x$pid != x ]; then
        kill -9 $pid
    fi
    return 0
}

start0() {
    stop0
    su ${CLUSTER_USER} -c ${APPPATH}${COROSYNC}
    sleep $sleep0
    su ${CLUSTER_USER} -c ${APPPATH}${PACEMAKERD}
    return 0
}

stop0() {
    # first try, graceful shutdown
    pid=`pgrep -U ${CLUSTER_USER} -f ${PACEMAKERD}`
    if [ x$pid != x ]; then
        ${APPPATH}${PACEMAKERD} --shutdown
        sleep $SLEEPINTERVALL
    fi
    # second try, kill the rest
    killapp ${APPPATH}${COROSYNC}
    killapp ${PACEMAKER_PROCESSES}
    return 0
}

let sleep0=$SLEEPINTERVALL/2

case $1 in
'start')
    start0
    ;;
'restart')
    stop0
    start0
    ;;
'stop')
    stop0
    ;;
*)
    echo "Usage: $0 { start | stop | restart }"
    exit 1
    ;;
esac
exit 0
###################################

Andreas

-----Original Message-----
From: Andrei Belov [mailto:defana...@gmail.com]
Sent: Monday, 25 March 2013 15:08
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] solaris problem

> Ok, I fixed this issue with the following patch against libqb 0.14.4:
> [...]
Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster
On Mon, 25 Mar 2013 20:01:28 +0100, Angel L. Mateo ama...@um.es wrote:

> > quorum {
> > 	provider: corosync_votequorum
> > 	expected_votes: 2
> > 	two_node: 1
> > }
> >
> > Corosync will then manage quorum for the two-node cluster and Pacemaker
>
> I'm using corosync 1.1 which is the one provided with my distribution (Ubuntu 12.04). I could also use cman.

I don't think corosync 1.1 can do that, but I guess in this case cman should be able to provide this functionality.

> > can use that. You still need proper fencing to enforce the quorum (both for pacemaker and the storage layer – dlm in case you use clvmd), but no extra quorum node is needed.
>
> I have configured a dlm resource used with clvm. One doubt... with this configuration, how is the split brain problem handled?

The first node to notice that the other is unreachable will fence (kill) the other, making sure it is the only one operating on the shared data. Even though it is only half of the nodes, the cluster is considered quorate, as the other node is known not to be running any cluster resources.

When the fenced node reboots, its cluster stack starts, but with no quorum until it communicates with the surviving node again. So no cluster services are started there until both nodes communicate properly and proper quorum is recovered.

Greets,
	Jacek
Re: [Pacemaker] OCF Resource agent promote question
Hi Steve,

On 2013-03-25 18:44, Steven Bambling wrote:

> All,
>
> I'm trying to work on an OCF resource agent that uses postgresql streaming replication. I'm running into a few issues that I hope might be answered, or at least some pointers given to steer me in the right direction.

Why are you not using the existing pgsql RA? It is capable of doing synchronous and asynchronous replication, and it is known to work fine.

Best regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> 1. A quick way of obtaining a list of Online nodes in the cluster that a resource will be able to migrate to. I've accomplished it with some grep and sed, but it's not pretty or fast:
>
> # time pcs status | grep Online | sed -e 's/.*\[\(.*\)\]/\1/' | sed 's/ //'
> p1.example.net p2.example.net
>
> real	0m2.797s
> user	0m0.084s
> sys	0m0.024s
>
> Once I get a list of active/online nodes in the cluster, my thinking was to use psql to get the current xlog location and lag of each of the remaining nodes and compare them. If a node has a greater log position and/or less lag, it will be given a greater master preference.
>
> 2. How to force a monitor/probe before a promote is run on ALL nodes, to make sure that the master preference is up to date before migrating/failing over the resource.
>
> I was thinking that maybe during the promote call it could get the log location and lag from each of the nodes via a psql call (like above) and then force the resource to a specific node. Is there a way to do this, and does it sound like a sane idea?
>
> The start of my RA is located here - suggestions and comments 100% welcome: https://github.com/smbambling/pgsqlsr/blob/master/pgsqlsr
>
> v/r
>
> STEVE
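As a sketch of the node-list and lag queries discussed above: crm_node is standard Pacemaker tooling, the pg_last_xlog_* functions are the PostgreSQL 9.x names, and the host name and postgres user are assumptions based on the post:

# list current partition members, much faster than parsing "pcs status"
crm_node -p

# query the replication position on a standby (PostgreSQL 9.x function names)
psql -h p2.example.net -U postgres -Atc "SELECT pg_last_xlog_receive_location();"
psql -h p2.example.net -U postgres -Atc "SELECT pg_last_xlog_replay_location();"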
Re: [Pacemaker] Resource is Too Active (on both nodes)
On 2013-03-22 21:35, Mohica Jasha wrote:
> Hey,
>
> I have two cluster nodes. I have a service process which is prone to crash and takes a very long time to start. Since the service process takes a long time to start, I have it running on both nodes, but only the active node with the virtual IP serves the incoming requests. On both nodes I have a cron job which periodically checks whether the service process is up and, if not, starts it. I want Pacemaker to periodically check whether the service is down on the active node and, if so, switch the virtual IP to the second node (without starting or stopping my service).
>
> I have the following configuration:
>
> primitive clusterIP ocf:heartbeat:IPaddr2 \
>     params ip=10.0.1.247 \
>     op monitor interval=10s timeout=20s
> primitive serviceMonitoring ocf:serviceMonitoring:serviceMonitoring \
>     op monitor interval=10s timeout=20s
> colocation HACluster inf: serviceMonitoring clusterIP
> order serviceMonitoring-after-clusterIP inf: clusterIP serviceMonitoring
>
> My serviceMonitoring resource doesn't do anything other than check the state of the service process. I get the following in the log file:
>
> Mar 05 15:07:59 [1543] ha1 pengine: notice: unpack_rsc_op: Operation monitor found resource serviceMonitoring active on ha2
> Mar 05 15:07:59 [1543] ha1 pengine: notice: unpack_rsc_op: Operation monitor found resource serviceMonitoring active on ha1
> Mar 05 15:07:59 [1543] ha1 pengine: error: native_create_actions: Resource serviceMonitoring (ocf::serviceMonitoring) is active on 2 nodes, attempting recovery
> Mar 05 15:07:59 [1543] ha1 pengine: warning: native_create_actions: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
>
> So it seems that Pacemaker calls the monitor method of the serviceMonitoring resource on both nodes.

Yes, it does a probing of the resources on all nodes ... clone your serviceMonitoring resource and set it to unmanaged mode, that should give you the desired behavior ... or simply clone it and let Pacemaker do the complete management and go without your cron-check-restart magic.

Regards,
Andreas

> Any idea how I can fix this?
>
> Thanks,
> Mohica

-- Need help with Pacemaker? http://www.hastexo.com/now
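A minimal sketch of Andreas's first suggestion in crm shell syntax, reusing the resource name from Mohica's config (is-managed=false keeps Pacemaker probing and monitoring the clone on both nodes without ever starting or stopping it):

clone cl_serviceMonitoring serviceMonitoring \
    meta clone-max=2 is-managed=false

The existing colocation and order constraints would then need to reference cl_serviceMonitoring instead of the bare primitive.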
Re: [Pacemaker] issues when installing on pxe booted environment
On 2013-03-22 19:31, John White wrote:
> Hello Folks,
>
> We're trying to get a corosync/pacemaker instance going on a 4-node cluster that boots via PXE. There have been a number of state/filesystem issues, but those appear to be *mostly* taken care of thus far. We're now running into an issue where cib just isn't staying up, with errors akin to the following (sorry for the lengthy dump; note the attrd and cib connection errors). Any ideas would be greatly appreciated:
>
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

Is that /var/run/crm directory available, owned by hacluster.haclient ... and writable by at least the hacluster user?

Regards,
Andreas

-- Need help with Pacemaker? http://www.hastexo.com/now
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
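A quick way to verify the ownership Andreas asks about (path taken from the log; the expected owner, group, and mode are assumptions based on typical Pacemaker packaging):

# ls -ld /var/run/crm
# if missing or wrongly owned, something like:
# mkdir -p /var/run/crm && chown hacluster:haclient /var/run/crm && chmod 750 /var/run/crm

On a PXE-booted node /var/run is often a fresh tmpfs, so such directories may need to be recreated on every boot before Pacemaker starts.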
Re: [Pacemaker] pacemaker node stuck offline
On 2013-03-22 03:39, pacema...@feystorm.net wrote:
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the nodes went offline, and I can't see any reason why. Attached are the logs from the 2 nodes (the relevant timeframe seems to be 2013-03-21 between 06:05 and 06:10). This is on Ubuntu 12.04.
>>
>> Looks like your non-redundant cluster communication was interrupted at around that time for whatever reason and your cluster split-brained. Does the drbd replication use a different network connection? If yes, why not use it for a redundant ring setup ... and you should use STONITH. I also wonder why you have defined expected_votes='1' in your cluster.conf.
>>
>> Regards, Andreas
>
> But shouldn't it have recovered? The node shows as OFFLINE, even though it's clearly communicating with the rest of the cluster. What is the procedure for getting the node back online? Anything other than bouncing Pacemaker?

Looks like the cluster has some trouble trying to rejoin the two DCs after the split-brain. Try to stop cman/Pacemaker on i-3307d96b and clean the /var/lib/heartbeat/crm directory there, so it starts with an empty configuration and receives the latest updates from i-a706d8ff.

> Unfortunately no to the different network connection for drbd. These are 2 EC2 instances, so redundant connections aren't available. Though since it is EC2, I could set up STONITH to whack the other instance. The only problem here would be a race condition. The EC2 API for shutting down or rebooting an instance isn't instantaneous. Both nodes could end up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor operation of the stonith primitive ... but it works ;-)

> As for expected_votes=1, it's because it's a two-node cluster. Though I apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker cluster; you can tell Pacemaker to ignore loss of quorum.

Regards,
Andreas

-- Need help with Pacemaker? http://www.hastexo.com/now
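A sketch of the kind of stonith primitive Andreas describes, in crm shell syntax. The agent name stonith:external/ec2 is an assumption here (EC2 fencing agents at the time were community-maintained and varied, and each has its own required parameters); the point is the generous timeouts to ride out the slow EC2 API:

primitive p_stonith_ec2 stonith:external/ec2 \
    op start timeout=600s \
    op monitor interval=300s timeout=600s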
Re: [Pacemaker] DRBD+LVM+NFS problems
I have now reduced the configuration further and removed LVM from the picture. Still the cluster fails when I set the master node to standby. What's interesting is that things get fixed when I issue a simple cleanup for the filesystem resource. This is what my current config looks like:

node nfs1 \
    attributes standby=off
node nfs2
primitive p_drbd_web1 ocf:linbit:drbd \
    params drbd_resource=web1 \
    op monitor interval=15 role=Master \
    op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/srv/nfs/web1 fstype=ext4 \
    op monitor interval=10s
ms ms_drbd_web1 p_drbd_web1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation c_web1_on_drbd inf: ms_drbd_web1:Master p_fs_web1
order o_drbd_before_web1 inf: ms_drbd_web1:promote p_fs_web1
property $id="cib-bootstrap-options" \
    dc-version="1.1.8-7.el6-394e906" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    last-lrm-refresh=1364259713 \
    maintenance-mode=false
rsc_defaults $id="rsc-options" \
    resource-stickiness=100

I cannot figure out what is wrong with this configuration.

Regards,
Dennis

On 25.03.2013 13:09, Dennis Jacobfeuerborn wrote:
> I just found the following in the dmesg output, which might or might not add to understanding the problem:
>
> device-mapper: table: 253:2: linear: dm-linear: Device lookup failed
> device-mapper: ioctl: error adding target to table
>
> Regards,
> Dennis
>
> On 25.03.2013 13:04, Dennis Jacobfeuerborn wrote:
>> Hi,
>>
>> I'm currently trying to create a two-node redundant NFS setup on CentOS 6.4 using pacemaker and crmsh. I'm using this document as a starting point: https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
>>
>> The first issue is that using these instructions I get the cluster up and running, but the moment I try to stop the pacemaker service on the current master node, several resources just fail and everything goes pear-shaped. Since the problem seemed to relate to the NFS bits in the configuration, I removed these in order to get to a minimal working setup and then add things back piece by piece to find the source of the problem. Now I am at a point where I basically have only DRBD+LVM+Filesystems+IPaddr2 configured, and now LVM seems to act up. I can start the cluster and everything is fine, but the moment I stop pacemaker on the master I end up with this status:
>>
>> ===
>> Node nfs2: standby
>> Online: [ nfs1 ]
>> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>>     Masters: [ nfs1 ]
>>     Stopped: [ p_drbd_nfs:1 ]
>> Failed actions:
>>     p_lvm_nfs_start_0 (node=nfs1, call=505, rc=1, status=complete): unknown error
>> ===
>>
>> and in the log on nfs1 I see:
>>
>> LVM(p_lvm_nfs)[7515]: 2013/03/25_12:34:21 ERROR: device-mapper: reload ioctl on failed: Invalid argument device-mapper: reload ioctl on failed: Invalid argument 2 logical volume(s) in volume group nfs now active
>>
>> However, lvs in this state shows:
>>
>> [root@nfs1 ~]# lvs
>>   LV      VG            Attr      LSize   Pool Origin Data% Move Log
>>   web1    nfs           -wi--       2,00g
>>   web2    nfs           -wi--       2,00g
>>   lv_root vg_nfs1.local -wi-ao---   2,45g
>>   lv_swap vg_nfs1.local -wi-ao--- 256,00m
>>
>> So the volume group is present.
>> My current configuration looks like this:
>>
>> node nfs1 \
>>     attributes standby=off
>> node nfs2 \
>>     attributes standby=on
>> primitive p_drbd_nfs ocf:linbit:drbd \
>>     params drbd_resource=nfs \
>>     op monitor interval=15 role=Master \
>>     op monitor interval=30 role=Slave
>> primitive p_fs_web1 ocf:heartbeat:Filesystem \
>>     params device=/dev/nfs/web1 directory=/srv/nfs/web1 fstype=ext4 \
>>     op monitor interval=10s
>> primitive p_fs_web2 ocf:heartbeat:Filesystem \
>>     params device=/dev/nfs/web2 directory=/srv/nfs/web2 fstype=ext4 \
>>     op monitor interval=10s
>> primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
>>     params ip=10.99.0.142 cidr_netmask=24 \
>>     op monitor interval=30s
>> primitive p_lvm_nfs ocf:heartbeat:LVM \
>>     params volgrpname=nfs \
>>     op monitor interval=30s
>> group g_nfs p_lvm_nfs p_fs_web1 p_fs_web2 p_ip_nfs
>> ms ms_drbd_nfs p_drbd_nfs \
>>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>> colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.8-7.el6-394e906" \
>>     cluster-infrastructure="classic openais (with plugin)" \
>>     expected-quorum-votes=2 \
>>     stonith-enabled=false \
>>     no-quorum-policy=ignore \
>>     last-lrm-refresh=1364212090 \
>>     maintenance-mode=false
>> rsc_defaults $id="rsc_defaults-options" \
>>     resource-stickiness=100
>>
>> Any ideas?
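For reference, the "simple cleanup" Dennis mentions at the top of this message can be issued from crmsh; it clears the resource's failed-action history so the policy engine re-probes and restarts it (resource name taken from his config):

# crm resource cleanup p_fs_web1
# then verify the result:
# crm_mon -1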