[Pacemaker] solaris problem

2013-03-25 Thread Andrei Belov
Hi folks,

I'm trying to build a test HA cluster on Solaris 5.11 using libqb 0.14.4, 
corosync 2.3.0 and pacemaker 1.1.8,
and I'm facing a strange problem while starting pacemaker.

Log shows the following errors:

Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: main: Failed to allocate lrmd 
server.  shutting down
Mar 25 09:21:26 [33722]pengine:error: mainloop_add_ipc_server:  Could 
not start pengine IPC server: Unknown error (-48)
Mar 25 09:21:26 [33722]pengine:error: main: Couldn't start IPC 
server
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
lrmd exited (pid=33720, rc=255)
Mar 25 09:21:26 [33721]  attrd:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/attrd): Permission denied (13)
Mar 25 09:21:26 [33721]  attrd:error: mainloop_add_ipc_server:  Could 
not start attrd IPC server: Unknown error (-13)
Mar 25 09:21:26 [33721]  attrd:error: main: Could not start IPC 
server
Mar 25 09:21:26 [33721]  attrd:error: main: Aborting startup
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
pengine exited (pid=33722, rc=1)
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
attrd exited (pid=33721, rc=100)
Mar 25 09:21:26 [33718]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
Mar 25 09:21:26 [33718]cib:error: mainloop_add_ipc_server:  Could 
not start cib_ro IPC server: Unknown error (-13)
Mar 25 09:21:26 [33718]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
Mar 25 09:21:26 [33718]cib:error: mainloop_add_ipc_server:  Could 
not start cib_rw IPC server: Unknown 

[Pacemaker] Patrik Rapposch is out of the office

2013-03-25 Thread Patrik . Rapposch

I will be out of the office starting 25.03.2013 and will return on
27.03.2013.

Dear Sir or Madam,
I am on a business trip until and including 27.03. I will nevertheless try to
answer your request as quickly as possible. Please always put
ksi.network in CC.
Please note that I am on a business trip till 27.03.13. Always use
ksi.netw...@knapp.com, which ensures that one of our network
administrators takes care of your interest. If you need operating system
support please contact ksi...@knapp-systems.com.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] solaris problem

2013-03-25 Thread LGL Extern
With solaris/openindiana you should use this setting 
export PCMK_ipc_type=socket 
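For instance, for a quick foreground test (pacemakerd -fV is just the 
foreground/verbose invocation used elsewhere in this thread; where the export 
goes permanently depends on your init setup):

  export PCMK_ipc_type=socket
  pacemakerd -fV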

Andreas

-Original Message-
From: Andrei Belov [mailto:defana...@gmail.com] 
Sent: Monday, 25 March 2013 10:43
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] solaris problem

Hi folks,

I'm trying to build test HA cluster on Solaris 5.11 using libqb 0.14.4, 
corosync 2.3.0 and pacemaker 1.1.8, and I'm facing a strange problem while 
starting pacemaker.

Log shows the following errors:

Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: mainloop_add_ipc_server:  Could 
not start lrmd IPC server: Unknown error (-48)
Mar 25 09:21:26 [33720]   lrmd:error: try_server_create:New IPC 
server could not be created because another lrmd process exists, sending 
shutdown command to old lrmd process.
Mar 25 09:21:26 [33720]   lrmd:error: main: Failed to allocate lrmd 
server.  shutting down
Mar 25 09:21:26 [33722]pengine:error: mainloop_add_ipc_server:  Could 
not start pengine IPC server: Unknown error (-48)
Mar 25 09:21:26 [33722]pengine:error: main: Couldn't start IPC 
server
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
lrmd exited (pid=33720, rc=255)
Mar 25 09:21:26 [33721]  attrd:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/attrd): Permission denied (13)
Mar 25 09:21:26 [33721]  attrd:error: mainloop_add_ipc_server:  Could 
not start attrd IPC server: Unknown error (-13)
Mar 25 09:21:26 [33721]  attrd:error: main: Could not start IPC 
server
Mar 25 09:21:26 [33721]  attrd:error: main: Aborting startup
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
pengine exited (pid=33722, rc=1)
Mar 25 09:21:26 [33717] pacemakerd:error: pcmk_child_exit:  Child process 
attrd exited (pid=33721, rc=100)
Mar 25 09:21:26 [33718]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
Mar 25 09:21:26 [33718]cib:error: mainloop_add_ipc_server:  Could 
not start cib_ro IPC 

Re: [Pacemaker] solaris problem

2013-03-25 Thread Andrei Belov
Andreas,

just tried PCMK_ipc_type=socket pacemakerd -fV - a bunch of additional 
event_send errors appeared:

Mar 25 11:15:55 [33641] ha1 corosync error   [MAIN  ] event_send retuned -32, 
expected 256!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 217!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 219!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 256!
Mar 25 11:15:55 [53980]pengine:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/pengine): Permission denied (13)
Mar 25 11:15:55 [53980]pengine:error: mainloop_add_ipc_server:  Could 
not start pengine IPC server: Unknown error (-13)
Mar 25 11:15:55 [53980]pengine:error: main: Couldn't start IPC 
server
Mar 25 11:15:55 [53975] pacemakerd:error: pcmk_child_exit:  Child process 
pengine exited (pid=53980, rc=1)
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 256!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [53979]  attrd:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/attrd): Permission denied (13)
Mar 25 11:15:55 [53979]  attrd:error: mainloop_add_ipc_server:  Could 
not start attrd IPC server: Unknown error (-13)
Mar 25 11:15:55 [53979]  attrd:error: main: Could not start IPC 
server
Mar 25 11:15:55 [53979]  attrd:error: main: Aborting startup
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [53975] pacemakerd:error: pcmk_child_exit:  Child process 
attrd exited (pid=53979, rc=100)
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 256!
Mar 25 11:15:55 [53976]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_ro): Permission denied (13)
Mar 25 11:15:55 [53976]cib:error: mainloop_add_ipc_server:  Could 
not start cib_ro IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_rw): Permission denied (13)
Mar 25 11:15:55 [53976]cib:error: mainloop_add_ipc_server:  Could 
not start cib_rw IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976]cib:error: qb_ipcs_us_publish:   Could 
not bind AF_UNIX (/var/run/cib_shm): Permission denied (13)
Mar 25 11:15:55 [53976]cib:error: mainloop_add_ipc_server:  Could 
not start cib_shm IPC server: Unknown error (-13)
Mar 25 11:15:55 [53976]cib:error: cib_init: Couldnt start 
all IPC channels, exiting.
Mar 25 11:15:55 [53975] pacemakerd:error: pcmk_child_exit:  Child process 
cib exited (pid=53976, rc=255)
Mar 25 11:15:55 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 223!
Mar 25 11:16:04 [53977] stonith-ng:error: setup_cib:Could not 
connect to the CIB service: -134 fd7fc421a0b0
Mar 25 11:16:04 [33641] ha1 corosync error   [SERV  ] event_send retuned -32, 
expected 217!
Mar 25 11:16:04 [53975] pacemakerd:   notice: pcmk_shutdown_worker: 
Attempting to inhibit respawning after fatal error


# fgrep 32 /usr/include/sys/errno.h 
#define EPIPE   32  /* Broken pipe  */



On Mar 25, 2013, at 13:55 , Grüninger, Andreas (LGL Extern) 
andreas.gruenin...@lgl.bwl.de wrote:

 With solaris/openindiana you should use this setting 
 export PCMK_ipc_type=socket 
 
 Andreas
 
 -Original Message-
 From: Andrei Belov [mailto:defana...@gmail.com] 
 Sent: Monday, 25 March 2013 10:43
 To: pacemaker@oss.clusterlabs.org
 Subject: [Pacemaker] solaris problem
 
 Hi folks,
 
 I'm trying to build test HA cluster on Solaris 5.11 using libqb 0.14.4, 
 corosync 2.3.0 and pacemaker 1.1.8, and I'm facing a strange problem while 
 starting pacemaker.
 
 Log shows the following errors:
 
 Mar 25 09:21:26 [33720]   lrmd:

Re: [Pacemaker] CMAN, corosync pacemaker

2013-03-25 Thread Lars Marowsky-Bree
On 2013-03-21T15:28:17, Leon Fauster leonfaus...@googlemail.com wrote:

  I believe the preferred pacemaker based HA configuration in RHEL 6.4 uses 
  all three packages and the preferred configuration in SLES11 SP2 is just 
  corosync/pacemaker (I do not believe CMAN is even available in SLE-HAE).
  Why the different approaches and what is the advantage of each 
  configuration?
 cman is part of the RH cluster suite - the motivations for the different HA 
 approaches of the two vendors are surely based more on strategic/marketing 
 needs.

There's also a technical point of cman/pacemaker not being available
when SLE HA 11 was created, and it's not possible to do a rolling
upgrade to this configuration. Hence, cman will not appear in SLE HA
11.

If there is a technical benefit to the approach, we will reevaluate this
for SLE HA 12.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Linking lib/cib and lib/pengine to each other?

2013-03-25 Thread Viacheslav Dubrovskyi
On 23.03.2013 08:27, Viacheslav Dubrovskyi wrote:
 Hi.

 I'm building a package for my distribution. Everything builds, but the
 package does not pass our internal tests. I get errors like this:
 verify-elf: ERROR: ./usr/lib/libpe_status.so.4.1.0: undefined symbol:
 get_object_root

 It means that libpe_status.so is not linked with libcib.so, where
 get_object_root is defined. I can easily fix it by adding
 libpe_status_la_LIBADD  =  $(top_builddir)/lib/cib/libcib.la
 in lib/pengine/Makefile.am

 But for this I need to build libcib before lib/pengine, which is also
 impossible, because libcib uses symbols from lib/pengine. So we have a
 situation where the two libraries must be linked to each other.

 And this is very bad, because then in fact it should be one library,
 or the shared symbols should be put into a third library, such as common.

 Can anyone comment on this situation?
A patch to fix this error is below.

-- 
WBR,
Viacheslav Dubrovskyi

diff --git a/Makefile.am b/Makefile.am
index 4f742e4..fdf19eb 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -23,7 +23,7 @@ EXTRA_DIST  = autogen.sh ConfigureMe README.in libltdl.tar m4/gnulib
 MAINTAINERCLEANFILES= Makefile.in aclocal.m4 configure DRF/config-h.in \
 DRF/stamp-h.in libtool.m4 ltdl.m4 libltdl.tar
 
-CORE	= $(LIBLTDL_DIR) replace include lib mcp pengine cib crmd fencing lrmd tools xml
+CORE	= $(LIBLTDL_DIR) replace include lib mcp cib pengine crmd fencing lrmd tools xml
 SUBDIRS	= $(CORE) cts extra doc
 
 doc_DATA = AUTHORS COPYING COPYING.LIB
diff --git a/lib/Makefile.am b/lib/Makefile.am
index 5563819..4ebd91b 100644
--- a/lib/Makefile.am
+++ b/lib/Makefile.am
@@ -39,7 +39,7 @@ clean-local:
 	rm -f *.pc
 
 ## Subdirectories...
-SUBDIRS	= gnu common pengine transition cib fencing services lrmd cluster
+SUBDIRS	= gnu common cib pengine transition fencing services lrmd cluster
 DIST_SUBDIRS =  $(SUBDIRS) ais
 
 if BUILD_CS_PLUGIN
diff --git a/lib/cib/Makefile.am b/lib/cib/Makefile.am
index 6ab02fc..c73a329 100644
--- a/lib/cib/Makefile.am
+++ b/lib/cib/Makefile.am
@@ -32,11 +32,14 @@ if ENABLE_ACL
 libcib_la_SOURCES   += cib_acl.c
 endif
 
-libcib_la_LDFLAGS	= -version-info 3:0:0 $(top_builddir)/lib/common/libcrmcommon.la $(CRYPTOLIB) \
-			$(top_builddir)/lib/pengine/libpe_rules.la
-
+libcib_la_LDFLAGS	= -version-info 3:0:0 -L$(top_builddir)/lib/pengine/.libs
+libcib_la_LIBADD= $(CRYPTOLIB) $(top_builddir)/lib/pengine/libpe_rules.la $(top_builddir)/lib/common/libcrmcommon.la
 libcib_la_CFLAGS	= -I$(top_srcdir)
 
+libcib_la_DEPENDENCIES  = libpe_rules
+libpe_rules:
+	make -C ../../lib/pengine libpe_rules.la
+
 clean-generic:
 	rm -f *.log *.debug *.xml *~
 
diff --git a/lib/pengine/Makefile.am b/lib/pengine/Makefile.am
index 9cb2392..a173522 100644
--- a/lib/pengine/Makefile.am
+++ b/lib/pengine/Makefile.am
@@ -28,10 +28,11 @@ noinst_HEADERS	= unpack.h variant.h
 
 libpe_rules_la_LDFLAGS	= -version-info 2:2:0
 libpe_rules_la_SOURCES	= rules.c common.c
+libpe_rules_la_LIBADD	= $(top_builddir)/lib/common/libcrmcommon.la
 
 libpe_status_la_LDFLAGS	= -version-info 5:0:1
 libpe_status_la_SOURCES	=  status.c unpack.c utils.c complex.c native.c group.c clone.c rules.c common.c
-libpe_status_la_LIBADD	=  @CURSESLIBS@
+libpe_status_la_LIBADD  =  @CURSESLIBS@ $(top_builddir)/lib/common/libcrmcommon.la $(top_builddir)/lib/cib/libcib.la
 
 clean-generic:
 	rm -f *.log *.debug *~
diff --git a/lib/services/Makefile.am b/lib/services/Makefile.am
index 3ee3347..ef8fbc3 100644
--- a/lib/services/Makefile.am
+++ b/lib/services/Makefile.am
@@ -26,7 +26,7 @@ noinst_HEADERS  = upstart.h systemd.h services_private.h
 libcrmservice_la_SOURCES = services.c services_linux.c
 libcrmservice_la_LDFLAGS = -version-info 1:0:0
 libcrmservice_la_CFLAGS  = $(GIO_CFLAGS)
-libcrmservice_la_LIBADD   = $(GIO_LIBS)
+libcrmservice_la_LIBADD   = $(GIO_LIBS) $(top_builddir)/lib/common/libcrmcommon.la
 
 if BUILD_UPSTART
 libcrmservice_la_SOURCES += upstart.c
diff --git a/lib/transition/Makefile.am b/lib/transition/Makefile.am
index 49c7113..7279c59 100644
--- a/lib/transition/Makefile.am
+++ b/lib/transition/Makefile.am
@@ -29,6 +29,7 @@ libtransitioner_la_SOURCES	= unpack.c graph.c utils.c
 
 libtransitioner_la_LDFLAGS	= -version-info 2:0:0
 libtransitioner_la_CFLAGS	= -I$(top_builddir)
+libtransitioner_la_LIBADD   = $(top_builddir)/lib/common/libcrmcommon.la
 
 clean-generic:
 	rm -f *~
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] solaris problem

2013-03-25 Thread Andrei Belov

I've rebuilt libqb using a separate SOCKETDIR (/var/run/qb), and set 
hacluster:haclient ownership on this dir.

After that, pacemakerd started successfully with all its children:

[root@ha1 /var/run/qb]# pacemakerd -fV
Could not establish pacemakerd connection: Connection refused (146)
info: crm_ipc_connect:  Could not establish pacemakerd connection: 
Connection refused (146)
info: get_cluster_type: Detected an active 'corosync' cluster
info: read_config:  Reading configure for stack: corosync
  notice: crm_add_logfile:  Additional logging available in 
/var/log/cluster/corosync.log
  notice: main: Starting Pacemaker 1.1.8 (Build: 1f8858c):  ncurses 
libqb-logging libqb-ipc upstart systemd  corosync-native
info: main: Maximum core file size is: 18446744073709551613
info: qb_ipcs_us_publish:   server name: pacemakerd
  notice: update_node_processes:48de70 Node 182452614 now known as ha1, 
was: 
info: start_child:  Forked child 60719 for process cib
info: start_child:  Forked child 60720 for process stonith-ng
info: start_child:  Forked child 60721 for process lrmd
info: start_child:  Forked child 60722 for process attrd
info: start_child:  Forked child 60723 for process pengine
info: start_child:  Forked child 60724 for process crmd
info: main: Starting mainloop

[root@ha1 /var/run/qb]# ls -l
total 0
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 attrd
srwxrwxrwx 1 root  root 0 Mar 25 11:43 cfg
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_ro
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_rw
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_shm
srwxrwxrwx 1 root  root 0 Mar 25 11:43 cmap
srwxrwxrwx 1 root  root 0 Mar 25 11:43 cpg
srwxrwxrwx 1 root  root 0 Mar 25 11:50 lrmd
srwxrwxrwx 1 root  root 0 Mar 25 11:50 pacemakerd
srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 pengine
srwxrwxrwx 1 root  root 0 Mar 25 11:43 quorum
srwxrwxrwx 1 root  root 0 Mar 25 11:50 stonith-ng

However, libqb still cannot create some files in /var/run due to insufficient 
permissions:

Mar 25 11:50:45 [60719]cib: info: init_cs_connection_once:  
Connection to 'corosync': established
Mar 25 11:50:45 [60719]cib: info: crm_get_peer: Node 182452614 
is now known as ha1
Mar 25 11:50:45 [60719]cib: info: crm_get_peer: Node 182452614 
has uuid 182452614
Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   server 
name: cib_ro
Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   server 
name: cib_rw
Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   server 
name: cib_shm
Mar 25 11:50:45 [60719]cib: info: cib_init: Starting cib 
mainloop
Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
Joined[0.0] cib.182452614 
Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
Member[0.0] cib.182452614 
Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
Member[0.1] cib.182452614 
Mar 25 11:50:46 [60719]cib:error: qb_sys_mmap_file_open:
couldn't open file /var/run/qb-cib_rw-control-60719-60720-15: Permission denied 
(13)
Mar 25 11:50:46 [60719]cib:error: qb_ipcs_us_connect:   
couldn't create file for mmap (60719-60720-15): Permission denied (13)
Mar 25 11:50:46 [60719]cib:error: handle_new_connection:Invalid 
IPC credentials (60719-60720-15).
Mar 25 11:50:46 [60720] stonith-ng: info: crm_ipc_connect:  Could not 
establish cib_rw connection: Permission denied (13)
Mar 25 11:50:46 [60719]cib:error: qb_sys_mmap_file_open:
couldn't open file /var/run/qb-cib_shm-control-60719-60724-16: Permission 
denied (13)
Mar 25 11:50:46 [60719]cib:error: qb_ipcs_us_connect:   
couldn't create file for mmap (60719-60724-16): Permission denied (13)
Mar 25 11:50:46 [60719]cib:error: handle_new_connection:Invalid 
IPC credentials (60719-60724-16).
Mar 25 11:50:46 [60724]   crmd: info: crm_ipc_connect:  Could not 
establish cib_shm connection: Permission denied (13)
Mar 25 11:50:46 [60724]   crmd: info: do_cib_control:   Could not 
connect to the CIB service: Transport endpoint is not connected
Mar 25 11:50:46 [60724]   crmd:  warning: do_cib_control:   Couldn't 
complete CIB registration 1 times... pause and retry


If someone has a working setup on Linux with corosync 2.x, libqb and pacemaker 
1.1.x - I'd really appreciate any information about the places 
which libqb uses for its special socket files.

Thanks in advance!

(Can we say now that this problem is libqb-related, not pacemaker-related?)



On Mar 25, 2013, at 15:30 , Andrei Belov defana...@gmail.com wrote:

 Andreas,
 
 just tried PCMK_ipc_type=socket pacemaker -fV - a bunch of additional 
 event_send errors appeared:
 
 Mar 25 11:15:55 [33641] ha1 corosync error   [MAIN  ] event_send 

[Pacemaker] DRBD+LVM+NFS problems

2013-03-25 Thread Dennis Jacobfeuerborn

Hi,
I'm currently trying to create a two-node redundant NFS setup on CentOS 6.4 
using pacemaker and crmsh.


I used this document as a starting point:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html

The first issue is that with these instructions I get the cluster up 
and running, but the moment I try to stop the pacemaker service on the 
current master node, several resources just fail and everything goes 
pear-shaped.


Since the problem seems to relate to the NFS bits in the configuration, I 
removed these in order to get to a minimal working setup, and will then add 
things back piece by piece in order to find the source of the problem.


Now I am at a point where I basically have only 
DRBD+LVM+Filesystem+IPaddr2 configured, and LVM seems to act up.


I can start the cluster and everything is fine, but the moment I stop 
pacemaker on the master I end up with this status:


===
Node nfs2: standby
Online: [ nfs1 ]

 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
 Masters: [ nfs1 ]
 Stopped: [ p_drbd_nfs:1 ]

Failed actions:
p_lvm_nfs_start_0 (node=nfs1, call=505, rc=1, status=complete): 
unknown error

===

and in the log on nfs1 I see:
LVM(p_lvm_nfs)[7515]:	2013/03/25_12:34:21 ERROR: device-mapper: reload 
ioctl on failed: Invalid argument device-mapper: reload ioctl on failed: 
Invalid argument 2 logical volume(s) in volume group nfs now active


However, lvs in this state shows:
[root@nfs1 ~]# lvs
  LV      VG            Attr      LSize   Pool Origin Data%  Move Log
  web1    nfs           -wi--       2,00g
  web2    nfs           -wi--       2,00g
  lv_root vg_nfs1.local -wi-ao---   2,45g
  lv_swap vg_nfs1.local -wi-ao--- 256,00m

So the volume group is present.

My current configuration looks like this:

node nfs1 \
attributes standby=off
node nfs2 \
attributes standby=on
primitive p_drbd_nfs ocf:linbit:drbd \
params drbd_resource=nfs \
op monitor interval=15 role=Master \
op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
params device=/dev/nfs/web1 \
  directory=/srv/nfs/web1 \
  fstype=ext4 \
op monitor interval=10s
primitive p_fs_web2 ocf:heartbeat:Filesystem \
params device=/dev/nfs/web2 \
  directory=/srv/nfs/web2 \
  fstype=ext4 \
op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
params ip=10.99.0.142 cidr_netmask=24 \
op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
params volgrpname=nfs \
op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_web1 p_fs_web2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
meta master-max=1 \
  master-node-max=1 \
  clone-max=2 \
  clone-node-max=1 \
  notify=true
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
property $id=cib-bootstrap-options \
dc-version=1.1.8-7.el6-394e906 \
cluster-infrastructure=classic openais (with plugin) \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1364212090 \
maintenance-mode=false
rsc_defaults $id=rsc_defaults-options \
resource-stickiness=100

Any ideas why this isn't working?

Regards,
  Dennis

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] DRBD+LVM+NFS problems

2013-03-25 Thread Dennis Jacobfeuerborn
I just found the following in the dmesg output which might or might not 
add to understanding the problem:


device-mapper: table: 253:2: linear: dm-linear: Device lookup failed
device-mapper: ioctl: error adding target to table

Regards,
  Dennis

On 25.03.2013 13:04, Dennis Jacobfeuerborn wrote:

Hi,
I'm currently trying create a two node redundant NFS setup on CentOS 6.4
using pacemaker and crmsh.

I use this Document as a starting poing:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html


The first issue is that using these instructions I get the cluster up
and running but the moment I try to stop the pacemaker service on the
current master node several resources just fail and everything goes
pear-shaped.

Since the problem seems to relate to the nfs bits in the configuration I
removed these in order to get to a minimal working setup and then add
things piece by piece in order to find the source of the problem.

Now I am at a point where I basically have only
DRBD+LVM+Filesystems+IPAddr2 configured and now LVM seems to act up.

I can start the cluster and everything is fine but the moment I stop
pacemaker on the master i end up with this as a status:

===
Node nfs2: standby
Online: [ nfs1 ]

  Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
  Masters: [ nfs1 ]
  Stopped: [ p_drbd_nfs:1 ]

Failed actions:
 p_lvm_nfs_start_0 (node=nfs1, call=505, rc=1, status=complete):
unknown error
===

and in the log on nfs1 I see:
LVM(p_lvm_nfs)[7515]:2013/03/25_12:34:21 ERROR: device-mapper:
reload ioctl on failed: Invalid argument device-mapper: reload ioctl on
failed: Invalid argument 2 logical volume(s) in volume group nfs now
active

However a lvs in this state shows:
[root@nfs1 ~]# lvs
   LV  VGAttr  LSize   Pool Origin Data%  Move Log
   web1nfs   -wi--   2,00g
   web2nfs   -wi--   2,00g
   lv_root vg_nfs1.local -wi-ao---   2,45g
   lv_swap vg_nfs1.local -wi-ao--- 256,00m

So the volume group is present.

My current configuration looks like this:

node nfs1 \
 attributes standby=off
node nfs2 \
 attributes standby=on
primitive p_drbd_nfs ocf:linbit:drbd \
 params drbd_resource=nfs \
 op monitor interval=15 role=Master \
 op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
 params device=/dev/nfs/web1 \
   directory=/srv/nfs/web1 \
   fstype=ext4 \
 op monitor interval=10s
primitive p_fs_web2 ocf:heartbeat:Filesystem \
 params device=/dev/nfs/web2 \
   directory=/srv/nfs/web2 \
   fstype=ext4 \
 op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
 params ip=10.99.0.142 cidr_netmask=24 \
 op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
 params volgrpname=nfs \
 op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_web1 p_fs_web2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
 meta master-max=1 \
   master-node-max=1 \
   clone-max=2 \
   clone-node-max=1 \
   notify=true
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
property $id=cib-bootstrap-options \
 dc-version=1.1.8-7.el6-394e906 \
 cluster-infrastructure=classic openais (with plugin) \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1364212090 \
 maintenance-mode=false
rsc_defaults $id=rsc_defaults-options \
 resource-stickiness=100

Any ideas why this isn't working?

Regards,
   Dennis

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] racing crm commands... last write wins?

2013-03-25 Thread Dejan Muhamedagic
On Wed, Mar 20, 2013 at 10:40:10AM -0700, Bob Haxo wrote:
 Regarding the replace triggering a DC election ... which is causing
 issues with scripted installs ... how do I determine which crm commands
 will NOT trigger this election?

It seems like every configure commit could possibly result in a
new election. But I'm not sure what it depends on.

 I need a way of avoiding this election
 while installing.
 
 I'm finding that when repeating the scripted install with the same
 commands, sometimes the DC election gets triggered and sometimes it does
 not.

With the same configuration updates?

 With the DC election, these messages get logged, followed by the
 whole xml version of the configuration.
 
 Call cib_replace failed (-62): Timer expired

This is a problem connecting to the cib process, i.e. it's not
related to a configuration update (as it cannot proceed anyway).

 ERROR: 55: could not replace cib
 INFO: 55: offending xml: configuration
 
 Any suggestions for avoiding replacing rather than incrementally
 modifying the configuration?

Not right now. The configuration update process in crmsh needs to
be modified.

Thanks,

Dejan

 Thanks,
 Bob Haxo
 SGI
 
 
 
 On Mon, 2013-03-04 at 17:25 +0100, Lars Marowsky-Bree wrote:
  On 2013-03-04T17:14:28, Dejan Muhamedagic deja...@fastmail.fm wrote:
  
Thought so at the time, yes. And I do think it cleaned up a few things,
we just need to improve it. The full CIB replace also seems to trigger
an election ...
   I think that used to happen in Heartbeat clusters but got fixed
   in the meantime, the details are a bit foggy.
  
  No, if you look at the current logs on the DC, you'll also see this
  happening. I think it's the replace of the node section that triggers
  it.
  
  
Then most of the logic in crmsh would remain unchanged (i.e., it'd still
operate on whole CIBs only), but the way how it passes it on to
Pacemaker would improve. I hope.
   crmsh currently doesn't keep the original copy of the CIB.
  
  Right, but that should be a simple thing to add and prototype quickly.
  (Says he who isn't going to be the one doing it ;-)
  
   Anyway, this approach is worth investigating.
  
  Thanks, let me know how it goes!
  
  
  Regards,
  Lars
  
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Angel L. Mateo

Hello,

	I am a newbie with pacemaker (and, generally, with HA clusters). I have 
configured a two-node cluster. Both nodes are virtual machines (VMware 
ESX) and use shared storage (provided by a SAN, although access to the 
SAN is from the ESX infrastructure and the VMs see it as a SCSI disk). I 
have configured clvm so logical volumes are only active on one of the nodes.


	Now I need some help with the stonith configuration to avoid data 
corruption. Since I'm using ESX virtual machines, I think I won't have 
any problem using the external/vcenter stonith plugin to shut down virtual 
machines.


	My problem is how to avoid a split brain situation with this 
configuration, without configuring a 3rd node. I have read about quorum 
disks, the external/sbd stonith plugin and other references, but I'm too 
confused by all this.


	For example, [1] mentions techniques to improve quorum with SCSI reserve 
or a quorum daemon, but it doesn't explain how to do this with pacemaker. Or 
[2] talks about external/sbd.


Any help?

PS: I have attached my corosync.conf and crm configure show outputs

[1] 
http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html

[2] http://www.gossamer-threads.com/lists/linuxha/pacemaker/78887

--
Angel L. Mateo Martínez
Sección de Telemática
Área de Tecnologías de la Información
y las Comunicaciones Aplicadas (ATICA)
http://www.um.es/atica
Tfo: 868889150
Fax: 86337
# Please read the openais.conf.5 manual page

totem {
version: 2

# How long before declaring a token lost (ms)
token: 3000

# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10

# How long to wait for join messages in the membership protocol (ms)
join: 60

# How long to wait for consensus to be achieved before starting a new 
round of membership configuration (ms)
consensus: 3600

# Turn off the virtual synchrony filter
vsftype: none

# Number of messages that may be sent by one processor on receipt of 
the token
max_messages: 20

# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes

# Disable encryption
secauth: off

# How many threads to use for encryption/decryption
threads: 0

# Optionally assign a fixed node id (integer)
# nodeid: 1234

# This specifies the mode of redundant ring, which may be none, active, 
or passive.
rrp_mode: none

interface {
# The following values need to be set based on your environment 
ringnumber: 0
bindnetaddr: 155.54.211.160
mcastaddr: 226.94.1.1
mcastport: 5405
}
}

amf {
mode: disabled
}

service {
# Load the Pacemaker Cluster Resource Manager
ver:   1
name:  pacemaker
}

aisexec {
user:   root
group:  root
}

logging {
fileline: off
to_stderr: yes
to_logfile: no
to_syslog: yes
syslog_facility: daemon
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
tags: enter|leave|trace1|trace2|trace3|trace4|trace6
}
}
node myotis51
node myotis52
primitive clvm ocf:lvm2:clvmd \
params daemon_timeout=30 \
meta target-role=Started
primitive dlm ocf:pacemaker:controld \
meta target-role=Started
primitive vg_users1 ocf:heartbeat:LVM \
params volgrpname=UsersDisk exclusive=yes \
op monitor interval=60 timeout=60
group dlm-clvm dlm clvm
clone dlm-clvm-clone dlm-clvm \
meta interleave=true ordered=true target-role=Started
location cli-prefer-vg_users1 vg_users1 \
rule $id=cli-prefer-rule-vg_users1 inf: #uname eq myotis52
property $id=cib-bootstrap-options \
dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1364212376
rsc_defaults $id=rsc-options \
resource-stickiness=100

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread emmanuel segura
I have a production cluster using two VMs on an ESX cluster; for stonith I'm
using sbd, and everything works fine.


2013/3/25 emmanuel segura emi2f...@gmail.com

 I have a production cluster, using two vm on esx cluster, for stonith i'm
 using sbd, everything work find

 2013/3/25 Angel L. Mateo ama...@um.es

 Hello,

 I am newbie with pacemaker (and, generally, with ha clusters). I
 have configured a two nodes cluster. Both nodes are virtual machines
 (vmware esx) and use a shared storage (provided by a SAN, although access
 to the SAN is from esx infrastructure and VM consider it as scsi disk). I
 have configured clvm so logical volumes are only active in one of the nodes.

 Now I need some help with the stonith configuration to avoid data
 corrumption. Since I'm using ESX virtual machines, I think I won't have any
 problem using external/vcenter stonith plugin to shutdown virtual machines.

 My problem is how to avoid split brain situation with this
 configuration, without configuring a 3rd node. I have read about quorum
 disks, external/sbd stonith plugin and other references, but I'm too
 confused with all this.

 For example, [1] mention techniques to improve quorum with scsi
 reserve or quorum daemon, but it didn't point to how to do this pacemaker.
 Or [2] talks about external/sbd.

 Any help?

 PS: I have attached my corosync.conf and crm configure show outputs

 [1] http://techthoughts.typepad.**com/managing_computers/2007/**
 10/split-brain-quo.htmlhttp://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
 [2] 
 http://www.gossamer-threads.**com/lists/linuxha/pacemaker/**78887http://www.gossamer-threads.com/lists/linuxha/pacemaker/78887

 --
 Angel L. Mateo Martínez
 Sección de Telemática
 Área de Tecnologías de la Información
 y las Comunicaciones Aplicadas (ATICA)
 http://www.um.es/atica
 Tfo: 868889150
 Fax: 86337

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 --
 esta es mi vida e me la vivo hasta que dios quiera




-- 
esta es mi vida e me la vivo hasta que dios quiera
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread emmanuel segura
I have a production cluster using two VMs on an ESX cluster; for stonith I'm
using sbd, and everything works fine.

2013/3/25 Angel L. Mateo ama...@um.es

 Hello,

 I am newbie with pacemaker (and, generally, with ha clusters). I
 have configured a two nodes cluster. Both nodes are virtual machines
 (vmware esx) and use a shared storage (provided by a SAN, although access
 to the SAN is from esx infrastructure and VM consider it as scsi disk). I
 have configured clvm so logical volumes are only active in one of the nodes.

 Now I need some help with the stonith configuration to avoid data
 corrumption. Since I'm using ESX virtual machines, I think I won't have any
 problem using external/vcenter stonith plugin to shutdown virtual machines.

 My problem is how to avoid split brain situation with this
 configuration, without configuring a 3rd node. I have read about quorum
 disks, external/sbd stonith plugin and other references, but I'm too
 confused with all this.

 For example, [1] mention techniques to improve quorum with scsi
 reserve or quorum daemon, but it didn't point to how to do this pacemaker.
 Or [2] talks about external/sbd.

 Any help?

 PS: I have attached my corosync.conf and crm configure show outputs

 [1] http://techthoughts.typepad.**com/managing_computers/2007/**
 10/split-brain-quo.htmlhttp://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
 [2] 
 http://www.gossamer-threads.**com/lists/linuxha/pacemaker/**78887http://www.gossamer-threads.com/lists/linuxha/pacemaker/78887

 --
 Angel L. Mateo Martínez
 Sección de Telemática
 Área de Tecnologías de la Información
 y las Comunicaciones Aplicadas (ATICA)
 http://www.um.es/atica
 Tfo: 868889150
 Fax: 86337

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




-- 
esta es mi vida e me la vivo hasta que dios quiera
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] solaris problem

2013-03-25 Thread Andrei Belov

Ok, I fixed this issue with the following patch against libqb 0.14.4:

--- lib/unix.c.orig 2013-03-25 12:30:50.445762231 +
+++ lib/unix.c  2013-03-25 12:49:59.322276376 +
@@ -83,7 +83,7 @@
 #if defined(QB_LINUX) || defined(QB_CYGWIN)
 	snprintf(path, PATH_MAX, "/dev/shm/%s", file);
 #else
-	snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+	snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
 	is_absolute = path;
 #endif
 	}
@@ -91,7 +91,7 @@
 	if (fd < 0 && !is_absolute) {
 		qb_util_perror(LOG_ERR, "couldn't open file %s", path);
 
-		snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
+		snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
 		fd = open_mmap_file(path, file_flags);
 		if (fd < 0) {
 			res = -errno;


libqb was configured with --with-socket-dir=/var/run/qb, /var/run/qb owned by
hacluster:haclient - this configuration works fine with both corosync 2.3.0 and
pacemaker 1.1.8.
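
For reference, the build/permission steps were roughly as follows (only 
--with-socket-dir and the ownership are certain from the above; the exact 
remaining commands are an assumption):

  ./configure --with-socket-dir=/var/run/qb
  make && make install
  mkdir -p /var/run/qb
  chown hacluster:haclient /var/run/qb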

Though I'm not sure that libqb is the right place to touch - maybe it'd be 
better
to add some enhancements to pacemaker's lib/common/mainloop.c,
mainloop_add_ipc_server() ?


Cheers.


On Mar 25, 2013, at 16:01 , Andrei Belov defana...@gmail.com wrote:

 
 I've rebuilt libqb using separated SOCKETDIR (/var/run/qb), and set 
 hacluster:haclient ownership to this dir.
 
 After that pacemakerd has been successfully started with all its childs:
 
 [root@ha1 /var/run/qb]# pacemakerd -fV
 Could not establish pacemakerd connection: Connection refused (146)
info: crm_ipc_connect:  Could not establish pacemakerd connection: 
 Connection refused (146)
info: get_cluster_type: Detected an active 'corosync' cluster
info: read_config:  Reading configure for stack: corosync
  notice: crm_add_logfile:  Additional logging available in 
 /var/log/cluster/corosync.log
  notice: main: Starting Pacemaker 1.1.8 (Build: 1f8858c):  ncurses 
 libqb-logging libqb-ipc upstart systemd  corosync-native
info: main: Maximum core file size is: 18446744073709551613
info: qb_ipcs_us_publish:   server name: pacemakerd
  notice: update_node_processes:48de70 Node 182452614 now known as 
 ha1, was: 
info: start_child:  Forked child 60719 for process cib
info: start_child:  Forked child 60720 for process stonith-ng
info: start_child:  Forked child 60721 for process lrmd
info: start_child:  Forked child 60722 for process attrd
info: start_child:  Forked child 60723 for process pengine
info: start_child:  Forked child 60724 for process crmd
info: main: Starting mainloop
 
 [root@ha1 /var/run/qb]# ls -l
 total 0
 srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 attrd
 srwxrwxrwx 1 root  root 0 Mar 25 11:43 cfg
 srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_ro
 srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_rw
 srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 cib_shm
 srwxrwxrwx 1 root  root 0 Mar 25 11:43 cmap
 srwxrwxrwx 1 root  root 0 Mar 25 11:43 cpg
 srwxrwxrwx 1 root  root 0 Mar 25 11:50 lrmd
 srwxrwxrwx 1 root  root 0 Mar 25 11:50 pacemakerd
 srwxrwxrwx 1 hacluster root 0 Mar 25 11:50 pengine
 srwxrwxrwx 1 root  root 0 Mar 25 11:43 quorum
 srwxrwxrwx 1 root  root 0 Mar 25 11:50 stonith-ng
 
 However, libqb still can not create some files in /var/run due to 
 insufficient permissions:
 
 Mar 25 11:50:45 [60719]cib: info: init_cs_connection_once:  
 Connection to 'corosync': established
 Mar 25 11:50:45 [60719]cib: info: crm_get_peer: Node 
 182452614 is now known as ha1
 Mar 25 11:50:45 [60719]cib: info: crm_get_peer: Node 
 182452614 has uuid 182452614
 Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   
 server name: cib_ro
 Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   
 server name: cib_rw
 Mar 25 11:50:45 [60719]cib: info: qb_ipcs_us_publish:   
 server name: cib_shm
 Mar 25 11:50:45 [60719]cib: info: cib_init: Starting cib 
 mainloop
 Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
 Joined[0.0] cib.182452614 
 Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
 Member[0.0] cib.182452614 
 Mar 25 11:50:45 [60719]cib: info: pcmk_cpg_membership:  
 Member[0.1] cib.182452614 
 Mar 25 11:50:46 [60719]cib:error: qb_sys_mmap_file_open:
 couldn't open file /var/run/qb-cib_rw-control-60719-60720-15: Permission 
 denied (13)
 Mar 25 11:50:46 [60719]cib:error: qb_ipcs_us_connect:   
 couldn't create file for mmap (60719-60720-15): Permission denied (13)
 Mar 25 11:50:46 [60719]cib:error: handle_new_connection:
 Invalid IPC credentials (60719-60720-15).
 Mar 25 11:50:46 [60720] stonith-ng: info: crm_ipc_connect:  Could not 
 establish cib_rw connection: 

Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Jacek Konieczny
On Mon, 25 Mar 2013 13:54:22 +0100
   My problem is how to avoid split brain situation with this 
 configuration, without configuring a 3rd node. I have read about
 quorum disks, external/sbd stonith plugin and other references, but
 I'm too confused with all this.
 
   For example, [1] mention techniques to improve quorum with
 scsi reserve or quorum daemon, but it didn't point to how to do this
 pacemaker. Or [2] talks about external/sbd.
 
   Any help?


With corosync 2.2 (2.1 too, I guess) you can use, in corosync.conf:

quorum {
provider: corosync_votequorum
expected_votes: 2
two_node: 1
}

Corosync will then manage quorum for the two-node cluster and Pacemaker
can use that. You still need proper fencing to enforce the quorum (both
for pacemaker and the storage layer – dlm in case you use clvmd), but no
extra quorum node is needed.

There is one more thing, though: you need two nodes active to boot the
cluster, but then when one fails (and is fenced) the other may continue,
keeping quorum.
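
A quick way to verify the resulting quorum state at runtime, assuming the
corosync 2.x tools are installed, is:

  corosync-quorumtool -s

which prints the vote counts and whether the partition is currently quorate.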

Greets,
Jacek

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Angel L. Mateo


Jacek Konieczny jaj...@jajcus.net escribió:

On Mon, 25 Mar 2013 13:54:22 +0100
  My problem is how to avoid split brain situation with this 
 configuration, without configuring a 3rd node. I have read about
 quorum disks, external/sbd stonith plugin and other references, but
 I'm too confused with all this.
 
  For example, [1] mention techniques to improve quorum with
 scsi reserve or quorum daemon, but it didn't point to how to do this
 pacemaker. Or [2] talks about external/sbd.
 
  Any help?


With corosync 2.2 (2.1 too, I guess) you can use, in corosync.conf:

quorum {
   provider: corosync_votequorum
   expected_votes: 2
   two_node: 1
}

Corosync will then manage quorum for the two-node cluster and Pacemaker

  I'm using corosync 1.1, which is the one provided with my distribution 
(Ubuntu 12.04). I could also use cman.

can use that. You still need proper fencing to enforce the quorum (both
for pacemaker and the storage layer – dlm in case you use clvmd), but
no
extra quorum node is needed.

  I have configured a dlm resource used with clvm.

  One doubt... With this configuration, how is the split brain problem handled?

There is one more thing, though: you need two nodes active to boot the
cluster, but then when one fails (and is fenced) the other may
continue,
keeping quorum.

Greets,
   Jacek

-- 
Enviado desde mi teléfono Android con K-9 Mail.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] solaris problem

2013-03-25 Thread Andrei Belov
Andreas,

thank you for sharing this link and your start script!

My goal is to make it possible to build those tools in a more convenient way via 
NetBSD's pkgsrc system.
Perhaps using something like --localstatedir=${VARBASE}/cluster for libqb, 
corosync and pacemaker,
and setting the appropriate permissions on /var/cluster, will solve the problem.
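Something along these lines, untested (VARBASE normally expands to /var under 
pkgsrc, and the ownership simply mirrors what already worked for /var/run/qb 
above - both are assumptions):

  ./configure --localstatedir=/var/cluster ...   # same option for libqb, corosync, pacemaker
  mkdir -p /var/cluster
  chown hacluster:haclient /var/cluster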

Thanks again!


On Mar 25, 2013, at 20:35 , Grüninger, Andreas (LGL Extern) 
andreas.gruenin...@lgl.bwl.de wrote:

 Andrei
 
 There is no need to make this change.
 
 I described in 
 http://grueni.github.com/libqb/ 
 how I compiled libqb and the other programs.
 
 LOCALSTATEDIR should be defined with ./configure.
 Please look at Compile Corosync in my description.
 
 I guess your start scripts should be changed.
 
 We use this as start script called by the smf instance
 ##
 #!/usr/bin/bash
 # Start/stop HACluster service
 #
 . /lib/svc/share/smf_include.sh
 
 ## Tracing mit debug version
 # PCMK_trace_files=1
 # PCMK_trace_functions=1
 # PCMK_trace_formats=1
 # PCMK_trace_tags=1
 
 export PCMK_ipc_type=socket
 CLUSTER_USER=hacluster
 COROSYNC=corosync
 PACEMAKERD=pacemakerd
 PACEMAKER_PROCESSES=pacemaker
 APPPATH=/opt/ha/sbin/
 SLEEPINTERVALL=10
 SLEEPCOUNT=5
 SLEPT=0
 
 
 killapp() {
   pid=`pgrep -f $1`
   if [ x$pid != x ]; then
  kill -9 $pid 
   fi
   return 0
 }
 
 start0() {
stop0
su ${CLUSTER_USER} -c ${APPPATH}${COROSYNC}
sleep $sleep0
su ${CLUSTER_USER} -c ${APPPATH}${PACEMAKERD} 
return 0
 }
 
 stop0() {
 # first try, graceful shutdown
pid=`pgrep -U ${CLUSTER_USER} -f ${PACEMAKERD}`
if [ x$pid != x ]; then
   ${APPPATH}${PACEMAKERD} --shutdown 
   sleep $SLEEPINTERVALL
fi
 # second try, kill the rest
killapp ${APPPATH}${COROSYNC}
killapp ${PACEMAKER_PROCESSES}
return 0
 }
 
 let sleep0=$SLEEPINTERVALL/2
 case $1 in
 'start')
start0
;;
 'restart')
stop0
start0
;;
 'stop')
stop0
;;
 *)
echo Usage: -bash { start | stop | restart}
exit 1
;;
 esac
 exit 0
 ###
 
 Andreas
 
 
 -Original Message-
 From: Andrei Belov [mailto:defana...@gmail.com] 
 Sent: Monday, 25 March 2013 15:08
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] solaris problem
 
 
 Ok, I fixed this issue with the following patch against libqb 0.14.4:
 
 --- lib/unix.c.orig 2013-03-25 12:30:50.445762231 +
 +++ lib/unix.c  2013-03-25 12:49:59.322276376 +
 @@ -83,7 +83,7 @@
  #if defined(QB_LINUX) || defined(QB_CYGWIN)
  	snprintf(path, PATH_MAX, "/dev/shm/%s", file);
  #else
 -	snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
 +	snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
  	is_absolute = path;
  #endif
  	}
 @@ -91,7 +91,7 @@
  	if (fd < 0 && !is_absolute) {
  		qb_util_perror(LOG_ERR, "couldn't open file %s", path);
  
 -		snprintf(path, PATH_MAX, LOCALSTATEDIR "/run/%s", file);
 +		snprintf(path, PATH_MAX, "%s/%s", SOCKETDIR, file);
  		fd = open_mmap_file(path, file_flags);
  		if (fd < 0) {
  			res = -errno;
 
 
 libqb was configured with --with-socket-dir=/var/run/qb, /var/run/qb owned by 
 hacluster:haclient - this configuration works fine with both corosync 2.3.0 
 and pacemaker 1.1.8.
 
 Though I'm not sure that libqb is the right place to touch - maybe it'd be 
 better to add some enhancements to pacemaker's lib/common/mainloop.c,
 mainloop_add_ipc_server() ?
 
 
 Cheers.
 
 
 On Mar 25, 2013, at 16:01 , Andrei Belov defana...@gmail.com wrote:
 
 
 I've rebuilt libqb using separated SOCKETDIR (/var/run/qb), and set 
 hacluster:haclient ownership to this dir.
 
 After that pacemakerd has been successfully started with all its childs:
 
 [root@ha1 /var/run/qb]# pacemakerd -fV Could not establish pacemakerd 
 connection: Connection refused (146)
   info: crm_ipc_connect:  Could not establish pacemakerd connection: 
 Connection refused (146)
   info: get_cluster_type: Detected an active 'corosync' cluster
   info: read_config:  Reading configure for stack: corosync
 notice: crm_add_logfile:  Additional logging available in 
 /var/log/cluster/corosync.log
 notice: main: Starting Pacemaker 1.1.8 (Build: 1f8858c):  ncurses 
 libqb-logging libqb-ipc upstart systemd  corosync-native
   info: main: Maximum core file size is: 18446744073709551613
   info: qb_ipcs_us_publish:   server name: pacemakerd
 notice: update_node_processes:48de70 Node 182452614 now known as 
 ha1, was: 
   info: start_child:  Forked child 60719 for process cib
   info: start_child:  Forked child 60720 for process stonith-ng
   info: start_child:  Forked child 60721 for process lrmd
   info: start_child:  Forked child 60722 for process attrd
   info: 

Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Jacek Konieczny
On Mon, 25 Mar 2013 20:01:28 +0100
Angel L. Mateo ama...@um.es wrote:
 quorum {
  provider: corosync_votequorum
  expected_votes: 2
  two_node: 1
 }
 
 Corosync will then manage quorum for the two-node cluster and
 Pacemaker
 
   I'm using corosync 1.1, which is the one provided with my
 distribution (Ubuntu 12.04). I could also use cman.

I don't think corosync 1.1 can do that, but I guess in this case cman
should be able to provide this functionality.
 
 can use that. You still need proper fencing to enforce the quorum
 (both for pacemaker and the storage layer – dlm in case you use
 clvmd), but no
 extra quorum node is needed.
 
   I have configured a dlm resource used with clvm.
 
   One doubt... With this configuration, how is the split brain problem
 handled?

The first node to notice that the other is unreachable will fence (kill)
the other, making sure it is the only one operating on the shared data.
Even though it is only half of the nodes, the cluster is considered
quorate as the other node is known not to be running any cluster
resources.

When the fenced node reboots, its cluster stack starts, but with no
quorum until it communicates with the surviving node again. So no
cluster services are started there until both nodes communicate
properly and the proper quorum is recovered.
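
A quick way to watch this from the shell, sketched for a corosync 2.x
votequorum setup like the one quoted above:

corosync-quorumtool -s   # expected/total votes and the quorum flags (including 2Node)
crm_mon -1               # one-shot view of membership and where resources run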

Greets,
Jacek



Re: [Pacemaker] OCF Resource agent promote question

2013-03-25 Thread Andreas Kurz
Hi Steve,

On 2013-03-25 18:44, Steven Bambling wrote:
 All,
 
 I'm trying to work on an OCF resource agent that uses PostgreSQL
 streaming replication.  I'm running into a few issues that I hope might
 be answered or at least some pointers given to steer me in the right
 direction.

Why are you not using the existing pgsql RA? It is capable of doing
synchronous and asynchronous replication and it is known to work fine.

Best regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 1.  A quick way of obtaining a list of Online nodes in the cluster
 that a resource will be able to migrate to.  I've accomplished it with
 some grep and sed, but it's not pretty or fast (a rough sketch of what I
 mean for both points follows after point 2).
 
 # time pcs status | grep Online | sed -e 's/.*\[\(.*\)\]/\1/' | sed 's/ //'
 p1.example.net p2.example.net
 
 real0m2.797s
 user0m0.084s
 sys0m0.024s
 
 Once I get a list of active/online nodes in the cluster my thinking was
 to use psql to get the current xlog location and lag of each of the
 remaining nodes and compare them.  If the node has a greater log
 position and/or less lag it will be given a greater master preference.  
 
 2.  How to force a monitor/probe before a promote is run on ALL nodes to
 make sure that the master preference is up to date before
 migrating/failing over the resource.
 - I was thinking that maybe during the promote call it could get the log
 location and lag from each of the nodes via a psql call (like above)
 and then force the resource to a specific node.  Is there a way to do
 this, and does it sound like a sane idea?
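 
 A rough sketch of what I have in mind for both points, assuming the stock
 pacemaker CLI tools and the PostgreSQL 9.x function name (corrections welcome):
 
 # 1. list the nodes in the current partition instead of parsing pcs output
 crm_node -p
 
 # compare replay positions; pg_last_xlog_replay_location() is the 9.x name
 psql -At -c "SELECT pg_last_xlog_replay_location();"
 
 # inside the RA, raise this node's master preference based on that comparison
 crm_master -Q -l reboot -v 100
 
 # 2. force a cluster-wide reprobe so monitor results are fresh before a promote
 crm_resource --reprobe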
 
 
 The start of my RA is located here suggestions and comments 100%
 welcome https://github.com/smbambling/pgsqlsr/blob/master/pgsqlsr
 
 v/r
 
 STEVE
 
 






Re: [Pacemaker] Resource is Too Active (on both nodes)

2013-03-25 Thread Andreas Kurz
On 2013-03-22 21:35, Mohica Jasha wrote:
 Hey,
 
 I have two cluster nodes.
 
 I have a service process which is prone to crash and takes a very long
 time to start. 
 Since the service process takes a long time to start I have the service
 process running on both nodes, but only the active node with the virtual
 IP serves the incoming requests.
 
 On both nodes, I have a cron job which periodically checks if the
 service process is up and if not it starts the service.
 
 I want pacemaker to periodically check if the service is down on the
 active node and if so, it switches the virtual IP to the second node
 (without starting or stopping my service)
 
 I have the following configuration:
 
 primitive clusterIP ocf:heartbeat:IPaddr2 \
 params ip=10.0.1.247 \
 op monitor interval=10s timeout=20s
 
 primitive serviceMonitoring ocf:serviceMonitoring:serviceMonitoring 
 params op monitor interval=10s timeout=20s
 
 colocation HACluster inf: serviceMonitoring clusterIP
 order serviceMonitoring-after-clusterIP inf: clusterIP serviceMonitoring
 
 My serviceMonitoring resource doesn't do anything other than checking
 the state of the service process. I get the following in the log file:
 
 Mar 05 15:07:59 [1543] ha1 pengine:   notice: unpack_rsc_op: Operation
 monitor found resource serviceMonitoring active on ha2
 Mar 05 15:07:59 [1543] ha1 pengine:   notice: unpack_rsc_op: Operation
 monitor found resource serviceMonitoring active on ha1
 Mar 05 15:07:59 [1543] ha1 pengine:error: native_create_actions:
 Resource serviceMonitoring (ocf:: serviceMonitoring) is active on 2
 nodes attempting recovery
 Mar 05 15:07:59 [1543] ha1 pengine:  warning: native_create_actions: See
 http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
 
 So it seems that pacemaker calls the monitor method of the
 serviceMonitoring resource on both nodes.

Yes, it does probe the resources on all nodes ... clone your
serviceMonitoring resource and set it into unmanaged mode, that should
give you the desired behavior ... or simply clone it and let Pacemaker
do the complete management and go without your cron-check-restart magic.
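
A minimal crm shell sketch of the first variant (the clone name is made up, and
any constraints referencing serviceMonitoring would then point at the clone):

crm configure clone cl_serviceMonitoring serviceMonitoring \
    meta is-managed=false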

Regards,
Andreas

 
 Any idea how I can fix this?
 
 Thanks,
 Mohica
 
 
 
 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-25 Thread Andreas Kurz
On 2013-03-22 19:31, John White wrote:
 Hello Folks,
   We're trying to get a corosync/pacemaker instance going on a 4 node 
 cluster that boots via pxe.  There have been a number of state/file system 
 issues, but those appear to be *mostly* taken care of thus far.  We're 
 running into an issue now where cib just isn't staying up with errors akin to 
 the following (sorry for the lengthy dump, note the attrd and cib connection 
 errors).  Any ideas would be greatly appreciated: 
 
 Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG 
 parser context
 Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
 /usr/lib64/heartbeat/attrd 
 Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
 active directory to /var/lib/heartbeat/cores/hacluster
 Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
 Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
 is: 'corosync'
 Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting 
 to cluster infrastructure: corosync
 Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
 connect to the Cluster Process Group API: 2
 Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
 Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
 Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
 Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
 Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
 /usr/lib64/heartbeat/pengine 
 Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
 active directory to /var/lib/heartbeat/cores/hacluster
 Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
 instances of pengine
 Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
 init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

That /var/run/crm directory is available and owned by
hacluster.haclient ... and writable by at least the hacluster user?
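
Something along these lines would confirm it (a sketch, with the ownership and
mode that are usual for pacemaker run directories):

ls -ld /var/run/crm
# if it is missing or wrongly owned:
mkdir -p /var/run/crm
chown hacluster:haclient /var/run/crm && chmod 750 /var/run/crm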

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
 process attrd exited (pid=25841, rc=100)
 Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
 process attrd no longer wishes to be respawned
 Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node 
 n0014.lustre now has process list: 00110312 (was 
 00111312)
 Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
 init_client_ipc_comms_nodispatch: Could not init comms on: 
 /var/run/crm/pengine
 Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
 Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding 
 fd=4 to mainloop
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: 
 Connection to 'corosync': established
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating 
 entry for node n0014.lustre/247988234
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
 n0014.lustre now has id: 247988234
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 
 is now known as n0014.lustre
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
 init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
 Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
 Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: 
 Channel 0x995530 connected: 1 children
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng 
 mainloop
 Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed 
 active directory to /var/lib/heartbeat/cores/hacluster
 Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
 a02c0f19a00c1eb2527ad38f146ebc0834814558
 Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: 
 [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
 #011// A_LOG   
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
 #011// A_STARTUP
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal 
 Handlers
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM 
 objects
 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
 n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 
 proc=00110312 (new)
 Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added 
 signal handler for signal 17
 Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
 #011// A_CIB_START
 

Re: [Pacemaker] pacemaker node stuck offline

2013-03-25 Thread Andreas Kurz
On 2013-03-22 03:39, pacema...@feystorm.net wrote:
 
 On 03/21/2013 11:15 AM, Andreas Kurz wrote:
 On 2013-03-21 14:31, Patrick Hemmer wrote:
 I've got a 2-node cluster where it seems last night one of the nodes
 went offline, and I can't see any reason why.

 Attached are the logs from the 2 nodes (the relevant timeframe seems to
 be 2013-03-21 between 06:05 and 06:10).
 This is on ubuntu 12.04
 
 Looks like your non-redundant cluster-communication was interrupted at
 around that time for whatever reason and your cluster split-brained.
 
 Does the drbd-replication use a different network-connection? If yes,
 why not using it for a redundant ring setup ... and you should use
 STONITH.
 
 I also wonder why you have defined expected_votes='1' in your
 cluster.conf.
 
 Regards,
 Andreas
 But shouldn't it have recovered? The node shows as OFFLINE, even
 though it's clearly communicating with the rest of the cluster. What is
 the procedure for getting the node back online? Anything other than
 bouncing pacemaker?

Looks like the cluster has some trouble trying to rejoin the two DCs
after the split-brain. Try to stop cman/Pacemaker on i-3307d96b and
clean the /var/lib/heartbeat/crm directory there, so that node starts with an
empty configuration and receives the latest updates from i-a706d8ff.
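
Roughly like this (a sketch, with init-script names as on a cman-based setup;
keep a backup of the old CIB rather than deleting it outright):

service pacemaker stop && service cman stop
mv /var/lib/heartbeat/crm /var/lib/heartbeat/crm.bak
mkdir /var/lib/heartbeat/crm && chown hacluster:haclient /var/lib/heartbeat/crm
service cman start && service pacemaker start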

 
 Unfortunately no to the different network connection for drbd. These are
 2 EC2 instances, so redundant connections aren't available. Though since
 it is EC2, I could set up a STONITH to whack the other instance. The
 only problem here would be a race condition. The EC2 api for shutting
 down or rebooting an instance isn't instantaneous. Both nodes could end
 up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)

 
 As for expected_votes=1, it's because it's a two-node cluster. Though I
 apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker cluster,
you can tell pacemaker to ignore loss of quorum.
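
For example, via the crm shell:

crm configure property no-quorum-policy=ignore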

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now






Re: [Pacemaker] DRBD+LVM+NFS problems

2013-03-25 Thread Dennis Jacobfeuerborn
I have now reduced the configuration further and removed LVM from the 
picture. Still the cluster fails when I set the master node to standby.
What's interesting is that things get fixed when I issue a simple 
cleanup for the filesystem resource.
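
For clarity, the cleanup in question is just the following crm shell one-liner,
using the resource name from the config below:

crm resource cleanup p_fs_web1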


This is what my current config looks like:

node nfs1 \
attributes standby=off
node nfs2
primitive p_drbd_web1 ocf:linbit:drbd \
params drbd_resource=web1 \
op monitor interval=15 role=Master \
op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
params device=/dev/drbd0 \
directory=/srv/nfs/web1 fstype=ext4 \
op monitor interval=10s
ms ms_drbd_web1 p_drbd_web1 \
meta master-max=1 master-node-max=1 \
clone-max=2 clone-node-max=1 notify=true
colocation c_web1_on_drbd inf: ms_drbd_web1:Master p_fs_web1
order o_drbd_before_web1 inf: ms_drbd_web1:promote p_fs_web1
property $id=cib-bootstrap-options \
dc-version=1.1.8-7.el6-394e906 \
cluster-infrastructure=classic openais (with plugin) \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1364259713 \
maintenance-mode=false
rsc_defaults $id=rsc-options \
resource-stickiness=100

I cannot figure out what is wrong with this configuration.

Regards,
  Dennis

On 25.03.2013 13:09, Dennis Jacobfeuerborn wrote:

I just found the following in the dmesg output which might or might not
add to understanding the problem:

device-mapper: table: 253:2: linear: dm-linear: Device lookup failed
device-mapper: ioctl: error adding target to table

Regards,
   Dennis

On 25.03.2013 13:04, Dennis Jacobfeuerborn wrote:

Hi,
I'm currently trying to create a two-node redundant NFS setup on CentOS 6.4
using pacemaker and crmsh.

I use this document as a starting point:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html



The first issue is that, using these instructions, I get the cluster up
and running, but the moment I try to stop the pacemaker service on the
current master node, several resources just fail and everything goes
pear-shaped.

Since the problem seems to relate to the nfs bits in the configuration I
removed these in order to get to a minimal working setup and then add
things piece by piece in order to find the source of the problem.

Now I am at a point where I basically have only
DRBD+LVM+Filesystems+IPAddr2 configured and now LVM seems to act up.

I can start the cluster and everything is fine, but the moment I stop
pacemaker on the master I end up with this as a status:

===
Node nfs2: standby
Online: [ nfs1 ]

  Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
  Masters: [ nfs1 ]
  Stopped: [ p_drbd_nfs:1 ]

Failed actions:
 p_lvm_nfs_start_0 (node=nfs1, call=505, rc=1, status=complete):
unknown error
===

and in the log on nfs1 I see:
LVM(p_lvm_nfs)[7515]:2013/03/25_12:34:21 ERROR: device-mapper:
reload ioctl on failed: Invalid argument device-mapper: reload ioctl on
failed: Invalid argument 2 logical volume(s) in volume group nfs now
active

However, lvs in this state shows:
[root@nfs1 ~]# lvs
   LV  VGAttr  LSize   Pool Origin Data%  Move Log
   web1nfs   -wi--   2,00g
   web2nfs   -wi--   2,00g
   lv_root vg_nfs1.local -wi-ao---   2,45g
   lv_swap vg_nfs1.local -wi-ao--- 256,00m

So the volume group is present.

My current configuration looks like this:

node nfs1 \
 attributes standby=off
node nfs2 \
 attributes standby=on
primitive p_drbd_nfs ocf:linbit:drbd \
 params drbd_resource=nfs \
 op monitor interval=15 role=Master \
 op monitor interval=30 role=Slave
primitive p_fs_web1 ocf:heartbeat:Filesystem \
 params device=/dev/nfs/web1 \
   directory=/srv/nfs/web1 \
   fstype=ext4 \
 op monitor interval=10s
primitive p_fs_web2 ocf:heartbeat:Filesystem \
 params device=/dev/nfs/web2 \
   directory=/srv/nfs/web2 \
   fstype=ext4 \
 op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
 params ip=10.99.0.142 cidr_netmask=24 \
 op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
 params volgrpname=nfs \
 op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_web1 p_fs_web2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
 meta master-max=1 \
   master-node-max=1 \
   clone-max=2 \
   clone-node-max=1 \
   notify=true
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
property $id=cib-bootstrap-options \
 dc-version=1.1.8-7.el6-394e906 \
 cluster-infrastructure=classic openais (with plugin) \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1364212090 \
 maintenance-mode=false
rsc_defaults $id=rsc_defaults-options \
 resource-stickiness=100

Any ideas