Re: [Pacemaker] Cannot create more than 27 multistate resources

2014-07-21 Thread Colin Mason
I have run into this issue with more than 55 resources when using 
pcs-0.9.90-2.el6.centos.2.noarch.

Another workaround is to run the pcs command with --debug. Copy the whole XML 
text that pcs is trying to pass to cibadmin into a text file. Then change the 
cibadmin command from --xml-text (-X) to --xml-file (-x) and point it at the 
XML file you just created. That fixed it for me.
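
A minimal sketch of that workaround, assuming the XML printed by pcs --debug
has already been pasted into a file. The -c and -R options are copied from the
cibadmin invocation logged further down in this thread; the file path and the
helper name build_cibadmin_cmd are made up for illustration, and the helper
only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the cibadmin command that reads the CIB XML
# from a file (--xml-file) instead of from the command line (--xml-text).
build_cibadmin_cmd() {
    xml_file=$1
    printf 'cibadmin -c -R --xml-file %s\n' "$xml_file"
}

# Example path only; paste the XML from "pcs --debug" into this file first.
build_cibadmin_cmd /tmp/pcs-cib.xml
```

Review the printed command before running it by hand on a cluster node.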

Colin

-Original Message-
From: Chris Feist [mailto:cfe...@redhat.com] 
Sent: Monday, July 21, 2014 11:14 AM
To: K Mehta
Cc: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Cannot create more than 27 multistate resources

On 07/21/2014 05:22 AM, Andrew Beekhof wrote:
> Chris,
>
> Does the error below mean anything to you?
> This seems to be happening once the CIB reaches a certain size, but is on the 
> client side and possibly before the pacemaker tools are invoked.

I grabbed your debug file and did some tests, and it looks like the issue is 
caused by earlier versions of pcs (0.9.90 is affected), which try to pass the 
entire CIB on the command line to cibadmin. This has been fixed upstream (and 
the fix should be included in the next release of RHEL/CentOS).
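
To illustrate why a large CIB can break this, here is a rough sketch that
checks whether a given number of bytes is small enough to be passed as one
command-line argument. The 131072-byte default is an assumption (the usual
per-argument limit on Linux with 4 KiB pages), and fits_in_one_arg is a
hypothetical helper, not part of pcs or Pacemaker.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a string of SIZE bytes fits in a single
# command-line argument. LIMIT defaults to 131072, an assumed Linux value.
fits_in_one_arg() {
    size=$1
    limit=${2:-131072}
    [ "$size" -lt "$limit" ]
}

# Usage with a saved CIB (the path is an example):
#   size=$(wc -c < /tmp/pcs-cib.xml)
#   fits_in_one_arg "$size" || echo "too large for --xml-text, use --xml-file"
```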

As a workaround, you can use the upstream sources here: 
https://github.com/feist/pcs (just run pcs from the directory that is cloned).

Thanks!
Chris

>
> On 9 Jul 2014, at 6:49 pm, K Mehta  wrote:
>
>> [root@vsanqa11 ~]# pcs resource create 
>> vha-3de5ab16-9917-4b90-93d2-7b04fc71879c 
>> ocf:heartbeat:vgc-cm-agent.ocf 
>> cluster_uuid=3de5ab16-9917-4b90-93d2-7b04fc71879c op monitor 
>> role="Master" interval=30s timeout=100s op monitor role="Master" 
>> interval=30s timeout=100s
>>
>>
>> pcs status output includes
>>   vha-3de5ab16-9917-4b90-93d2-7b04fc71879c   
>> (ocf::heartbeat:vgc-cm-agent.ocf):  Started vsanqa11
>>
>>
>> [root@vsanqa11 ~]# pcs resource master 
>> ms-3de5ab16-9917-4b90-93d2-7b04fc71879c 
>> vha-3de5ab16-9917-4b90-93d2-7b04fc71879c meta clone-max=2 
>> globally-unique=false target-role=started
>> Error: unable to locate command: /usr/sbin/cibadmin
>>
>
>
> Looking in the logs, I see:
>
> Jul 12 11:18:24 vsanqa11 cibadmin[7966]:   notice: crm_log_args: Invoked: 
> /usr/sbin/cibadmin -c -R --xml-text #012   id="cib-bootstrap-options">#012 id="cib-bootstrap-options-dc-version" name="dc-version" 
> value="1.1.10-14.el6_5.2-368c726"/>#012 id="cib-bootstrap-options-cluster-infrastructure" 
> name="cluster-infrastructure" value="cman"/>#012#012 id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" 
> value="ignore"/>#012   name="stoni
>
> But am I right in thinking that this doesn't look like the result of a 
> pcs command?
>
> Kiran: Can you give us more information on the other commands you're running?
>


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



Re: [Pacemaker] Pacemaker error trying to add Apache resource

2013-06-26 Thread Colin Blair
Hi Jake,

Thank you for the info. I was using OCF. LSB worked.

R,
CB

-Original Message-
From: Jake Smith [mailto:jsm...@argotec.com] 
Sent: Wednesday, June 26, 2013 12:08 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker error trying to add Apache resource




- Original Message -
> From: "Colin Blair" 
> To: "The Pacemaker cluster resource manager" 
> 
> Sent: Wednesday, June 26, 2013 10:56:49 AM
> Subject: [Pacemaker] Pacemaker error trying to add Apache resource
> 
> All,
> Couldn't find a solution in the forum. Configuration info:
> 
> Ubuntu 12.04 Server
> Corosync 1.4.2 cman plugin
> Pacemaker 1.1.6
> Apache 2.2.22
> 
> I have an active/passive 2-node cluster running.
> 
> I am receiving the following error when adding the web-server
> resource:
> 
> Failed actions:
> web-server_start_0 (node=funl-pear, call37, rc=1, status=complete):
> unknown error
> web-server_start_0 (node=funl-pear2, call38, rc=1, status=complete):
> unknown error
> 

rc=1 is not an LSB-compliant response to a status check. I assume you are 
using the LSB init script for Apache in your cluster?

Test the init script as indicated here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
and here:
http://oss.clusterlabs.org/pipermail/pacemaker/2010-July/007008.html
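
As a rough illustration of the status part of that test: an LSB-compliant
init script is expected to return 0 from "status" when the service is running
and 3 when it is stopped. The helper name lsb_status_ok and the init script
path in the usage comment are assumptions, not something from this thread.

```shell
#!/bin/sh
# Hypothetical helper: lsb_status_ok STATE RC
# STATE is "running" or "stopped"; RC is the exit code of "<script> status".
# Succeeds only for the codes expected of an LSB-compliant script.
lsb_status_ok() {
    case "$1:$2" in
        running:0|stopped:3) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (script path is an example):
#   /etc/init.d/apache2 stop
#   /etc/init.d/apache2 status; lsb_status_ok stopped $? || echo "not LSB compliant"
```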


Could possibly be related to this too but I doubt it:
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1018171

> -My configfile=/etc/apache2/apache2.conf
> -My server-status is allowed from all and is tested to work.
> -There are no errors in the apache log.
> 
> Event from the corosync.log:
> 
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: update_dc: Unset DC funl-pear2
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_state_transition: State
> transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: update_dc: Set DC to funl-pear2
> (3.0.5)
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_state_transition: State
> transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE
> origin=do_cl_join_finalize_respond ]
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing
> key=5:633:7:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_monitor_0 )
> Jun 26 10:13:28 funl-pear lrmd: [4572]: info: rsc:web-server probe[29] (pid
> 30943)
> Jun 26 10:13:28 funl-pear lrmd: [4572]: info: operation monitor[29] on
> web-server for client 4576: pid 30943 exited with return code 1
> THIS LOOKS FISHY***

Yes it is

> Jun 26 10:13:28 funl-pear crmd: [4576]: info: process_lrm_event: LRM 
> operation web-server_monitor_0 (call=29, rc=1, cib-update=113,
> confirmed=true) unknown error
> 
> Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_lrm_rsc_op:
> Performing key=2:634:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2
> op=web-server_stop_0 )
> Jun 26 10:13:28 funl-pear lrmd: [4572]: info: rsc:web-server stop[30] (pid
> 31010)
> Jun 26 10:13:29 funl-pear lrmd: [4572]: info: RA output:
> (web-server:stop:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 442: kill:
> No such process

This looks odd too: not only is it not running, but there is also the double // 
in the path. But figure out the monitor problem first, then re-evaluate.

Your cluster config would also help...

HTH

Jake

> Jun 26 10:13:29 funl-pear lrmd: [4572]: info: operation stop[30] on web-server
> for client 4576: pid 31010 exited with return code 0
> Jun 26 10:13:29 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation
> web-server_stop_0 (call=30, rc=0, cib-update=114, confirmed=true) ok
> Jun 26 10:13:33 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing
> key=9:636:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_start_0 )
> Jun 26 10:13:33 funl-pear lrmd: [4572]: info: rsc:web-server start[31] (pid
> 31110)
> Jun 26 10:13:36 funl-pear lrmd: [4572]: info: RA output:
> (web-server:start:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 442: kill:
> No such process
> Jun 26 10:13:36 funl-pear lrmd: [4572]: info: operation start[31] on web-server
> for client 4576: pid 31110 exited with return code 1
> Jun 26 10:13:36 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation
> web-server_start_0 (call=31, rc=1, cib-update=115, confirmed=true) unknown error
> Jun 26 10:13:36 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing
> key=2:638:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_stop_0 )
> Jun 26 10:13:36 funl-pear lrmd: [4572]: info: rsc:web-server stop[32] (pid
> 31272)
> Jun 26 10:13:36 funl-pear lrmd: [4572]: info: operation stop[32] on web-server
> for client 4576: pid 31272 exited wit

[Pacemaker] Pacemaker error trying to add Apache resource

2013-06-26 Thread Colin Blair
All,
Couldn't find a solution in the forum. Configuration info:

Ubuntu 12.04 Server
Corosync 1.4.2 cman plugin
Pacemaker 1.1.6
Apache 2.2.22

I have an active/passive 2-node cluster running.

I am receiving the following error when adding the web-server resource:

Failed actions:
web-server_start_0 (node=funl-pear, call37, rc=1, status=complete): unknown 
error
web-server_start_0 (node=funl-pear2, call38, rc=1, status=complete): unknown 
error

-My configfile=/etc/apache2/apache2.conf
-My server-status is allowed from all and is tested to work.
-There are no errors in the apache log.

Event from the corosync.log:

Jun 26 10:13:28 funl-pear crmd: [4576]: info: update_dc: Unset DC funl-pear2
Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_state_transition: State 
transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL 
origin=do_election_count_vote ]
Jun 26 10:13:28 funl-pear crmd: [4576]: info: update_dc: Set DC to funl-pear2 
(3.0.5)
Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_state_transition: State 
transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
origin=do_cl_join_finalize_respond ]
Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing 
key=5:633:7:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_monitor_0 )
Jun 26 10:13:28 funl-pear lrmd: [4572]: info: rsc:web-server probe[29] (pid 
30943)
Jun 26 10:13:28 funl-pear lrmd: [4572]: info: operation monitor[29] on 
web-server for client 4576: pid 30943 exited with return code 1
THIS LOOKS FISHY***
Jun 26 10:13:28 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation 
web-server_monitor_0 (call=29, rc=1, cib-update=113, confirmed=true) unknown 
error

Jun 26 10:13:28 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing 
key=2:634:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_stop_0 )
Jun 26 10:13:28 funl-pear lrmd: [4572]: info: rsc:web-server stop[30] (pid 
31010)
Jun 26 10:13:29 funl-pear lrmd: [4572]: info: RA output: 
(web-server:stop:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 442: kill: 
No such process
Jun 26 10:13:29 funl-pear lrmd: [4572]: info: operation stop[30] on web-server 
for client 4576: pid 31010 exited with return code 0
Jun 26 10:13:29 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation 
web-server_stop_0 (call=30, rc=0, cib-update=114, confirmed=true) ok
Jun 26 10:13:33 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing 
key=9:636:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_start_0 )
Jun 26 10:13:33 funl-pear lrmd: [4572]: info: rsc:web-server start[31] (pid 
31110)
Jun 26 10:13:36 funl-pear lrmd: [4572]: info: RA output: 
(web-server:start:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 442: kill: 
No such process
Jun 26 10:13:36 funl-pear lrmd: [4572]: info: operation start[31] on web-server 
for client 4576: pid 31110 exited with return code 1
Jun 26 10:13:36 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation 
web-server_start_0 (call=31, rc=1, cib-update=115, confirmed=true) unknown error
Jun 26 10:13:36 funl-pear crmd: [4576]: info: do_lrm_rsc_op: Performing 
key=2:638:0:0209fb9d-82d7-488c-9eec-3a2070d4f4b2 op=web-server_stop_0 )
Jun 26 10:13:36 funl-pear lrmd: [4572]: info: rsc:web-server stop[32] (pid 
31272)
Jun 26 10:13:36 funl-pear lrmd: [4572]: info: operation stop[32] on web-server 
for client 4576: pid 31272 exited with return code 0
Jun 26 10:13:36 funl-pear crmd: [4576]: info: process_lrm_event: LRM operation 
web-server_stop_0 (call=32, rc=0, cib-update=116, confirmed=true) ok
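
To pull just the failed operations out of a log like this, a small filter
along these lines may help. It is only a sketch: it assumes each log entry is
on a single line (as in the real corosync.log, not as wrapped in this mail),
and failed_ops is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print "operation rc"
# for every "LRM operation" entry whose rc is not 0.
failed_ops() {
    sed -n 's/.*LRM operation \([^ ]*\) (call=[0-9]*, rc=\([0-9]*\),.*/\1 \2/p' |
        awk '$2 != 0'
}

# Usage (log path is an example):
#   failed_ops < /var/log/corosync.log
```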

Any ideas?
R,
CB

The information contained in this transmission may contain privileged and 
confidential information. 
It is intended only for the use of the person(s) named above. 
If you are not the intended recipient, you are hereby notified that any review, 
dissemination, distribution or duplication of this communication is strictly 
prohibited. 
If you are not the intended recipient, please contact the sender by reply 
e-mail and destroy all copies of the original message. 
Technica Corporation does not represent this e-mail to be free from any virus, 
fault or defect and it is therefore the responsibility of the recipient to 
first scan it for viruses, faults and defects. 
To reply to our e-mail administrator directly, please send an e-mail to 
postmas...@technicacorp.com. Thank you.




[Pacemaker] GPU Processing

2013-06-25 Thread Colin Blair
Andrew,

Does Pacemaker support GPU processes?

R,
CB


[Pacemaker] NIC memory dump on 2nd node Corosync startup

2013-06-19 Thread Colin Blair
All,

Has anyone run into a NIC memory dump when Corosync is started on the second 
node? Corosync starts fine on each node individually. When I start them at the 
same time, the network is flooded and the NICs dump memory.

Ubuntu 11.10
Corosync 1.1.3


Thx,
CB


Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]

2013-06-19 Thread Colin Blair
service cman start.

Thx,
CB


-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Tuesday, June 18, 2013 8:03 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]


On 18/06/2013, at 9:35 PM, Colin Blair  wrote:

> Thank you Andrew. Heads up: according to 
> http://clusterlabs.org/wiki/FAQ#Can_I_use_Pacemaker_with_CMAN.3F
> 
> Can I use Pacemaker with CMAN? 
> 
> Yes. Pacemaker added support for CMAN in version 1.1.5 to better 
> integrate with distros shipping the RHCS cluster stack. This is 
> particularly relevant for those looking to use GFS2 or OCFS2. See the 
> documentation for more details

I appear to have forgotten that.  There have been quite a few improvements to 
that support since then though.
Did you run "service cman start" or "service corosync start"? 

> 
> 
> Can you provide a link to a newer pacemaker package compatible with UBUNTU 
> 11.10 Server?

No. The debian/ubuntu people like to do their own thing.

> 
> R,
> CB
> 
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Monday, June 17, 2013 7:32 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]
> 
> 
> On 18/06/2013, at 3:09 AM, Colin Blair  wrote:
> 
>> All,
>> Newbie here.  I am trying to create a two-node cluster with the following:
>> 
>> Ubuntu Server 11.10
>> Pacemaker 1.1.5
>> Corosync Cluster Engine 1.3.0
>> CMAN
>> 
>> I am unable to start Pacemaker. CMAN seems to run with Corosync fine. I see 
>> this line : write(1, "[FAILED]\r", 9) = 9. Is this a permissions issue?
> 
> No. Pacemaker 1.1.5 didn't yet support cman.  You'll need to get something 
> newer.
> 

Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]

2013-06-18 Thread Colin Blair
Thanks Sven. Unfortunately, I am unable at this time. 

CB

-Original Message-
From: Sven Arnold [mailto:sven.arn...@localite.de] 
Sent: Tuesday, June 18, 2013 3:50 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]

Hi Colin,

> Newbie here.  I am trying to create a two-node cluster with the following:
>
> Ubuntu Server 11.10
>
> Pacemaker 1.1.5
>
> Corosync Cluster Engine 1.3.0
>
> CMAN
>
> I am unable to start Pacemaker. CMAN seems to run with Corosync fine. 
> I see this line : write(1, "[FAILED]\r", 9) = 9. Is this a permissions issue?
>

Any chance to upgrade to Ubuntu 12.04 LTS? There you have pacemaker
1.1.6 included in the distribution. This version works (so far) with cman for 
me.

Best regards,

Sven



Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]

2013-06-18 Thread Colin Blair
Thank you Andrew. Heads up: according to 
http://clusterlabs.org/wiki/FAQ#Can_I_use_Pacemaker_with_CMAN.3F

Can I use Pacemaker with CMAN? 

Yes. Pacemaker added support for CMAN in version 1.1.5 to better integrate with 
distros shipping the RHCS cluster stack. This is particularly relevant for 
those looking to use GFS2 or OCFS2. See the documentation for more details


Can you provide a link to a newer pacemaker package compatible with UBUNTU 
11.10 Server?

R,
CB

-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Monday, June 17, 2013 7:32 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Starting Pacemaker Cluster Manager: [FAILED]


On 18/06/2013, at 3:09 AM, Colin Blair  wrote:

> All,
> Newbie here.  I am trying to create a two-node cluster with the following:
>  
> Ubuntu Server 11.10
> Pacemaker 1.1.5
> Corosync Cluster Engine 1.3.0
> CMAN
>  
> I am unable to start Pacemaker. CMAN seems to run with Corosync fine. I see 
> this line : write(1, "[FAILED]\r", 9) = 9. Is this a permissions issue?

No. Pacemaker 1.1.5 didn't yet support cman.  You'll need to get something 
newer.

>  
> Results of strace service pacemaker start:
>  
> execve("/usr/sbin/service", ["service", "pacemaker", "start"], [/* 21 vars */]) = 0
> brk(0) = 0x10fb000
> access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7cfd6d8000
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
> open("/etc/ld.so.cache", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=22838, ...}) = 0
> mmap(NULL, 22838, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7cfd6d2000
> close(3) = 0
> access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
> open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \24\2\0\0\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1677624, ...}) = 0
> mmap(NULL, 3793768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7cfd11b000
> mprotect(0x7f7cfd2b, 2093056, PROT_NONE) = 0
> mmap(0x7f7cfd4af000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x194000) = 0x7f7cfd4af000
> mmap(0x7f7cfd4b4000, 21352, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7cfd4b4000
> close(3) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7cfd6d1000
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7cfd6cf000
> arch_prctl(ARCH_SET_FS, 0x7f7cfd6cf720) = 0
> mprotect(0x7f7cfd4af000, 16384, PROT_READ) = 0
> mprotect(0x619000, 4096, PROT_READ) = 0
> mprotect(0x7f7cfd6da000, 4096, PROT_READ) = 0
> munmap(0x7f7cfd6d2000, 22838) = 0
> getpid() = 8253
> rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7f7cfd151420}, {SIG_DFL, [], 0}, 8) = 0
> geteuid() = 0
> brk(0) = 0x10fb000
> brk(0x111c000) = 0x111c000
> getppid() = 8252
> stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> stat(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> open("/usr/sbin/service", O_RDONLY) = 3
> fcntl(3, F_DUPFD, 10) = 10
> close(3) = 0
> fcntl(10, F_SETFD, FD_CLOEXEC) = 0
> rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
> rt_sigaction(SIGINT, {0x40f050, ~[RTMIN RT_1], SA_RESTORER, 0x7f7cfd151420}, NULL, 8) = 0
> rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
> rt_sigaction(SIGQUIT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x7f7cfd151420}, NULL, 8) = 0
> rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
> rt_sigaction(SIGTERM, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x7f7cfd151420}, NULL, 8) = 0
> read(10, "#!/bin/sh\n\n#"..., 8192) = 4614
> pipe([3, 4]) = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7cfd6cf9f0) = 8254
> close(4) = 0
> read(3, "service\n", 128) = 8
> read(3, "", 128) = 0
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(3) = 0
> wait4(-

Re: [Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-10-03 Thread Colin McCormack

On 10/02/12 15:49, pacemaker-requ...@oss.clusterlabs.org wrote:

Check out http://clusterlabs.org/rpm-next for the latest pacemaker for
RHEL5 derivatives.

Thank you! No more hangs!

Although I left "crm options user hacluster" in place, my normal Linux user can 
now issue 'crm configure primitive' commands, which it couldn't before.

When I set "crm options user colinlinux" and try to issue 'crm configure 
primitive' commands, it hangs on:
500  32708 32673  0 10:43 pts/10   00:00:00 /bin/sh -c sudo -E -u colinlinux >/dev/null 2>&1 lrmadmin -C

Either way, I suppose I'm happy. Thanks.


This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you are not the intended recipient, please note that any review, dissemination, 
disclosure, alteration, printing, circulation, retention or transmission of 
this e-mail and/or any file or attachment transmitted with it, is prohibited 
and may be unlawful. If you have received this e-mail or any file or attachment 
transmitted with it in error please notify postmas...@openet.com. Although 
Openet has taken reasonable precautions to ensure no viruses are present in 
this email, we cannot accept responsibility for any loss or damage arising from 
the use of this email or attachments.


Re: [Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-10-02 Thread Colin McCormack

Hi Dejan,
I see that's some kind of workaround in the Pacemaker code, but how do I
apply it?
I need a workaround I can use from the terminal/bash,

so that when I run: crm configure...  it won't hang on me.

Cheers again

Col



On 10/02/12 12:04, pacemaker-requ...@oss.clusterlabs.org wrote:

Ah, it's v1.0.x. The workaround is here:

https://github.com/ClusterLabs/pacemaker/commit/dc015e4b9b38ca5a76f36a3245719966082dcdd4

Thanks,

Dejan





Re: [Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-10-02 Thread Colin McCormack

Hi again,

"OK. This seems to be a deficiency in lrmd which got fixed later. But
there was a workaround in crm shell for almost two years (iirc since
pacemaker v1.1.5)."

What was this workaround? Sorry for such low-level questions, but
googling for this isn't very useful: the results are all re-posts from this
mailing list, I think.

"I meant the Pacemaker ACLs. But those are available starting with
Pacemaker v1.1.6."

I'm bound to CentOS 5.x. I did a yum install pacemaker corosync to get
Pacemaker, and the version EPEL installed for me is 1.0.12. Can I
get the latest version? yum update of course had no tagged updates.
Cheers and thanks again

Col



On 10/01/12 10:06, pacemaker-requ...@oss.clusterlabs.org wrote:

On Fri, Sep 28, 2012 at 04:51:36PM +0100, Colin McCormack wrote:

>  Hi Dejan - thanks for taking the time to respond again
>

>  >"Hangs? Wasn't it in the first message that "cibadmin is not

>  available"? If it hangs, then you should check the process list (pstree)
>  to see what the shell is doing at the time and take a look at the logs."
>
>  crm configure...
>  Hangs
>
>  sudo crm configure...
>  cibadmin is not available is issued
>
>  When it hangs this is what i see with a grepped ps:
>
>  500  13710 13677  0 13:19 pts/10   00:00:00 /bin/sh -c sudo -E -u
>  colinlinux>/dev/null 2>&1 lrmadmin -C

OK. This seems to be a deficiency in lrmd which got fixed later.
But there was a workaround in crm shell for almost two years
(iirc since pacemaker v1.1.5).


>  **
>

>  >  "For this, if I understood correctly, you would like to take a look

>  at ACLs. That doesn't require configuring sudo, i.e. the crm shell runs
>  all the time as the real user and the cluster should be instructed by a
>  set of ACL rules about users' rights."
>
>  I haven't configured any ACLs yet - but i have given permissions (as a
>  test) to all of dir /var/lib/heartbeat/crm with no luck

That's not needed actually. And better not to change default
permissions.


>  What directorie(s) should i apply ACLs on?

I meant the Pacemaker ACLs. But those are available starting with
Pacemaker v1.1.6.





Re: [Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-09-28 Thread Colin McCormack

Hi Dejan - thanks for taking the time to respond again

>"Hangs? Wasn't it in the first message that "cibadmin is not
available"? If it hangs, then you should check the process list (pstree)
to see what the shell is doing at the time and take a look at the logs."

crm configure...
Hangs

sudo crm configure...
cibadmin is not available is issued

When it hangs this is what i see with a grepped ps:

500  13710 13677  0 13:19 pts/10   00:00:00 /bin/sh -c sudo -E -u
colinlinux >/dev/null 2>&1 lrmadmin -C

**

> "For this, if I understood correctly, you would like to take a look
at ACLs. That doesn't require configuring sudo, i.e. the crm shell runs
all the time as the real user and the cluster should be instructed by a
set of ACL rules about users' rights."

I haven't configured any ACLs yet - but i have given permissions (as a
test) to all of dir /var/lib/heartbeat/crm with no luck

What directorie(s) should i apply ACLs on?

Thanks

Col





Re: [Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-09-28 Thread Colin McCormack

Hi Lars,

> "This doesn't "allow" the user to configure the cluster, but runs all
commands from crm as this user (even if running as root). I'm not sure
this is very well tested. "
When i then run commands like crm configure under the root user it also
hangs.

> "I have the impression that the user colinlinux doesn't have
/usr/sbin in its path."
I do, see my original mail (but i understand you could have missed it as
it was a large mail)

Thanks for your reply and time taken.

I would like to confirm whether it is reasonable to expect this
behaviour from Pacemaker. The equivalent exists in Veritas Cluster
Server, where certain commands are issued by a 'normal' user who is
trusted to configure the cluster/node.

Thanks again

Col




On 09/27/12 18:07, pacemaker-requ...@oss.clusterlabs.org wrote:

Message: 3
Date: Thu, 27 Sep 2012 16:40:15 +0200
From: Lars Marowsky-Bree
To: The Pacemaker cluster resource manager
 
Subject: Re: [Pacemaker] Can't issue 'crm configure' commands under
 privileged user
Message-ID:<20120927144015.go4...@suse.de>
Content-Type: text/plain; charset=iso-8859-1

On 2012-09-27T14:57:08, Colin McCormack  wrote:


>  I installed pacemaker/corosync as root (details below):
>  Pacemaker version 1.0.12, release 1.el5.centos, x86_64
>  Corosync version 1.2.7, release 1.1.el5, x86_64

You have the user in the haclient group, and thus it should be able to
control the cluster. Perhaps


>  Allow user with privileged access to configure the node:
>  crm options user colinlinux

This doesn't "allow" the user to configure the cluster, but runs all
commands from crm as this user (even if running as root). I'm not sure
this is very well tested.


>  WITH SUDO:
>  colinlinux# sudo crm configure primitive xclock ocf:tester:xclock op monitor 
interval=20 timeout=20 start-delay=30s params run_user=colinlinux meta 
failure-timeout="360" migration-threshold=5
>  error given:
>  # cibadmin not available, check your installation

I have the impression that the user colinlinux doesn't have /usr/sbin in
its path.

If you want to restrict the commands that a non-root user can execute on
the cluster, check out the CIB and the shell's ACL support.


Regards,
 Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde





[Pacemaker] Can't issue 'crm configure' commands under privileged user

2012-09-27 Thread Colin McCormack

Hi,

I can't issue 'crm configure' commands under a designated privileged user (via: 
crm options user priv_user); Pacemaker seems to be configurable only by the 
'root' user. Run with sudo, the command gives this error: 'cibadmin 
not available, check your installation'

Steps taken:

I installed pacemaker/corosync as root (details below):
Pacemaker version 1.0.12, release 1.el5.centos, x86_64
Corosync version 1.2.7, release 1.1.el5, x86_64

Started corosync under root:
service corosync start

Made config changes under root (for single-node setup):
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure property start-failure-is-fatal=false

Allow user with privileged access to configure the node:
crm options user colinlinux

Now when I try to configure a sample xclock & gnome-calculator process 
dependency under my 'privileged user', it just hangs...

colinlinux# crm configure primitive xclock ocf:tester:xclock op monitor interval=20 
timeout=20 start-delay=30s params run_user=colinlinux meta 
failure-timeout="360" migration-threshold=5 (HANGS HERE!)

colinlinux# crm configure primitive gnome-calculator ocf:openet:gnome-calculator op 
monitor interval=60s timeout=60s start-delay=30s op start timeout=90 op stop timeout=60 
params run_user=colinlinux meta failure-timeout="360" migration-threshold=5 
(never executes due to hang above)

WITH SUDO:
colinlinux# sudo crm configure primitive xclock ocf:tester:xclock op monitor interval=20 
timeout=20 start-delay=30s params run_user=colinlinux meta 
failure-timeout="360" migration-threshold=5
error given:
# cibadmin not available, check your installation




Sudoers file:
root          ALL=(ALL)   ALL
colinlinux    ALL=(ALL)   NOPASSWD: ALL

User groups for colinlinux user:
# groups colinlinux
colinlinux : colinlinux haclient

PATH:
PATH=$PATH:$HOME/bin:/usr/sbin:/sbin
#which cibadmin
/usr/sbin/cibadmin

Corosync config file:
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
   version: 2
   secauth: off
   threads: 0
   interface {
      ringnumber: 0
      bindnetaddr: 127.0.0.1
      mcastaddr: 0.0.0.0
      mcastport: 4000
   }
}

logging {
   fileline: off
   to_stderr: no
   to_logfile: yes
   to_syslog: no
   logfile: /tmp/corosync/log/coroLog.log
   debug: on
   timestamp: on
   logger_subsys {
      subsys: AMF
      debug: off
   }
}

amf {
   mode: disabled
}
aisexec {
   user:  root
   group: root
}
service {
   name: pacemaker
   ver: 0
}

Resource files:
See attached (basically the start action starts and returns success - then all 
other actions are dummies and just return success)
But we never get to the start action or any action because the first crm 
command hangs

Log files?:
No activity in the log files.




#!/bin/sh
#
#
#   Incoming variables of the RA for Mediation Server
#   OCF_RESKEY_port - ms port
#   OCF_RESKEY_prod - FW prod path
#   OCF_RESKEY_home - FW home path
#   OCF_RESKEY_run_user - FW user ID
#
###
# Initialization:

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

###

meta_data() {
cat <<END
1.0

displays gnome-calculator
display calc

Username from which the resource action will run from, and 
more importantly the environment it will run in
Username that the resource is run under
END
}

###

ms_usage() {
cat <

#!/bin/sh

###
# Initialization:

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

###

meta_data() {
cat <<END
1.0

displays xclock
displays xclock

Username from which the resource action will run from, and 
more importantly the environment it will run in
Username that the resource is run under
END
}

###

ns_usage() {
cat <

[Pacemaker] Setting the logfile option in corosync.conf to a directory not created causes corosync to fail to start with non-descriptive parse error...

2011-04-08 Thread Colin Hines
Just adding this as an FYI if anyone comes across it...

If the directory for the logfile listed in corosync.conf does not exist,
corosync logs the following errors and fails to start (this is with the
latest RPM-based builds from http://www.clusterlabs.org/rpm/epel-5/):

Apr  8 12:34:24 cvt-db-003 corosync[24350]:   [MAIN  ] Successfully read
main configuration file '/etc/corosync/corosync.conf'.
Apr  8 12:34:24 cvt-db-003 corosync[24350]:   [MAIN  ] parse error in
config: parse error in config: .
Apr  8 12:34:24 cvt-db-003 corosync[24350]:   [MAIN  ] Corosync Cluster
Engine exiting with status 8 at main.c:1397.
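
A small pre-flight check along the following lines could catch this before
corosync is started. It is only a sketch: logfile_dir_exists() is a made-up
helper name, not part of corosync, and it merely tests whether the parent
directory of a given logfile path exists.

```c
#include <assert.h>
#include <libgen.h>     /* dirname() */
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not part of corosync): returns 1 if the parent
 * directory of logfile_path exists and is a directory, 0 otherwise. */
int logfile_dir_exists(const char *logfile_path)
{
    struct stat st;
    size_t size;
    char *copy;
    int ok;

    assert(logfile_path != NULL);

    /* dirname() may modify its argument, so work on a private copy. */
    size = strlen(logfile_path) + 1;
    copy = malloc(size);
    if (copy == NULL) {
        return 0;
    }
    memcpy(copy, logfile_path, size);

    ok = (stat(dirname(copy), &st) == 0) && S_ISDIR(st.st_mode);
    free(copy);
    return ok;
}
```

If the check returns 0 for the path given in the logfile option, creating the
directory first (for example with mkdir -p) should avoid the parse error above.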

c
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Following the clusters from scratch v2 document, and coming up with weird (erroneous?) errors...

2011-04-08 Thread Colin Hines
Okey dokey, I've done some further troubleshooting and started again from
scratch on a new node.  I'm performing this setup on a CentOS 5.5 node.
 Here's an excerpt from my messages file taken after doing a "yum -y install
pacemaker corosync"

Apr  8 11:50:19 cvt-db-003 yum: Updated: bzip2-libs-1.0.3-6.el5_5.x86_64
many packages removed..
Apr  8 11:50:34 cvt-db-003 yum: Installed: corosync-1.2.7-1.1.el5.i386
Apr  8 11:50:34 cvt-db-003 yum: Installed: corosynclib-1.2.7-1.1.el5.x86_64
Apr  8 11:50:34 cvt-db-003 yum: Installed:
pacemaker-libs-1.0.10-1.4.el5.x86_64
Apr  8 11:50:34 cvt-db-003 yum: Installed: corosync-1.2.7-1.1.el5.x86_64
Apr  8 11:50:35 cvt-db-003 yum: Installed:
heartbeat-stonith-2.1.4-11.el5.x86_64
Apr  8 11:50:35 cvt-db-003 yum: Installed: pacemaker-1.0.10-1.4.el5.i386
Apr  8 11:50:35 cvt-db-003 yum: Updated: rpm-libs-4.4.2.3-20.el5_5.1.x86_64
Apr  8 11:50:35 cvt-db-003 yum: Updated: rpm-4.4.2.3-20.el5_5.1.x86_64
Apr  8 11:50:35 cvt-db-003 yum: Updated:
rpm-python-4.4.2.3-20.el5_5.1.x86_64
Apr  8 11:50:36 cvt-db-003 yum: Installed: pacemaker-1.0.10-1.4.el5.x86_64
Apr  8 11:50:39 cvt-db-003 cl_status: [18858]: ERROR: Cannot signon with
heartbeat
Apr  8 11:50:39 cvt-db-003 cl_status: [18858]: ERROR: REASON: hb_api_signon:
Can't initiate connection  to heartbeat
Apr  8 11:50:39 cvt-db-003 cl_status: [18859]: ERROR: Cannot signon with
heartbeat
Apr  8 11:50:39 cvt-db-003 cl_status: [18859]: ERROR: REASON: hb_api_signon:
Can't initiate connection  to heartbeat
Apr  8 11:51:39 cvt-db-003 cl_status: [18971]: ERROR: Cannot signon with
heartbeat
...many more follow


What's weird to me is that I hadn't started ANY services or run any commands
by this point, I'm thinking something in the RPM is kicking off that
cl_status command.

I believe I've determined that when rpm package
heartbeat-3.0.3-2.3.el5.x86_64.rpm is installed, that's when the errors
start occurring.  It seems like that is a required dependency for the latest
pacemaker RPM on http://www.clusterlabs.org/rpm/epel-5/.  I removed the
pacemaker and heartbeat packages using yum, and then re-added them via RPMs,
but found out that pacemaker requires the heartbeat-libs package or tools
such as crm_verify fail.  Following re-install of heartbeat-libs, pacemaker,
and pacemaker-libs with --nodeps, the spurious error messages are gone.  I can
break/fix the issue by installing and removing
the heartbeat-3.0.3-2.3.el5.x86_64 package.

c


On Fri, Apr 8, 2011 at 9:48 AM, Lars Ellenberg wrote:

> On Fri, Apr 08, 2011 at 09:13:45AM +0200, Andrew Beekhof wrote:
> > On Thu, Apr 7, 2011 at 11:48 PM, Colin Hines 
> wrote:
> > > I've recently followed the clusters from scratch v2 document for RHEL
> and
> > > although my cluster works and fails over correctly using corosync, I
> have
> > > the following error message coming up in my logs consistently, twice a
> > > minute:
> > > Apr  7 17:44:41 cvt-db-005 cl_status: [5901]: ERROR: Cannot signon with
> > > heartbeat
> > > Apr  7 17:44:41 cvt-db-005 cl_status: [5901]: ERROR: REASON:
> hb_api_signon:
> > > Can't initiate connection  to heartbeat
> >
> > Someone/something is running cl_status.
> > Find out who/what and stop them - it has no place in a corosync based
> cluster.
>
> That could be the status action of the SBD stonith plugin,
> between commits
> http://hg.linux-ha.org/glue/rev/faada7f3d069 (Apr 2010)
> http://hg.linux-ha.org/glue/rev/1448deafdf79 (May 2010)
>
> if so, upgrade your "cluster glue".
>
> > > I can send my configs, but they're pretty vanilla, has anyone seen
> anything
> > > like this before.   I did have a heartbeat installation on this host
> before
> > > I followed the CFSv2 document, but heartbeat is stopped and I've
> verified
> > > that cl_status doesn't output those errors if I stop corosync.
> > > c
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>


[Pacemaker] Following the clusters from scratch v2 document, and coming up with weird (erroneous?) errors...

2011-04-07 Thread Colin Hines
I've recently followed the clusters from scratch v2 document for RHEL and
although my cluster works and fails over correctly using corosync, I have
the following error message coming up in my logs consistently, twice a
minute:

Apr  7 17:44:41 cvt-db-005 cl_status: [5901]: ERROR: Cannot signon with
heartbeat
Apr  7 17:44:41 cvt-db-005 cl_status: [5901]: ERROR: REASON: hb_api_signon:
Can't initiate connection  to heartbeat

I can send my configs, but they're pretty vanilla. Has anyone seen anything
like this before?  I did have a heartbeat installation on this host before
I followed the CFSv2 document, but heartbeat is stopped and I've verified
that cl_status doesn't output those errors if I stop corosync.

c


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Colin
On Mon, Jan 18, 2010 at 11:52 AM, Florian Haas  wrote:
>
> the current approach is to utilize 2 Pacemaker clusters, each highly
> available in its own right, and employing manual failover. As described
> here:

Thanks for the pointer! Perhaps "site" is not quite the correct term
for our setup, where we still have (multiple) Gbit-or-faster ethernet
links, think fire areas, at most in adjacent buildings.

For the next step up, two geographically different sites, I agree that
manual failover is more appropriate, but we feel that our case of the
fire areas should still be handled automatically…(?)

Can anybody judge how difficult it would be to integrate some kind of
quorum-support into the cluster? (All cluster nodes attempt a quorum
reservation; the node that gets it, has 1.5 or 2 votes towards the
quorum, rather than just one; this would ensure continued operation in
the case of a) a fire area losing power, b) the separate quorum-server
failing, or c) the cross-fire-area cluster-interconnects failing (but
not more than one failure at a time)…)
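
The vote arithmetic of this idea can be sketched as follows. This is only an
illustration of the counting, under the assumption that quorum means strictly
more than half of all possible votes; partition_has_quorum() and its
parameters are invented names, not an existing Pacemaker or corosync
interface.

```c
#include <assert.h>

/* Votes are counted in half-vote units so that "1.5 votes" stays an integer.
 * Every node contributes 2 half-votes; the holder of the quorum reservation
 * contributes bonus_half_votes extra (1 for "1.5 votes", 2 for "2 votes").
 * All names are illustrative assumptions, not a real cluster API. */
int partition_has_quorum(int nodes_in_partition, int total_nodes,
                         int holds_reservation, int bonus_half_votes)
{
    int partition_votes;
    int total_votes;

    assert(nodes_in_partition >= 0 && nodes_in_partition <= total_nodes);
    assert(bonus_half_votes >= 0);

    partition_votes = 2 * nodes_in_partition
                      + (holds_reservation ? bonus_half_votes : 0);
    total_votes = 2 * total_nodes + bonus_half_votes;

    /* Quorum: strictly more than half of all possible votes. */
    return 2 * partition_votes > total_votes;
}
```

With two nodes per fire area (four in total) and a bonus of half a vote, the
area holding the reservation has quorum and the other one does not; if the
quorum-server is down and nobody holds the reservation, all four nodes
together still have quorum.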

Regards, Colin

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Colin
Hi All,

we are currently looking at nearly the same issue, in fact I just
wanted to start a similarly titled thread when I stumbled over these
messages…

The setup we are evaluating is actually a 2*N-node-cluster, i.e. two
slightly separated sites with N nodes each. The main difference to an
N-node-cluster is that a failure of one of the two groups of nodes
must be considered a single failure event [against which the cluster
must protect, e.g. loss of power at one site].

As far as I gather from this, and other, mail threads, there is
currently no out-of-the-box quorum-something solution for pacemaker.
Before I start digging deeper [into possible solutions], there's one
question I need to ask:

In a pacemaker + corosync setup, who decides whether a partition has
quorum? I.e, would a quorum-device mechanism need to be integrated
with corosync, or with pacemaker, or with both?

Thanks, Colin



Re: [Pacemaker] Remote Access not Working

2009-12-10 Thread Colin
On Thu, Dec 10, 2009 at 2:00 PM, Andrew Beekhof  wrote:
> On Fri, Nov 27, 2009 at 10:54 AM, Colin  wrote:
>> On Mon, Nov 23, 2009 at 9:59 AM, Colin  wrote:
>>> On Fri, Nov 20, 2009 at 8:05 PM, Andrew Beekhof  wrote:
>>>> On Fri, Nov 20, 2009 at 12:36 PM, Andrew Beekhof  
>>>> wrote:
>>>>> Remote notifications should work, I'll test that today.
>>>>
>>>> As of http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/a6d70b1b479d
>>>> they finally work for clear-text connections.
>>>
>>> Downloading ... Compiling ... Testing ... Success!
>>>
>>> (Although there's still the following message from crm_mon:
>>> "Notification setup failed, won't be able to reconnect after failure",
>>> it does seem to hang on and update itself correctly when the CIB
>>> changes...)
>>
>> On my other test cluster, with 32bit systems, the notification does
>> not work, i.e. crm_mon gives me the correct status and then doesn't
>> ever update.
>
> Very odd.  Client and host were both 32-bit?

AFAIR yes, one testing cluster has hardware that isn't even 64bit capable.

(Would you expect problems between mixed hosts?)

Colin



Re: [Pacemaker] How to delete a resource

2009-12-07 Thread Colin
On Mon, Dec 7, 2009 at 12:27 PM, Andrew Beekhof  wrote:
> On Mon, Dec 7, 2009 at 12:14 PM, Colin  wrote:
>> On Mon, Dec 7, 2009 at 12:07 PM, Andrew Beekhof  wrote:
>>> On Mon, Dec 7, 2009 at 11:10 AM, Colin  wrote:
>>>>>  # crm configure delete 
>>>>
>>>> Thanks, that did the trick — it recursively deletes everything
>>>> connected to the resource.
>>>>
>>>> Wonder why crm_resource --delete doesn't do the same thing…
>>>
>>> It's not trying to be clever.
>>> It does only what you ask it to do.
>>
>> So what is the conceptual difference between asking to delete a
>> resource via "crm configure delete" and via "crm_resource --delete"?
>
> One magically deletes everything.
> The other just tries to delete what you tell it to.

Let me rephrase that question: Why do two interfaces for one and the
same thing behave differently? There must be some conceptual
rationale...

>> (Is there any case where the latter will actually work?)
>
> Yes, when there are no constraints referencing the resource.

Ok, it's the constraints, not the dynamic part of the cib.

Thanks, Colin



Re: [Pacemaker] How to delete a resource

2009-12-07 Thread Colin
On Mon, Dec 7, 2009 at 12:07 PM, Andrew Beekhof  wrote:
> On Mon, Dec 7, 2009 at 11:10 AM, Colin  wrote:
>>>  # crm configure delete 
>>
>> Thanks, that did the trick — it recursively deletes everything
>> connected to the resource.
>>
>> Wonder why crm_resource --delete doesn't do the same thing…
>
> It's not trying to be clever.
> It does only what you ask it to do.

So what is the conceptual difference between asking to delete a
resource via "crm configure delete" and via "crm_resource --delete"?
(Is there any case where the latter will actually work?)

Thanks, Colin



Re: [Pacemaker] How to delete a resource

2009-12-07 Thread Colin
On Mon, Dec 7, 2009 at 11:06 AM, Michael Schwartzkopff
 wrote:
> Am Montag, 7. Dezember 2009 10:53:46 schrieb Colin:
>> Hi,
>>
>> when trying to delete a resource, either by replacing the whole
>> ""-part of the CIB with cibadmin with a new version where
>> some resources are missing, or by using a "crm_resource -t primitive
>> --resource name --delete", I get the following error:
>>
>> Error performing operation: Update does not conform to the configured
>> schema/DTD
>
> well, you need to tell the cluster WHAT resource you want to delete. Please
> enter the name of the resource after the -r
>
> crm_resource -D -t primitive -r 
>
> or do you have a resource with the ID "--delete"?

Above I wrote "crm_resource -t primitive --resource name --delete", with
the implication that I inserted the actual resource name on the actual
command line [but that didn't work for me].

Regards, Colin



Re: [Pacemaker] How to delete a resource

2009-12-07 Thread Colin
On Mon, Dec 7, 2009 at 11:03 AM, Tim Serong  wrote:
> On 12/7/2009 at 08:53 PM, Colin  wrote:
>> Hi,
>>
>> when trying to delete a resource, either by replacing the whole
>> ""-part of the CIB with cibadmin with a new version where
>> some resources are missing, or by using a "crm_resource -t primitive
>> --resource name --delete", I get the following error:
>>
>> Error performing operation: Update does not conform to the configured
>> schema/DTD
>>
>> Now since the error doesn't tell me where the problem is, I can only
>> guess that the problem is that other, dynamic parts of the CIB still
>> "reference" the resource, and the schema prevents "dangling
>> references". So if these methods don't work, and the "crm"-shell
>> doesn't have a "delete" for resources, is there an official and simple
>> way to delete a resource?
>
> This should do it:
>
>  # crm configure delete 

Thanks, that did the trick — it recursively deletes everything
connected to the resource.

Wonder why crm_resource --delete doesn't do the same thing…

Regards, Colin



[Pacemaker] How to delete a resource

2009-12-07 Thread Colin
Hi,

when trying to delete a resource, either by replacing the whole
""-part of the CIB with cibadmin with a new version where
some resources are missing, or by using a "crm_resource -t primitive
--resource name --delete", I get the following error:

Error performing operation: Update does not conform to the configured schema/DTD

Now since the error doesn't tell me where the problem is, I can only
guess that the problem is that other, dynamic parts of the CIB still
"reference" the resource, and the schema prevents "dangling
references". So if these methods don't work, and the "crm"-shell
doesn't have a "delete" for resources, is there an official and simple
way to delete a resource?

(Otherwise I need to shutdown the cluster on all nodes, trash the cib,
and configure from scratch.)

Thanks, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-27 Thread Colin
On Mon, Nov 23, 2009 at 9:59 AM, Colin  wrote:
> On Fri, Nov 20, 2009 at 8:05 PM, Andrew Beekhof  wrote:
>> On Fri, Nov 20, 2009 at 12:36 PM, Andrew Beekhof  wrote:
>>> Remote notifications should work, I'll test that today.
>>
>> As of http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/a6d70b1b479d
>> they finally work for clear-text connections.
>
> Downloading ... Compiling ... Testing ... Success!
>
> (Although there's still the following message from crm_mon:
> "Notification setup failed, won't be able to reconnect after failure",
> it does seem to hang on and update itself correctly when the CIB
> changes...)

On my other test cluster, with 32bit systems, the notification does
not work, i.e. crm_mon gives me the correct status and then doesn't
ever update.

Colin



Re: [Pacemaker] Remote Access not Working

2009-11-23 Thread Colin
>> (Although there's still the following message from crm_mon:
>> "Notification setup failed, won't be able to reconnect after failure",
>> it does seem to hang on and update itself correctly when the CIB
>> changes...)
>
> Eventually I'll implement that functionality too and the message will go away.

Then the next Cool Thing would be to support multiple CIB_servers and
use the first one that a connection can be made to.

Hm.

Or do other people use a clustered IP address for remote
administration, together with e.g. some iptables forwarding?

Regards, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-23 Thread Colin
On Fri, Nov 20, 2009 at 8:05 PM, Andrew Beekhof  wrote:
> On Fri, Nov 20, 2009 at 12:36 PM, Andrew Beekhof  wrote:
>> Remote notifications should work, I'll test that today.
>
> As of http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/a6d70b1b479d
> they finally work for clear-text connections.

Downloading ... Compiling ... Testing ... Success!

(Although there's still the following message from crm_mon:
"Notification setup failed, won't be able to reconnect after failure",
it does seem to hang on and update itself correctly when the CIB
changes...)

Thanks a lot, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-20 Thread Colin
PS: I believe this CRM_ASSERT() in lib/common/remote.c can never trigger.

if(encrypted) {
#ifdef HAVE_GNUTLS_GNUTLS_H
    reply = cib_recv_tls(session);
#else
    CRM_ASSERT(encrypted == FALSE);
#endif
} else {



Re: [Pacemaker] Remote Access not Working

2009-11-20 Thread Colin
On Fri, Nov 20, 2009 at 12:36 PM, Andrew Beekhof  wrote:
> On Fri, Nov 20, 2009 at 11:17 AM, Colin  wrote:
>> - The assumption that a partial read (wrt. the buffer) signals no more
>> data is IMO not valid.
>
> It is if you didn't get a signal.

What if the number of payload bytes per IP packet is not a multiple of
the third argument to recv(), and you have a slow connection? This is
TCP, so the data can arrive at any rate, fast or slow. And since TCP
lacks any kind of implicit record markers (unlike UDP or SCTP, which
have them), you normally have to look at the data to know when you're
done reading... At least that's my current understanding of [the
shortcomings of the stream-abstraction provided by] TCP.
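
To make the point concrete, here is a minimal sketch of a read loop that does
not treat a short read as end-of-message. It assumes the simplest possible
framing, namely that the sender closes (or shuts down) the connection after
the message; a protocol that keeps the connection open would need a length
prefix or a terminator in the data instead. recv_until_closed() is an
illustrative name, not Pacemaker code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Illustrative helper, not Pacemaker code: read from a stream socket until
 * the peer closes the connection. Returns a NUL-terminated malloc()ed buffer
 * and stores the payload length in *out_len, or returns NULL on error. */
char *recv_until_closed(int sock, size_t *out_len)
{
    size_t cap = 512;
    size_t len = 0;
    char *buf = malloc(cap);

    assert(out_len != NULL);
    if (buf == NULL) {
        return NULL;
    }

    for (;;) {
        /* Leave one byte free for the terminating NUL. */
        ssize_t rc = recv(sock, buf + len, cap - len - 1, 0);

        if (rc == 0) {              /* orderly shutdown: no more data */
            break;
        } else if (rc < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal, retry */
            }
            free(buf);
            return NULL;
        }

        /* A short read only means "this is what has arrived so far". */
        len += (size_t)rc;

        if (len == cap - 1) {       /* buffer full: double the capacity */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }

    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

Doubling the buffer instead of growing it by a fixed 512 bytes keeps the total
copying cost linear in the message size.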

> But I agree the code needs a cleanup.
>
> I went with: http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/5acf9f2e9c9e

Great, I'll set up Mercurial and then I'll test it.

>> And that is as far as I can get with crm_mon, as it doesn't support
>> continuous update via remote access?
>>
>> static int cib_remote_set_connection_dnotify(
>>    cib_t *cib, void (*dnotify)(gpointer user_data))
>> {
>>    return cib_NOTSUPPORTED;
>> }
>
> No, that's something else.
> Remote notifications should work, I'll test that today.

Right, this function does not seem to get used. With:

if(full) {
    crm_debug_3("Full connect: start");
    if(rc == cib_ok) {
        crm_debug_3("Full connect: dnotify");
        rc = cib->cmds->set_connection_dnotify(cib, mon_cib_connection_destroy);
    }

    if(rc == cib_ok) {
        crm_debug_3("Full connect: callback");
        cib->cmds->del_notify_callback(cib, T_CIB_DIFF_NOTIFY, crm_diff_update);
        rc = cib->cmds->add_notify_callback(cib, T_CIB_DIFF_NOTIFY, crm_diff_update);
    }

    if(rc != cib_ok) {
        print_as("Notification setup failed, could not monitor CIB actions");
        if(as_console) { sleep(2); }
        clean_up(-rc);
    }
}

the output of 'tools/.libs/crm_mon -VVVNrf' finishes with:

Migration summary:
* Node cluster1:
crm_mon[21188]: 2009/11/20_12:51:58 debug: debug3:
cleanup_calculations: deleting resources
crm_mon[21188]: 2009/11/20_12:51:58 debug: debug3:
cleanup_calculations: deleting actions
crm_mon[21188]: 2009/11/20_12:51:58 debug: debug3:
cleanup_calculations: deleting nodes
crm_mon[21188]: 2009/11/20_12:51:58 debug: debug3: cib_connect: Full
connect: start
crm_mon[21188]: 2009/11/20_12:51:58 debug: debug3: cib_connect: Full
connect: dnotify
crm_mon[21188]: 2009/11/20_12:51:58 debug: cib_remote_signoff: Signing
out of the CIB Service
crm_mon[21188]: 2009/11/20_12:51:58 WARN: cib_remote_free: Freeing CIB
Notification setup failed, could not monitor CIB
actionscluster1:~/Pacemaker-my# fg

Side note: Now I often get two password prompts?!?

cluster1:~/Pacemaker-my# tools/.libs/crm_mon -VNrf
Attempting connection to the cluster...Password:Password:

Thanks, Colin

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Remote Access not Working

2009-11-20 Thread Colin
Hi,

this is looking better again: A remote "cibadmin -Q" is now doing the
right thing, however a remote "crm_mon" is still _not_ working
correctly.

Let's see, now that I should know where to look ... the function
cib_recv_plaintext() in lib/common/remote.c looks a bit suspicious to
me:

- The "if (len == 0)" check will never be true because len is
initialised to 512 and then only grows.
- The assumption that a partial read (wrt. the buffer) signals no more
data is IMO not valid.

With the following patch I can at least get a "crm_mon -1rf" to do the
right thing:

diff -ur Pacemaker-1-0-f7a8250d23fc/lib/common/remote.c Pacemaker-my/lib/common/remote.c
--- Pacemaker-1-0-f7a8250d23fc/lib/common/remote.c  2009-11-19 21:12:53.0 +0100
+++ Pacemaker-my/lib/common/remote.c  2009-11-20 10:52:36.0 +0100
@@ -220,33 +220,29 @@
 char*
 cib_recv_plaintext(int sock)
 {
-    int last = 0;
     char* buf = NULL;
-    int chunk_size = 512;
-    int len = chunk_size;
+    ssize_t buf_size = 512;
+    ssize_t len = 0;

-    crm_malloc0(buf, chunk_size);
+    crm_malloc0(buf, buf_size);

     while(1) {
-        int rc = recv(sock, buf+last, chunk_size, 0);
+        ssize_t rc = recv(sock, buf+len, buf_size-len, 0);
         if (rc == 0) {
             if(len == 0) {
                 goto bail;
             }
             return buf;

-        } else if(rc > 0 && rc < chunk_size) {
-            return buf;
-
-        } else if(rc == chunk_size) {
-            last = len;
-            len += chunk_size;
-            crm_realloc(buf, len);
-            CRM_ASSERT(buf != NULL);
+        } else if(rc > 0) {
+            len += rc;
+            if (len == buf_size) {
+                crm_realloc(buf, buf_size += 512);  /* Should do exponential growth for amortized constant time? */
+                CRM_ASSERT(buf != NULL);
+            }
         }
-
         if(rc < 0 && errno != EINTR) {
-            crm_perror(LOG_ERR,"Error receiving message: %d", rc);
+            crm_perror(LOG_ERR,"Error receiving message: %d", (int)rc);
             goto bail;
         }
     }
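
A minimal sketch of the receive loop the patch is moving towards, written
against plain POSIX calls rather than Pacemaker's crm_malloc0/crm_realloc
macros. The function name recv_until_eof and the doubling strategy are
assumptions for illustration only. It treats a return value of 0 from
recv() as the only "done" signal, so it only works if the peer closes or
shuts down its side after sending, which may not hold for the actual CIB
remote protocol:

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read from sock until the peer closes the connection.
 * Returns a NUL-terminated heap buffer (caller frees), or NULL on error or
 * when no data arrived at all. *out_len receives the number of bytes read. */
char *recv_until_eof(int sock, size_t *out_len)
{
    size_t cap = 512;   /* current buffer capacity */
    size_t len = 0;     /* bytes received so far */
    char *buf = malloc(cap);

    if (buf == NULL) {
        return NULL;
    }

    for (;;) {
        /* Always keep one byte free for the terminating NUL. */
        ssize_t rc = recv(sock, buf + len, cap - len - 1, 0);

        if (rc > 0) {
            len += (size_t)rc;
            if (len == cap - 1) {
                /* Buffer full: double it, so the total copying cost stays
                 * linear in the message size. */
                char *tmp = realloc(buf, cap * 2);
                if (tmp == NULL) {
                    free(buf);
                    return NULL;
                }
                buf = tmp;
                cap *= 2;
            }
        } else if (rc == 0) {
            break;          /* orderly shutdown by the peer */
        } else if (errno != EINTR) {
            free(buf);      /* real error */
            return NULL;
        }
        /* rc < 0 with errno == EINTR: retry the recv() */
    }

    if (len == 0) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (out_len != NULL) {
        *out_len = len;
    }
    return buf;
}
```

A short read is simply appended and the loop continues, which avoids the
"partial read means end of message" assumption criticised above.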

And that is as far as I can get with crm_mon, as it doesn't support
continuous updates via remote access?

static int cib_remote_set_connection_dnotify(
cib_t *cib, void (*dnotify)(gpointer user_data))
{
return cib_NOTSUPPORTED;
}


Regards, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-19 Thread Colin
On Thu, Nov 19, 2009 at 8:31 PM, Andrew Beekhof  wrote:
> Fixed the plaintext connections and made a couple of the changes you 
> suggested.
>
> http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/971d8989e9f0

That's great, thanks!

/me is off to compile Pacemaker.

Colin



Re: [Pacemaker] Remote Access not Working

2009-11-16 Thread Colin
On Mon, Nov 16, 2009 at 4:42 PM, Andrew Beekhof  wrote:
> On Mon, Nov 16, 2009 at 4:31 PM, Colin  wrote:
>>
>> On Mon, Nov 16, 2009 at 3:19 PM, Andrew Beekhof  wrote:
>>> On Thu, Nov 12, 2009 at 4:46 PM, Colin  wrote:
>>>> On Thu, Nov 12, 2009 at 3:36 PM, Andrew Beekhof  wrote:
>>>
>>>> 5) The log message "cib: [2941]: debug: cib_remote_listen: New
>>>> clear-text connection" should include from where the connection came.
>>>
>>> why and how?
>>
>> Why: It's like "file not found" without the info which file wasn't
>> found ... perhaps it's just me, but I would like to see the source IP
>> and port of the connection.
>>
>> How: You're probably not asking me how to implement the feature, so
>> I'm assuming that you misunderstood what exactly I was asking for(?).
>
> No, I'm saying that I'm pretty sure we don't have access to the IP 
> information.

In cib/remote.c the call to accept(2) which fills in the data
structure with the IP is just 2 lines after the call to crm_debug(),
is it a problem to change the order?
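
A minimal sketch of what including the source of the connection could
look like once accept(2) has filled in the address structure. The helper
name format_peer is hypothetical, only IPv4 is handled, and nothing here
is taken from cib/remote.c:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: format an IPv4 peer address, as filled in by
 * accept(2), as "ip:port". Returns 0 on success, -1 if the address is not
 * AF_INET or the output buffer is too small. */
int format_peer(const struct sockaddr_in *addr, char *out, size_t outlen)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    if (addr->sin_family != AF_INET) {
        return -1;
    }
    if (inet_ntop(AF_INET, &addr->sin_addr, ip, sizeof ip) == NULL) {
        return -1;
    }
    /* sin_port is in network byte order, so convert before printing. */
    n = snprintf(out, outlen, "%s:%u", ip, (unsigned)ntohs(addr->sin_port));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting string could then be appended to the existing "New
clear-text connection" debug message, provided the log call is moved
after the accept(2) call.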

>>>> 6) The log message "cib: [2941]: ERROR: cib_remote_listen: User is not
>>>> a member of the required group" might mention which user and which
>>>> group...
>>>
>>> it doesn't do so for security reasons
>>
>> Hm.
>>
>> Security? I see, that's when you use unencrypted remote syslogging --
>> anybody already on the machine could just use ps(1).
>>
>> How about logging it in the ERROR messages, but only when
>> debug-logging is enabled?
>
> No, because then I'll get confused emails from people wondering why
> there are a stream of ERRORs in the logs.

Erm, I don't want to change the frequency or the level of any message,
just that the one ERROR message quoted above is changed in content to
include the uid/user and gid/group to which it refers when
debug-logging is enabled.

>> Weird. I'm using the precompiled Debian packages for Pacemaker 1.0.6
>> with Corosync. Anything that might help debug the problem?
>
> add more hours to the day? :)

One-way ticket to Mars help?

Colin ;-)



Re: [Pacemaker] Remote Access not Working

2009-11-16 Thread Colin
Hi Andrew,

thanks for your response!

On Mon, Nov 16, 2009 at 3:19 PM, Andrew Beekhof  wrote:
> On Thu, Nov 12, 2009 at 4:46 PM, Colin  wrote:
>> On Thu, Nov 12, 2009 at 3:36 PM, Andrew Beekhof  wrote:
>
>> 5) The log message "cib: [2941]: debug: cib_remote_listen: New
>> clear-text connection" should include from where the connection came.
>
> why and how?

Why: It's like "file not found" without the info which file wasn't
found ... perhaps it's just me, but I would like to see the source IP
and port of the connection.

How: You're probably not asking me how to implement the feature, so
I'm assuming that you misunderstood what exactly I was asking for(?).

>> 6) The log message "cib: [2941]: ERROR: cib_remote_listen: User is not
>> a member of the required group" might mention which user and which
>> group...
>
> it doesn't do so for security reasons

Hm.

Security? I see, that's when you use unencrypted remote syslogging --
anybody already on the machine could just use ps(1).

How about logging it in the ERROR messages, but only when
debug-logging is enabled?

>> 8) Just tried with crm_resource: The password prompt when not setting
>> CIB_password is sent to stdout, rather than stderr [which makes it
>> near impossible to send the output someplace].
>
> we can probably change that

That'd be great, also because the new behaviour would be more in-line
with what many other command line programs do...

>> 9) I am getting completely bogus results via the remote connection,
>> e.g. "crm_resource --list" shows only 2 of 8 resources, and shows them
>> as stopped, whereas on the cluster nodes I see the -- correct -- list
>> with 8 resources which are all started. With "cibadmin -Q" I get:
>>
>> # cibadmin -Q | wc  # on a cluster node
>>    379    1895   50474
>>
>> # cibadmin -Q | wc  # via the remote connection
>> cibadmin: Opened connection to 192.168.80.10:6900
>>     66     193    4731
>
> someone else mentioned that, i've not been able to reproduce it yet.

Weird. I'm using the precompiled Debian packages for Pacemaker 1.0.6
with Corosync. Anything that might help debug the problem?

r...@cluster1:~# tail -f /var/log/daemon.log
Nov 16 15:53:33 cluster1 cib: [24749]: debug: cib_remote_listen: New clear-text connection
Nov 16 15:53:34 cluster1 cib: [24749]: info: log_data_element: cib_remote_listen: Login:
Nov 16 15:53:34 cluster1 cib: [24749]: debug: cib_remote_listen: New clear-text connection
Nov 16 15:53:35 cluster1 cib: [24749]: info: log_data_element: cib_remote_listen: Login:
Nov 16 15:53:35 cluster1 corosync[7426]:   [TOTEM ] mcasted message added to pending queue
[... more corosync messages ...]
Nov 16 15:53:35 cluster1 corosync[7426]:   [TOTEM ] releasing messages up to and including 48a
Nov 16 15:53:35 cluster1 cib: [24749]: ERROR: cib_recv_remote_msg: Empty reply
Nov 16 15:53:35 cluster1 cib: [24749]: ERROR: cib_recv_plaintext: Error receiving message: -1: Connection reset by peer (104)
Nov 16 15:53:35 cluster1 cib: [24749]: ERROR: cib_recv_remote_msg: Empty reply
^C
r...@cluster1:~# cibadmin -Q | wc
    382    1943   51825
r...@cluster1:~#

r...@admin:~# cibadmin -Q > cib.xml
cibadmin: Opened connection to 192.168.80.10:6900
r...@admin:~# wc cib.xml
  86  255 6379 cib.xml
r...@admin:~#

>> 10) It's very easy to trash the cib process, e.g. by connecting via
>> telnet and sending a few bytes of garbage; result is an endless loop
>> of "cib: [7846]: ERROR: cib_recv_remote_msg: Empty reply" messages,
>> one per second, and that I need to "killall -9 cib" in order to get
>> everything working again.
>
> ok, that's not good.
> I think this patch should fix it though:
>
> diff -r 828b3329a64c cib/remote.c
> --- a/cib/remote.c      Fri Nov 06 16:28:21 2009 +0100
> +++ b/cib/remote.c      Mon Nov 16 15:18:41 2009 +0100
> @@ -220,7 +220,7 @@ cib_remote_listen(int ssock, gpointer da
>        }
>
>        do {
> -               crm_debug_2("Iter: %d", lpc++);
> +               crm_debug_2("Iter: %d", lpc);
>                if(ssock == remote_tls_fd) {
>  #ifdef HAVE_GNUTLS_GNUTLS_H
>                    login = cib_recv_remote_msg(session, TRUE);
> @@ -230,7 +230,7 @@ cib_remote_listen(int ssock, gpointer da
>                }
>                sleep(1);
>
> -       } while(login == NULL && lpc < 10);
> +       } while(login == NULL && ++lpc < 10);
>
>        crm_log_xml_info(login, "Login: ");
>        if(login == NULL) {

Thanks, since we have been using precompiled packages I haven't
actually gone through the exercise of compiling Pacemaker, so it might
take some time before I get around to testing this patch...
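
One plausible reading of the patch above is that crm_debug_2() only
evaluates its arguments when that debug level is enabled, so the lpc++
inside it never ran and the loop never reached its limit of 10. That is
an assumption about the macro, not something confirmed in this thread.
The sketch below reproduces the effect with a hypothetical DEBUG_LOG
macro and shows the patched loop shape with the increment moved into
the loop condition:

```c
#include <stdio.h>

static int debug_enabled = 0;

/* Hypothetical logging macro: the arguments are only evaluated when
 * debugging is switched on, so side effects in them are conditional. */
#define DEBUG_LOG(fmt, ...) \
    do { if (debug_enabled) { fprintf(stderr, fmt "\n", __VA_ARGS__); } } while (0)

/* Patched shape: the counter is advanced in the loop condition, so the
 * loop ends after max_tries iterations whatever the debug setting is. */
int attempts_until_give_up(int max_tries)
{
    int lpc = 0;

    do {
        DEBUG_LOG("Iter: %d", lpc);
        /* ... one attempt to receive the login message would go here ... */
    } while (++lpc < max_tries);
    return lpc;
}

/* Original shape: returns how often an increment placed inside the log
 * call actually happened over the given number of iterations. */
int side_effects_seen(int iterations)
{
    int lpc = 0;
    int i;

    for (i = 0; i < iterations; i++) {
        DEBUG_LOG("Iter: %d", lpc++);
    }
    return lpc;
}
```

With debugging off, side_effects_seen() never advances the counter,
which would match the endless one-per-second "Empty reply" messages
reported earlier.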

Regards, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-12 Thread Colin
On Thu, Nov 12, 2009 at 3:36 PM, Andrew Beekhof  wrote:
> I used it the other day.
>
> http://www.clusterlabs.org/doc/pacemaker-explained/ch-advanced-options.html#s-remote-connection
>
> Try setting CIB_encrypted to false.

Thanks, that got me a step further...

...but there are still various issues:

1) In cib/remote.c, the function check_group_membership() only checks
whether the user is explicitly listed as member of the group in
/etc/group, but does not accept the user if only the user's primary
group in /etc/passwd is set to the correct group (and the explicit,
then redundant, membership in /etc/group is missing).

2) "Configuration Explained" does not mention CIB_encrypted, that's why
my first attempts didn't work in the first place.

3) "Configuration Explained" says "remote-open-port" instead of
"remote-clear-port" in one place.

4) "Configuration Explained" says that CIB_user must be in the
"hacluster" group, rather than the "haclient" group.

5) The log message "cib: [2941]: debug: cib_remote_listen: New
clear-text connection" should include from where the connection came.

6) The log message "cib: [2941]: ERROR: cib_remote_listen: User is not
a member of the required group" might mention which user and which
group...

7) "Configuration Explained" and the page you just sent me both state
that CIB_user must be part of the hacluster group; apart from the
mistake that the group is haclient, the comment in cib/remote.c and my
observations show that CIB_user actually must be the user as which
the cib process is running.

8) Just tried with crm_resource: The password prompt when not setting
CIB_password is sent to stdout, rather than stderr [which makes it
near impossible to send the output someplace].

9) I am getting completely bogus results via the remote connection,
e.g. "crm_resource --list" shows only 2 of 8 resources, and shows them
as stopped, whereas on the cluster nodes I see the -- correct -- list
with 8 resources which are all started. With "cibadmin -Q" I get:

# cibadmin -Q | wc  # on a cluster node
    379    1895   50474

# cibadmin -Q | wc  # via the remote connection
cibadmin: Opened connection to 192.168.80.10:6900
     66     193    4731

10) It's very easy to trash the cib process, e.g. by connecting via
telnet and sending a few bytes of garbage; result is an endless loop
of "cib: [7846]: ERROR: cib_recv_remote_msg: Empty reply" messages,
one per second, and that I need to "killall -9 cib" in order to get
everything working again.
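
For point 1 above, a minimal sketch of a membership test that also
accepts the user's primary group. The function name user_in_group is
hypothetical and this is not the code in cib/remote.c; it only uses the
standard struct passwd and struct group fields:

```c
#include <grp.h>
#include <pwd.h>
#include <string.h>

/* Hypothetical check: return 1 if the user belongs to the group, either
 * because it is the user's primary group (the gid from /etc/passwd) or
 * because the user is listed as a member (gr_mem from /etc/group);
 * return 0 otherwise. */
int user_in_group(const struct passwd *pw, const struct group *gr)
{
    char **member;

    if (pw == NULL || gr == NULL) {
        return 0;
    }
    if (pw->pw_gid == gr->gr_gid) {
        return 1;   /* primary group matches */
    }
    for (member = gr->gr_mem; member != NULL && *member != NULL; member++) {
        if (strcmp(*member, pw->pw_name) == 0) {
            return 1;   /* explicitly listed as member */
        }
    }
    return 0;
}
```

A caller would typically pass in the results of getpwnam() for the
connecting user and getgrnam() for the required group.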

Only once, out of a couple dozen attempts, did the remote access
actually yield the correct output, other times it completely fails
without any apparent reason ... at this point I'm not quite sure what
to make of all this.

Regards, Colin



Re: [Pacemaker] Remote Access not Working

2009-11-10 Thread Colin
Does anybody else successfully use this feature, or is it suffering
from bit-rot?

Thanks, Colin



[Pacemaker] Remote Access not Working

2009-11-09 Thread Colin
Hi All,

just tried to get the remote access to the cluster up-and-running, but
with more error than success...

Starting point was a working cluster installation. Then I did

# cibadmin --modify -X ''
# /etc/init.d/corosync stop
# /etc/init.d/corosync start

to get the listener, erm, listening:

# netstat -ant | grep 6900
tcp        0      0 0.0.0.0:6900            0.0.0.0:*               LISTEN

For a first test I also changed the password of the "hacluster" user.

Then, on another machine, I set up the environment variables as follows:

# env | grep CIB
CIB_server=192.168.80.10
CIB_user=hacluster
CIB_port=6900

And issued a simple command, crm_resource --list. The crm_resource
command asks for a password and then hangs, on the cluster machine I
find the following in /var/log/daemon.log:

Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: debug: cib_remote_listen: New clear-text connection
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: ERROR: crm_xml_err: XML Error: Entity: line 1: parser error : Start tag expected, '<' not found
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: ERROR: crm_xml_err: XML Error: #026#003#002
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: ERROR: crm_xml_err: XML Error: ^
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: WARN: string2xml: Parsing failed (domain=1, level=3, code=4): Start tag expected, '<' not found
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: ERROR: string2xml: Couldn't parse 3 chars: #026#003#002
Nov  9 17:15:10 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Couldn't parse: '#026#003#002'
Nov  9 17:15:26 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Empty reply
Nov  9 17:15:27 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Empty reply
Nov  9 17:15:28 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Empty reply
Nov  9 17:15:29 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Empty reply
Nov  9 17:15:30 mz-dom0-001-4000 cib: [15698]: ERROR: cib_recv_remote_msg: Empty reply
.

This continues forever, an error message every second, and the process
does not stop itself the normal way:

# /etc/init.d/corosync stop
Stopping corosync daemon: corosync.
# ps aux | grep cib
105      15698  0.3  0.7  13844  4588 ?        S    17:12   0:01 /usr/lib/heartbeat/cib

This seems to prevent other processes from cleanly shutting down, too.

Am I doing something obviously wrong?

Thanks, Colin


PS: AFAICS the remote access does not support something like failover,
or connections to multiple cluster hosts, so I'll have to roll my own
wrapper that takes care of the issue?



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-11-09 Thread Colin
On Sun, Oct 18, 2009 at 10:33 AM, Andrew Beekhof  wrote:
> On Fri, Oct 16, 2009 at 10:54 AM, Raoul Bhatia [IPAX]  
> wrote:
>> On 10/16/2009 09:59 AM, Matthew Palmer wrote:
>>> If this were a single-machine service, I'd completely agree with you.
>>> Unfortunately, a cluster service like pacemaker needs to have absolutely
>>> consistent configuration across all the nodes in the cluster, and having it
>>> read off a file on disk would make that *amazingly* difficult and dangerous.
>>> I remember the fun and games I had dealing with cman (or whatever it was
>>> that went with that) and it's "read an XML config file and update everyone"
>>> model.  I'll take "crm configure edit" over that any day, TYVM.
>>
>> to my knowledge, if no cib.xml file exists, pacemaker creates an empty
>> one with epoch="0" (or similar, to my experience at least < 100 ;) )
>>
>> i've done the following steps numerous times:
>> 1. stop pacemaker on all nodes
>> 2. erase all cib.xml related files
>> 3. drop a new cib.xml into the correct directory on one node
>> 4. set the correct permissions
>> 5. startup all nodes
>> 6. witness the new configuration unfold
>
> Yep, if you must take this approach, then the above steps are correct :-)
>
> Though these days, it's probably easier to skip steps 3 and 4 and load
> the config using the crm shell.

Is it correct, that the remote-{clear|tls}-port attributes are only
honoured at startup, i.e. I need to restart corosync (or shoot down
the cib process) in order to get the port to be opened?

That would mean a stop-start cycle of the cluster on every node if I
don't start dropping XML-files into place (which I have avoided so
far)...

Thanks, Colin



Re: [Pacemaker] Suggestions/questions for Pacemaker

2009-11-08 Thread Colin
On Fri, Nov 6, 2009 at 12:56 PM, Andrew Beekhof  wrote:
> On Fri, Nov 6, 2009 at 12:16 PM, Colin  wrote:
>
>> 2) If I haven't missed something, there is no possibility to configure
>> dependencies on "any of a group"; given a configuration of "resource
>> set A has resources A1, A2, ..., An", we would like to say that
>> "resource B needs at least any n resources from group A
>> up-and-running, and it would be good if they were all up-and-running."
>> (The latter is of course already possible with an appropriate
>> advisory ordering constraint.)
>
> Nod.  http://developerbugs.linux-foundation.org/show_bug.cgi?id=2007

Oh boy, there's even a lot of play in such a "simple" thing: This page
speaks of starting B after at least one of the As is "started", we
would prefer starting B if at least m of the As start up, but only
after trying to start all As, not as early as possible.

>> 7) The naming convention for the XML config seems more difficult than
>> necessary with the mixed use of underscores and dashes as
>> word-separators.
>
> most of the underscores were replaced by dashes with 1.0, the only
> underscores that remain were ones we couldn't change for compatibility
> reasons.

Does this also apply to "multiple-active" vs. "start_stop"?

Regards, Colin



Re: [Pacemaker] Adding Cluster Nodes in Pacemaker "Configuration Explained"

2009-11-05 Thread Colin
On Thu, Nov 5, 2009 at 7:32 PM, Andrew Beekhof  wrote:
> On Thu, Nov 5, 2009 at 4:25 PM, Colin  wrote:
>> On Wed, Oct 21, 2009 at 2:57 PM, Andrew Beekhof  wrote:
>>> On Wed, Oct 21, 2009 at 1:48 PM, Colin  wrote:
>>>> Let's see whether I can summarise correctly:
>>>
>>> basically all correct
>>>
>>>> For Pacemaker + CoroSync/OpenAIS, the Pacemaker attribute
>>>> "expected-quorum-votes" tells the cluster how many votes it needs for
>>>> quorum. (What happens when the option is not set?)
>>>
>>> The cluster sets it for you based on how many nodes it can see.
>>
>> Something strange is going on here:
>>
>> - The cluster updates expected-quorum-votes even when I set it
>> manually; this seems to happen whenever a node joins or leaves the
>> cluster.
>>
>> - A cluster with four expected-quorum-votes, and three of four nodes
>> online, still has quorum; is this a bug??? Or is expected-quorum-votes
>> not the number of nodes required in a partition to have quorum?
>
> no, it's the total number of nodes seen by the cluster.
> the cluster won't allow you to set this to a value less than the number
> of nodes it knows about

Great, with that settled I can report that our 1.0.6 w/corosync
cluster survived the first round of testing with no problems!
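
The observation that three of four nodes still have quorum fits the
usual majority rule. A minimal sketch, assuming quorum means strictly
more than half of expected-quorum-votes; the function name has_quorum is
hypothetical and this is not taken from the Pacemaker or Corosync
sources:

```c
#include <stdbool.h>

/* Hypothetical majority check: a partition has quorum when it holds
 * strictly more than half of the expected votes. */
bool has_quorum(unsigned online_votes, unsigned expected_votes)
{
    return 2UL * online_votes > expected_votes;
}
```

Under that assumption a four-node cluster keeps quorum with three nodes
online and loses it in an even two/two split.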

Regards, Colin



Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-04 Thread Colin
On Wed, Nov 4, 2009 at 5:47 PM, Andrew Beekhof  wrote:
>
> Hopelessly out of date?
> Corosync has been supported for all of 3 days now.

Sorry, it seems that I jumped to a wrong conclusion (namely that with
Corosync being a part of OpenAIS, and Pacemaker having run on OpenAIS
for a while, that there wasn't much difference to supporting Corosync
instead of OpenAIS -- shows that I'm still quite ignorant about some of
the internals.)

Actually, I set up Pacemaker with Corosync from the new packages, just
to see what it looks like, and it was so easy that we'll stick to it
for the next round of tests, i.o.w., the details of the cluster
underneath Pacemaker are so well hidden that (a) it doesn't make much
difference, and (b) my ignorance in that area never was a problem: It
just works.

-Colin



Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-04 Thread Colin
On Wed, Nov 4, 2009 at 2:32 PM, Colin  wrote:
> On Tue, Nov 3, 2009 at 4:32 PM, Martin Gerhard Loschwitz wrote:
>
> One question: AFAICS the package dependencies automatically install
> corosync, but not heartbeat; so in order to use pacemaker with
> heartbeat we need to (a) disable corosync in /etc/rcS.d, and (b)
> manually install heartbeat?

Ok, nothing to disable -- /etc/default/corosync sets START to no (how
many more switches do we need for one and the same thing?).

But why _is_ corosync in rcS.d, rather than rc2.d?

Thanks, Colin



Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-04 Thread Colin
On Tue, Nov 3, 2009 at 4:32 PM, Martin Gerhard Loschwitz
 wrote:
>
> i am happy to announce the availability of Pacemaker 1.0.6 packages
> for Debian GNU/Linux 5.0 alias Lenny (i386 and amd64).

Great, thanks!

> * pacemaker-openais and pacemaker-heartbeat are gone; pacemaker now
> only comes in one flavour, having support for corosync and heartbeat
> built in. This is based on pacemaker's capability to detect by which
> messaging framework it has been started and act accordingly.

One question: AFAICS the package dependencies automatically install
corosync, but not heartbeat; so in order to use pacemaker with
heartbeat we need to (a) disable corosync in /etc/rcS.d, and (b)
manually install heartbeat?

It also seems that http://clusterlabs.org/wiki/Install and
http://clusterlabs.org/wiki/Initial_Configuration are hopelessly out
of date; suppose I wanted to try out pacemaker with corosync from the
latest packages, what do I minimally need to set up?

Thanks, Colin



Re: [Pacemaker] Load Balancing, Node Scores and Stickiness

2009-10-26 Thread Colin
On Fri, Oct 23, 2009 at 2:23 PM, Andrew Beekhof  wrote:
> On Fri, Oct 23, 2009 at 10:13 AM, Colin  wrote:
>> On Thu, Oct 22, 2009 at 3:51 PM, Johan Verrept  wrote:
>>> On Thu, 2009-10-22 at 15:10 +0200, Florian Haas wrote:
>>>> On 10/22/2009 02:37 PM, Andrew Beekhof wrote:
>>>> >> I wondered, does it happen dynamically? If one resource starts using a
>>>> >> lot of resources, are the other migrated to other nodes?
>>>> >
>>>> > Not yet.
>>>> > Such a feature is planned though.
>>>> >
>>>> > At the moment pacemaker purely goes on the number of services it has
>>>> > allocated to the node.
>>>> > Total/Available RAM, CPU, HDD, none of these things are yet taken into 
>>>> > account.
>>>>
>>>> Are there any plans on what this feature would look like in more detail?
>>>> A daemon monitoring various performance indicators and updating node
>>>> attributes accordingly? Couldn't that be done today, as a cloneable
>>>> resource agent?
>>>
>>> I can see a few problems with such a feature if you wish to implement it
>>> today.
>>> First of all, you cannot really move services to less loaded nodes if
>>> you cannot determine which resource causes which load. If you pick a
>>> resource at random, you might move a "too heavy" resource to another
>>> less loaded node and cause even more load on that node resulting in
>>> something (else?) being moved back. It will create a pretty unstable
>>> cluster under load.
>>> I am also unsure if it would be wise to mix this directly into the
>>> current node scoring. Load numbers will vary wildly and unless the
>>> resulting attribute values are in some way stabilised over longer
>>> periods, it will also cause instability. (RRDTool?)
>>> It might be possible, but it will be one hell of a complex RA :). A
>>> daemon might be better, but both will require a LOT of configuration
>>> just to differentiate the load of the different resources.
>>>
>>>> Or are you referring to missing features actually evaluating such
>>>> information, as in, rather than saying "run this resource on a node with
>>>> a load average of X or less", being able to say "run this resource on
>>>> the node with the currently lowest load average"?
>>>
>>> How will that translate into repeatable node states? At this moment, if
>>> you use a timed evaluation of the cluster state, resources should always
>>> be assigned to the same nodes (at least, I've never seen it change
>>> unless it was under direction of a time constraint).
>>>
>>> "run this resource on the node with the currently lowest load average"
>>> is something that is very unlikely to ever return the same answer twice.
>>>
>>> Complex indeed! Someone is going to have a considerable amount of fun
>>> with this :D
>>
>> Perhaps static load balancing could be implemented first, before
>> trying to go dynamic.
>>
>> Suppose you could configure an arbitrary set of "measures" (I'd call
>> them resources, but that word is already taken in this context), like
>> an arbitrary set of keywords. For every such "measure", you can then
>> configure (a) how much of it each node has, and (b) how much of it
>> each cluster resource/service requires. The cluster can then use some
>> heuristics to find a good distribution of resources (perfect could be
>> too hard, this is squarely in NP-complete land; PostgreSQL uses a
>> genetic algorithm for query optimisation...).
>
> Thats basically what we're going for this time around.
> Maybe with enough experience we'll attempt the dynamic version below.
>
>> This is not as good as dynamic balancing, but still better than
>> nothing, for example this could make sure that a resource with a
>> tendency to do I/O runs on the same node with a resource that
>> generally uses much CPU...

The algorithms should be the same, the difference is whether you run
them once on statically configured resource usage, or continuously on
dynamically gathered resource consumption (or a weighted average of
the two). Probably a good idea to test with static input first...

Colin



Re: [Pacemaker] Load Balancing, Node Scores and Stickiness

2009-10-23 Thread Colin
On Thu, Oct 22, 2009 at 3:51 PM, Johan Verrept  wrote:
> On Thu, 2009-10-22 at 15:10 +0200, Florian Haas wrote:
>> On 10/22/2009 02:37 PM, Andrew Beekhof wrote:
>> >> I wondered, does it happen dynamically? If one resource starts using a
>> >> lot of resources, are the other migrated to other nodes?
>> >
>> > Not yet.
>> > Such a feature is planned though.
>> >
>> > At the moment pacemaker purely goes on the number of services it has
>> > allocated to the node.
>> > Total/Available RAM, CPU, HDD, none of these things are yet taken into 
>> > account.
>>
>> Are there any plans on what this feature would look like in more detail?
>> A daemon monitoring various performance indicators and updating node
>> attributes accordingly? Couldn't that be done today, as a cloneable
>> resource agent?
>
> I can see a few problems with such a feature if you wish to implement it
> today.
> First of all, you cannot really move services to less loaded nodes if
> you cannot determine which resource causes which load. If you pick a
> resource at random, you might move a "too heavy" resource to another
> less loaded node and cause even more load on that node resulting in
> something (else?) being moved back. It will create a pretty unstable
> cluster under load.
> I am also unsure if it would be wise to mix this directly into the
> current node scoring. Load numbers will vary wildly and unless the
> resulting attribute values are in some way stabilised over longer
> periods, it will also cause instability. (RRDTool?)
> It might be possible, but it will be one hell of a complex RA :). A
> daemon might be better, but both will require a LOT of configuration
> just to differentiate the load of the different resources.
>
>> Or are you referring to missing features actually evaluating such
>> information, as in, rather than saying "run this resource on a node with
>> a load average of X or less", being able to say "run this resource on
>> the node with the currently lowest load average"?
>
> How will that translate into repeatable node states? At this moment, if
> you use a timed evaluation of the cluster state, resources should always
> be assigned to the same nodes (at least, I've never seen it change
> unless it was under direction of a time constraint).
>
> "run this resource on the node with the currently lowest load average"
> is something that is very unlikely to ever return the same answer twice.
>
> Complex indeed! Someone is going to have a considerable amount of fun
> with this :D

Perhaps static load balancing could be implemented first, before
trying to go dynamic.

Suppose you could configure an arbitrary set of "measures" (I'd call
them resources, but that word is already taken in this context), like
an arbitrary set of keywords. For every such "measure", you can then
configure (a) how much of it each node has, and (b) how much of it
each cluster resource/service requires. The cluster can then use some
heuristics to find a good distribution of resources (perfect could be
too hard, this is squarely in NP-complete land; PostgreSQL uses a
genetic algorithm for query optimisation...).

This is not as good as dynamic balancing, but still better than
nothing, for example this could make sure that a resource with a
tendency to do I/O runs on the same node with a resource that
generally uses much CPU...
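
A minimal sketch of the static variant described above, using a simple
greedy heuristic rather than anything Pacemaker actually implements.
The two "measures", the struct layout and the function names are all
hypothetical; each resource goes to the node that can still hold it and
would keep the most spare capacity afterwards:

```c
#include <stddef.h>

#define N_MEASURES 2   /* e.g. "cpu" and "io"; purely illustrative */

struct capacity {
    int amount[N_MEASURES];
};

/* 1 if every requirement fits into the remaining capacity, else 0. */
static int fits(const struct capacity *free_cap, const struct capacity *need)
{
    size_t m;

    for (m = 0; m < N_MEASURES; m++) {
        if (need->amount[m] > free_cap->amount[m]) {
            return 0;
        }
    }
    return 1;
}

/* Total capacity that would remain on the node after placing the resource. */
static int headroom(const struct capacity *free_cap, const struct capacity *need)
{
    int total = 0;
    size_t m;

    for (m = 0; m < N_MEASURES; m++) {
        total += free_cap->amount[m] - need->amount[m];
    }
    return total;
}

/* Greedy placement in the given resource order. free_cap is updated in
 * place; placement[r] receives the node index for resource r, or -1 if no
 * node can hold it. Returns the number of resources placed. */
size_t place_resources(struct capacity *free_cap, size_t n_nodes,
                       const struct capacity *needs, size_t n_res,
                       int *placement)
{
    size_t placed = 0;
    size_t r, n, m;

    for (r = 0; r < n_res; r++) {
        int best = -1;

        for (n = 0; n < n_nodes; n++) {
            if (!fits(&free_cap[n], &needs[r])) {
                continue;
            }
            if (best < 0 ||
                headroom(&free_cap[n], &needs[r]) >
                headroom(&free_cap[best], &needs[r])) {
                best = (int)n;
            }
        }
        placement[r] = best;
        if (best >= 0) {
            for (m = 0; m < N_MEASURES; m++) {
                free_cap[best].amount[m] -= needs[r].amount[m];
            }
            placed++;
        }
    }
    return placed;
}
```

The same routine could later be fed with measured consumption instead
of configured values, which is the static-versus-dynamic split discussed
in this thread.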

Regards, Colin



Re: [Pacemaker] Load Balancing, Node Scores and Stickiness

2009-10-22 Thread Colin
On Thu, Oct 22, 2009 at 11:40 AM, Andrew Beekhof  wrote:
> On Wed, Oct 21, 2009 at 2:06 PM, Colin  wrote:
>>
>> it seems from the documentation that Pacemaker has some inherent
>> tendency to do load-balancing, in the sense of, given equal choice,
>> not starting all resources on a single node...
>>
>> Now, I would like to be able to choose freely on a scale between
>> "always move everything to achieve good load balancing" and "don't
>> gratuitously migrate resources", and would therefore like to
>> understand the algorithms in Pacemaker better.
>>
>> Given a bunch of nodes and resources with a simple setup, i.e. no
>> resource colocation constraints, no groups etc., I understand that a
>> global score is calculated for each resource and each node, where
>>
>> score( resource, node ) = sum of all rsc_location constraints for that
>> resource and node + if the resource is already running on this node,
>> the stickiness (the stickiness of the resource or the global default
>> stickiness)
>>
>> How does the assignment of nodes proceed? My guess is something like:
>>
>> for every resource in order of resource priority
>>   choose node with highest score for that resource
>
>     if multiple nodes exist with the same score, pick one with the
> least allocated resources

That's easy enough to understand ... and it means I can't do any
fine-tuning. For example, suppose that 4 nodes of my 10-node cluster
fail, and then come up again. If all resources have equal score on all
nodes (without counting stickiness), then (a) if stickiness is greater
than 0 all resources will stay put, and (b) if stickiness is 0 then the
cluster will move resources around to distribute them evenly?

(Come to think of it, any kind of fine-tuning taking into account
multiple resource weights and more complicated migration resistance
scores would probably be algorithmically really really horrible and
complex (in the complexity theoretic sense, too)...)
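
The procedure as described in this thread can be modelled in a few
lines. This is only a sketch of the description above (score = location
constraints plus stickiness, highest score wins, ties go to the node
with the fewest allocated resources), not Pacemaker's actual policy
engine; allocate() and its parameters are hypothetical names.

```python
from collections import Counter


def allocate(resources, nodes, location, running_on, stickiness):
    """Model of the allocation loop described in the thread.

    resources:  list of (resource name, priority)
    nodes:      list of node names
    location:   {(resource, node): summed rsc_location score}
    running_on: {resource: node it currently runs on}
    stickiness: one global stickiness value (a simplification)
    """
    allocated = {node: 0 for node in nodes}
    assignment = {}
    # Highest priority first; sorted() is stable for equal priorities.
    for name, _priority in sorted(resources, key=lambda r: -r[1]):

        def score(node):
            total = location.get((name, node), 0)
            if running_on.get(name) == node:
                total += stickiness
            return total

        # Highest score wins; ties go to the least-loaded node.
        best = min(nodes, key=lambda n: (-score(n), allocated[n]))
        assignment[name] = best
        allocated[best] += 1
    return assignment


# Hypothetical scenario: node3 failed and came back, everything sits on node1/node2.
nodes = ["node1", "node2", "node3"]
resources = [("r%d" % i, 0) for i in range(1, 7)]
running_on = {"r1": "node1", "r2": "node1", "r3": "node1",
              "r4": "node2", "r5": "node2", "r6": "node2"}

stay = allocate(resources, nodes, {}, running_on, stickiness=100)
spread = allocate(resources, nodes, {}, running_on, stickiness=0)
print(Counter(stay.values()), Counter(spread.values()))
```

In this model a positive stickiness keeps every resource where it is,
while stickiness 0 spreads the six resources two per node, matching
cases (a) and (b) above.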

Thanks, Colin



[Pacemaker] Load Balancing, Node Scores and Stickiness

2009-10-21 Thread Colin
Hi All,

it seems from the documentation that Pacemaker has some inherent
tendency to do load-balancing, in the sense of, given equal choice,
not starting all resources on a single node...

Now, I would like to be able to choose freely on a scale between
"always move everything to achieve good load balancing" and "don't
gratuitously migrate resources", and would therefore like to
understand the algorithms in Pacemaker better.

Given a bunch of nodes and resources with a simple setup, i.e. no
resource colocation constraints, no groups etc., I understand that a
global score is calculated for each resource and each node, where

score( resource, node ) = sum of all rsc_location constraints for that
resource and node + if the resource is already running on this node,
the stickiness (the stickiness of the resource or the global default
stickiness)

How does the assignment of nodes proceed? My guess is something like:

for every resource in order of resource priority
   choose node with highest score for that resource
   ??? somehow modify scores to prevent all resources on one node

Are the details documented anywhere [except for the source]?

Regards, Colin



Re: [Pacemaker] Adding Cluster Nodes in Pacemaker "Configuration Explained"

2009-10-21 Thread Colin
On Wed, Oct 21, 2009 at 12:14 PM, Andrew Beekhof wrote:
> On Oct 21, 2009, at 12:11 PM, Colin wrote:
>> On Wed, Oct 21, 2009 at 12:00 PM, Andrew Beekhof wrote:
>>> On Oct 21, 2009, at 11:50 AM, Colin wrote:
>>>> On Wed, Oct 21, 2009 at 11:12 AM, Andrew Beekhof wrote:
>>>>> On Oct 21, 2009, at 11:05 AM, Colin wrote:
>>>>>
>>>>>> perhaps the section on adding cluster nodes should mention adjusting
>>>>>> the expected-quorum-votes, otherwise you get something like:
>>>>>
>>>>> expected-quorum-votes is ignored for heartbeat based clusters.
>>>>> I should remove it from the crm_mon output when not using openais
>>>>
>>>> Erm ... so how do I tell heartbeat how many nodes are needed?
>>>
>>> You don't.
>>> It knows how many nodes it has ever seen and you get quorum when you have
>>> at least half that.
>>
>> Ok -- is that a "greater equal" or a "greater than"?
>
> greater than.  that's the definition of quorum
>
>> In a 8-node cluster we _don't_ want 4 nodes to have quorum, only 5
>> nodes or more, to strictly prevent a split brain...
>>
>> ...however a 2-node cluster wouldn't work with "greater than 50%", so
>> presumably it's the other (which we don't want).
>
> heartbeat always pretends two-node clusters have quorum.

Ah, I see, that's why I was unsure about the "greater than" vs. "greater equal".

Let's see whether I can summarise correctly:

For Pacemaker + CoroSync/OpenAIS, the Pacemaker attribute
"expected-quorum-votes" tells the cluster how many votes it needs for
quorum. (What happens when the option is not set?)

For Pacemaker + Heartbeat, the Pacemaker attribute
"expected-quorum-votes" is ignored, and Heartbeat's idea of quorum is
used, i.e. (a) for a two-node cluster (list of cluster nodes has two
entries as per "cl_status listnodes") every node that is up and
running believes it has quorum, whether alone or not, and (b) for an
n-node cluster with n greater than two a partition with strictly
greater than 50% of the nodes has quorum. (IIRC (a) can be changed by
making the vote more complicated, but I'm not interested in two-node
clusters and did not look into this.)
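
Written out as code, the Heartbeat rule as summarised here looks like
the sketch below. It only restates the summary in this mail (it is not
taken from the Heartbeat source), and the function name is made up.

```python
def heartbeat_has_quorum(known_nodes, partition_nodes):
    """Quorum rule as summarised in this thread.

    known_nodes:     number of nodes the cluster knows about
    partition_nodes: number of nodes in the partition being asked about
    """
    if partition_nodes < 1:
        return False
    if known_nodes == 2:
        # Special case: a two-node cluster always claims quorum.
        return True
    # Otherwise strictly more than half of the known nodes are required.
    return 2 * partition_nodes > known_nodes


print(heartbeat_has_quorum(8, 4), heartbeat_has_quorum(8, 5))
```

For the 8-node cluster discussed above, a 4-node partition has no
quorum and a 5-node partition does.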

Regards, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-16 Thread Colin
On Thu, Oct 15, 2009 at 10:51 AM, Matthew Palmer wrote:
> On Thu, Oct 15, 2009 at 10:07:56AM +0200, Colin wrote:
>> Another question regarding how to activate a pacemaker config: Is
>> there any way to activate the config before the cluster starts up?
>>
>> (Scenario is that the installation of the cluster nodes is fully
>> automatic. It seems a bit awkward how to configure pacemaker if I
>> can't just write out a config file during install: I need to somehow
>> make sure that on first system boot a script that activates my config
>> is executed, but not too early because it takes a minute or so until
>> cibadmin(1) and friends actually work...)
>
> I believe you can drop a cib.xml into place before the cluster first starts
> and it'll pick up and run with that.  I'm not a fan of that method, though,
> as it has all the same problems as imaging machines (no easy means of
> updating running configs the same way as you update "initial" configs, and
> so on).  We're configuring pacemaker using Puppet, just describing the
> primitives, groups, constraints and so on in the manifest and having Puppet
> do all the heavy lifting if required.

Thanks for the note, plus: I've never heard of Puppet, but will check it out.

(1 min later: http://wiki.github.com/camptocamp/puppet-pacemaker has
no downloads, and no documentation; is it even remotely stable/ready
for use?)

> Why do you need to have the config setup completely before starting
> the cluster, though?

Let's just say I like my programs/daemons to start up with the correct
configuration, because I've already been burnt: some time back there
was a similar problem with a different application, where the default
that it started up with just didn't work correctly. It's always easier
when a program/daemon just reads a config file and monitors it for
changes (or re-reads it on HUP). These application-specific ways of
feeding a config into an already running program are particularly
annoying because every program uses a different method for it.

Is there a simple-and-clean alternative to dropping a cib.xml file
into place when doing a fully-automated installation?

Regards, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-15 Thread Colin
On Sun, Oct 11, 2009 at 9:13 PM, Andrew Beekhof wrote:
> On Fri, Oct 9, 2009 at 3:12 PM, Colin wrote:
>> The config explained document is excellent -- once everything is up
>> and running far enough to arrive at "its level".
>
> Agreed.  I've started working on some howtos to fill the gap, but it
> will take time :-)

Another question regarding how to activate a pacemaker config: Is
there any way to activate the config before the cluster starts up?

(Scenario is that the installation of the cluster nodes is fully
automatic. It seems a bit awkward how to configure pacemaker if I
can't just write out a config file during install: I need to somehow
make sure that on first system boot a script that activates my config
is executed, but not too early because it takes a minute or so until
cibadmin(1) and friends actually work...)
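
The first-boot script described here could be sketched as follows:
poll "cibadmin --query" until it succeeds, then load the prepared file
with "cibadmin --replace --xml-file". This is an untested outline, not
a recommended procedure; wait_for_cib(), apply_config(), the retry
counts and the example path are all assumptions, and whether a full
replace is appropriate depends on the CIB in question.

```python
import subprocess
import time


def wait_for_cib(run=subprocess.call, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll `cibadmin --query` until it exits with status 0."""
    for attempt in range(attempts):
        try:
            status = run(["cibadmin", "--query"],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            if status == 0:
                return True
        except OSError:
            # cibadmin not installed or not executable yet.
            pass
        if attempt + 1 < attempts:
            sleep(delay)
    return False


def apply_config(xml_path, run=subprocess.call, attempts=60, delay=5.0,
                 sleep=time.sleep):
    """Wait for the CIB, then replace it with the contents of xml_path."""
    if not wait_for_cib(run, attempts, delay, sleep):
        return False
    return run(["cibadmin", "--replace", "--xml-file", xml_path]) == 0


if __name__ == "__main__":
    # Hypothetical location of the configuration written during install.
    ok = apply_config("/etc/ha.d/initial-cib.xml")
    print("configuration applied" if ok else "cluster never became ready")
```

The run and sleep parameters exist only so the logic can be exercised
without a running cluster.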

Thanks, Colin



Re: [Pacemaker] Access CIB/CRM via IPC/Library?

2009-10-14 Thread Colin
Ok, thanks for the pointers.

Colin



[Pacemaker] Access CIB/CRM via IPC/Library?

2009-10-09 Thread Colin
Hi All,

after having understood the concepts (if not all details) of
configuring Pacemaker ("Configuration Explained" does a good job) I
started checking out how to actually create/activate a configuration.

Seems that there's some GUI (wherever it is, it's not included in my
packages, but we don't need it anyhow) and the command-line utilities
crm and cibadmin. However what we really need is a more direct access
in order to programmatically change the cluster configuration [without
calling out to the command-line utilities from Python or C].

It also seems that crm and cibadmin use libcrmcommon to communicate
via the Unix sockets in /var/run/crm. Is this library documented
anywhere? Are there Python-bindings? Is there any other officially
supported API?

Thanks, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-09 Thread Colin
On Wed, Oct 7, 2009 at 11:31 AM, Andrew Beekhof wrote:
> On Tue, Oct 6, 2009 at 10:19 AM, Colin wrote:
>
> ok, you can run both of those commands with metadata to get the descriptions
>  /usr/lib/heartbeat/crmd metadata
>  /usr/lib/heartbeat/pengine metadata

Thanks for the hint; I have now also understood that crmd and pengine
are in fact part of pacemaker, the "heartbeat" directory
notwithstanding.

> Eventually I'll use some xslt to turn them into a man page.
> There is always the config explained pdf though:
> http://www.clusterlabs.org/wiki/Documentation

The config explained document is excellent -- once everything is up
and running far enough to arrive at "its level".

Regards, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-06 Thread Colin
On Tue, Oct 6, 2009 at 9:59 AM, Andrew Beekhof wrote:
> On Tue, Oct 6, 2009 at 8:58 AM, Colin wrote:
>> Is there a complete documentation of all config-parameters somewhere?
>>
>> A heap of "Using default value for..." just scrolled by in the log,
>> with most of the affected parameters not even appearing once in the
>> documentation...
>
> Which process?

Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '60s' for cluster option 'dc-deadtime'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '15min' for cluster option 'cluster-recheck-interval'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '2min' for cluster option 'election-timeout'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '20min' for cluster option 'shutdown-escalation'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '3min' for cluster option 'crmd-integration-timeout'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '30min' for cluster option 'crmd-finalization-timeout'
Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '2' for cluster option 'expected-quorum-votes'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '60s' for cluster option 'dc-deadtime'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '15min' for cluster option 'cluster-recheck-interval'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '2min' for cluster option 'election-timeout'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '20min' for cluster option 'shutdown-escalation'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '3min' for cluster option 'crmd-integration-timeout'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '30min' for cluster option 'crmd-finalization-timeout'
Oct  6 10:14:47 cluster0 crmd: [2704]: debug: cluster_option: Using
default value '2' for cluster option 'expected-quorum-votes'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'stop' for cluster option 'no-quorum-policy'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'symmetric-cluster'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value '0' for cluster option 'default-resource-stickiness'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'is-managed-default'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'false' for cluster option 'maintenance-mode'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'start-failure-is-fatal'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'stonith-enabled'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'reboot' for cluster option 'stonith-action'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value '60s' for cluster option 'stonith-timeout'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'startup-fencing'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value '60s' for cluster option 'cluster-delay'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value '30' for cluster option 'batch-limit'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value '20s' for cluster option 'default-action-timeout'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'false' for cluster option 'stop-all-resources'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'stop-orphan-resources'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using
default value 'true' for cluster option 'stop-orphan-actions'
Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster
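
Since the log itself lists every option together with its default, a
short script can turn it into a per-daemon table. A minimal sketch,
assuming the line format shown above (including the mail-wrapped
lines); OPTION_RE and default_options() are illustrative names.

```python
import re

# Matches e.g. "crmd: [2704]: debug: cluster_option: Using default value
# '60s' for cluster option 'dc-deadtime'". \s+ also spans line wraps.
OPTION_RE = re.compile(
    r"(\w+): \[\d+\]: debug: cluster_option:\s+Using\s+default\s+value\s+"
    r"'([^']*)'\s+for\s+cluster\s+option\s+'([^']*)'"
)


def default_options(log_text):
    """Return {daemon: {option: default value}} found in the log text."""
    defaults = {}
    for daemon, value, option in OPTION_RE.findall(log_text):
        defaults.setdefault(daemon, {})[option] = value
    return defaults


sample = (
    "Oct  6 10:13:41 cluster0 crmd: [2704]: debug: cluster_option: Using\n"
    "default value '60s' for cluster option 'dc-deadtime'\n"
    "Oct  6 10:14:49 cluster0 pengine: [2718]: debug: cluster_option: Using\n"
    "default value 'stop' for cluster option 'no-quorum-policy'\n"
)
for daemon, options in sorted(default_options(sample).items()):
    for option, value in sorted(options.items()):
        print(daemon, option, value)
```

Repeated entries (the crmd block appears twice above) simply overwrite
each other, and a truncated last line is ignored.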

Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-06 Thread Colin
>> Check all the necessary paths. If the processes cannot write to their
>> respective dirs they refuse to start and if pacemaker cannot start all
>> the necessary processes, the node is rebooted. At least you know now
>> that pacemaker is actually started :)

Ok, with the /var/run/crm created the cluster at least stays up, and
crm_mon gives me some sane output.

cluster0:/var/run# mkdir crm
cluster0:/var/run# chown hacluster:haclient crm

I'll take it from there...

Thanks, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-06 Thread Colin
Is there a complete documentation of all config-parameters somewhere?

A heap of "Using default value for..." just scrolled by in the log,
with most of the affected parameters not even appearing once in the
documentation...

Regards, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-05 Thread Colin
On Mon, Oct 5, 2009 at 4:53 PM, Johan Verrept wrote:
> On Mon, 2009-10-05 at 16:30 +0200, Colin wrote:
>
>> - The pacemaker-heartbeat installation, with "crm yes" in
>> /etc/ha.d/ha.cf, does not seem to start pacemaker, at least a crm_mon
>> just hangs trying to connect to the cluster (the documentation doesn't
>> give any hint on how to check), and a little bit later the machine
>> reboots with
>>
>> Message from sysl...@cluster0 at Oct  5 16:06:49 ...
>>  heartbeat: [2821]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/cib
>
> Check all the necessary paths. If the processes cannot write to their
> respective dirs they refuse to start and if pacemaker cannot start all
> the necessary processes, the node is rebooted. At least you know now
> that pacemaker is actually started :)
> If you see no errors, reconfigure your syslog to allow more messages.
> Usually, these are created when doing a "make install"

I'm trying really hard to RTFM, but haven't found anything that is
even half complete and/or correct. Which are the "necessary paths"?
Which processes should there even be?

Anyhow, thanks for the hint, I hadn't looked for errors regarding
directories because I had assumed (wrongly) that the Debian packages
would include all necessary directories. Now I found this:

Oct  5 16:06:46 cluster0 attrd: [2838]: ERROR: socket_wait_conn_new:
trying to create in /var/run/crm/attrd bind:: No such file or
directory
Oct  5 16:06:46 cluster0 attrd: [2838]: ERROR: wait_channel_init:
Can't create wait channel of type uds: No such file or directory (2)

(Since "attrd" is not mentioned anywhere in the included documentation
I will look at the source, or use trial and error, to find out the
correct ownership of this directory.)

Thanks, Colin



Re: [Pacemaker] Installation woes (w/Debian packages)

2009-10-05 Thread Colin
On Mon, Oct 5, 2009 at 8:25 PM, Stefan wrote:
> helo,
>
> Am Montag 05 Oktober 2009 16:30:37 schrieb Colin:
>> Hi All,
>>
>> in order to evaluate Pacemaker for our new cluster (the features shown
>> in "Configuration Explained" are exactly what we need, and miles
>> beyond what our old Heartbeat version 1-based cluster could do) I just
>> took two freshly installed Debian Lenny systems and tried to install
>> Pacemaker as per http://clusterlabs.org/wiki/Install and
>> http://clusterlabs.org/wiki/Initial_Configuration, once with heartbeat
>> and once with openais. Unfortunately neither installation gets even
>> half off the ground:
>
> you tried this for installing?
>
> http://clusterlabs.org/wiki/Debian_Lenny_HowTo

Not yet -- it seems to install exactly the same packages as the
Install-page uses, except for installing heartbeat as part of the
openais setup (where I would be surprised if it solved the missing
ais-keygen etc. files problem).

I'll give it a spin if nothing else works.

Thanks, Colin



[Pacemaker] Installation woes (w/Debian packages)

2009-10-05 Thread Colin
Hi All,

in order to evaluate Pacemaker for our new cluster (the features shown
in "Configuration Explained" are exactly what we need, and miles
beyond what our old Heartbeat version 1-based cluster could do) I just
took two freshly installed Debian Lenny systems and tried to install
Pacemaker as per http://clusterlabs.org/wiki/Install and
http://clusterlabs.org/wiki/Initial_Configuration, once with heartbeat
and once with openais. Unfortunately neither installation gets even
half off the ground:

- The pacemaker-heartbeat installation, with "crm yes" in
/etc/ha.d/ha.cf, does not seem to start pacemaker, at least a crm_mon
just hangs trying to connect to the cluster (the documentation doesn't
give any hint on how to check), and a little bit later the machine
reboots with

Message from sysl...@cluster0 at Oct  5 16:06:49 ...
 heartbeat: [2821]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/cib

- The pacemaker-openais installation, or the package dependencies, are
incomplete: There is no ais-keygen command, and no /etc/init.d/openais
script.

Can I expect more success installing from source? Which installation
method/instructions do people successfully use?

Thanks, Colin
