Re: [Pacemaker] votequorum for 2 node cluster

2014-06-11 Thread Jacek Konieczny
On 06/11/14 16:35, Kostiantyn Ponomarenko wrote:
> And that is like roulette: if we lose the node with the lowest nodeid, we
> lose everything.
> So I can lose only the node which doesn't have the lowest nodeid?
> And it's not useful in a 2-node cluster.
> Am I correct?

It may be useful. If you define the roles of the nodes, like this:

– node 2: 'master' node
– node 1: 'backup' node

and monitor node availability so that the backup can be replaced promptly
when it fails, then you can get quite high availability.

It won't work in Active/Active or peer-to-peer (all nodes equal) setups,
though.
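
Which node holds the lowest nodeid (and therefore stays quorate in a clean
split) can be pinned down explicitly in corosync.conf. A minimal sketch, with
placeholder addresses:

nodelist {
    node {
        ring0_addr: 10.0.0.1
        nodeid: 1    # lowest nodeid, wins the tie-break
    }
    node {
        ring0_addr: 10.0.0.2
        nodeid: 2
    }
}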

Greets,
  Jacek

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Howto check if the current node is active?

2014-01-07 Thread Jacek Konieczny
On 2014-01-07 13:33, Bauer, Stefan (IZLBW Extern) wrote:
> How can I check if the current node I'm connected to is the active one?
> 
> It should be parseable because i want to use it in a script.

Pacemaker is not limited to Active/Passive setups; in fact, it has no notion
of an 'active' node – every node in the cluster is active (unless it is in
standby).

Active/Passive node status may make sense in many Pacemaker deployments
– but that is specific to the configuration. Sometimes 'active node'
will be running the DRBD master, other times it will be the one where a
specific resource is running. Generally you can test that by
parsing 'cibadmin' output or some higher-level Pacemaker shell (crmsh or
pcs).
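
For example, a minimal sketch of such a check, treating the node that runs a
resource named 'my_vip' as the 'active' one (the resource name is only a
placeholder for whatever marks the active role in your configuration, and the
output parsing is deliberately naive):

#!/bin/sh
# Where is the resource running, and is that here?
RESOURCE="my_vip"
ME="$(crm_node -n)"                           # name of the local node
WHERE="$(crm_resource --resource "$RESOURCE" --locate 2>/dev/null | awk '{print $NF}')"
if [ "$WHERE" = "$ME" ]; then
    echo "active"
    exit 0
else
    echo "not active"
    exit 1
fi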

Greets,
   Jacek




Re: [Pacemaker] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-26 Thread Jacek Konieczny
On Wed, 26 Jun 2013 18:38:37 +1000
Andrew Beekhof  wrote:
> >> trace   Jun 25 13:40:10 gio_read_socket(366):0: 0xa6c140.4 1 (ref=1)
> >> trace   Jun 25 13:40:10 lrmd_ipc_accept(89):0: Connection 0xa6d110
> >> info    Jun 25 13:40:10 crm_client_new(276):0: Connecting 0xa6d110 for uid=17 gid=0 pid=25212 id=d771e06b-47e7-43a6-a447-63343870396e
> >> debug   Jun 25 13:40:10 handle_new_connection(735):2147483648: IPC credentials authenticated (25209-25212-6)
> >> debug   Jun 25 13:40:10 qb_ipcs_shm_connect(282):2147483648: connecting to client [25212]
> >> 
> >> Slightly more helpful...
> >> 
> >> What group(s) does uid=17 have?
> > 
> > # id hacluster
> > uid=17(hacluster) gid=60(haclient) groups=60(haclient)
> > # ps -u hacluster
> >   PID TTY          TIME CMD
> >  2335 ?        00:00:00 cib
> >  2339 ?        00:00:00 attrd
> >  2340 ?        00:00:00 pengine
> >  2355 ?        00:00:00 crmd
> > # grep -E '^(Uid|Gid|Groups)' /proc/2355/status
> > Uid:    17      17      17      17
> > Gid:    0       0       0       0
> > Groups:
> > 
> > So, crmd runs with the 'hacluster' uid, but no associated
> > gid/groups. Can this be the problem?
> 
> Definitely. 
> 
> Urgh. Now that I look closer, I see the commits I was thinking of
> came _after_ 1.1.9, not before :-(
> 
> So basically you need an rc of 1.1.10 

1.1.10-rc5 works. Thanks a lot for the debugging help! :)

Jacek



Re: [Pacemaker] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-26 Thread Jacek Konieczny
On Wed, 26 Jun 2013 14:35:03 +1000
Andrew Beekhof  wrote:
> Urgh:
> 
> info    Jun 25 13:40:10 lrmd_ipc_connect(913):0: Connecting to lrmd
> trace   Jun 25 13:40:10 pick_ipc_buffer(670):0: Using max message size of 51200
> error   Jun 25 13:40:10 qb_sys_mmap_file_open(92):2147483648: couldn't open file /dev/shm/qb-lrmd-request-25209-25212-6-header: Permission denied (13)
> 
> useless :-(

:-(

> trace   Jun 25 13:40:10 gio_read_socket(366):0: 0xa6c140.4 1 (ref=1)
> trace   Jun 25 13:40:10 lrmd_ipc_accept(89):0: Connection 0xa6d110
> info    Jun 25 13:40:10 crm_client_new(276):0: Connecting 0xa6d110 for uid=17 gid=0 pid=25212 id=d771e06b-47e7-43a6-a447-63343870396e
> debug   Jun 25 13:40:10 handle_new_connection(735):2147483648: IPC credentials authenticated (25209-25212-6)
> debug   Jun 25 13:40:10 qb_ipcs_shm_connect(282):2147483648: connecting to client [25212]
> 
> Slightly more helpful...
> 
> What group(s) does uid=17 have?

# id hacluster
uid=17(hacluster) gid=60(haclient) groups=60(haclient)
# ps -u hacluster
  PID TTY          TIME CMD
 2335 ?        00:00:00 cib
 2339 ?        00:00:00 attrd
 2340 ?        00:00:00 pengine
 2355 ?        00:00:00 crmd
# grep -E '^(Uid|Gid|Groups)' /proc/2355/status
Uid:    17      17      17      17
Gid:    0       0       0       0
Groups:

So, crmd runs with the 'hacluster' uid, but no associated gid/groups.
Can this be the problem?

Jacek



Re: [Pacemaker] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-25 Thread Jacek Konieczny
On Tue, 25 Jun 2013 20:24:00 +1000
Andrew Beekhof  wrote:
> On 25/06/2013, at 5:56 PM, Jacek Konieczny  wrote:
> 
> > On Tue, 25 Jun 2013 10:50:14 +0300
> > Vladislav Bogdanov  wrote:
> >> I would recommend qb 0.14.4. 0.14.3 had at least one nasty bug which
> >> affects pacemaker.
> > 
> > Just tried that. It didn't help.
> 
> Can you turn on the blackbox please?

Sure.

> Details at http://blog.clusterlabs.org/blog/2013/pacemaker-logging/
> 
> That should produce a mountain of logs when the error occurs.

I have sent the logs to Andrew only, so as not to pollute the mailing list
(I'm not even sure the list accepts megabytes of attachments).

I was not able to find anything suspicious in the logs myself.

Jacek



Re: [Pacemaker] [corosync] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-25 Thread Jacek Konieczny
On Tue, 25 Jun 2013 10:50:14 +0300
Vladislav Bogdanov  wrote:
> I would recommend qb 0.14.4. 0.14.3 had at least one nasty bug which
> affects pacemaker.

Just tried that. It didn't help.

Jacek



Re: [Pacemaker] [corosync] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-25 Thread Jacek Konieczny
On Tue, 25 Jun 2013 08:59:19 +0200
Jacek Konieczny  wrote:
> On Tue, 25 Jun 2013 16:43:54 +1000
> Andrew Beekhof  wrote:
> > 
> > Ok, I was just checking Pacemaker was built for the running version
> > of libqb.
> 
> Yes it was. corosync 2.2.0 and libqb 0.14.0 both on the build system
> and on the cluster systems.
> 
> Hmm… I forgot libqb is a separate package… I guess I should try
> upgrading libqb now…

I have upgraded libqb to 0.14.4 and rebuilt both corosync and pacemaker
with it. No change:

Jun 25 09:52:32 dev1n2 crmd[22714]:    error: qb_sys_mmap_file_open: couldn't open file /dev/shm/qb-lrmd-request-22711-22714-5-header: Permission denied (13)
Jun 25 09:52:32 dev1n2 crmd[22714]:    error: qb_sys_mmap_file_open: couldn't open file /var/run/qb-lrmd-request-22711-22714-5-header: No such file or directory (2)
Jun 25 09:52:32 dev1n2 crmd[22714]:    error: qb_rb_open: couldn't create file for mmap
Jun 25 09:52:32 dev1n2 crmd[22714]:    error: qb_ipcc_shm_connect: qb_rb_open:REQUEST: No such file or directory (2)
Jun 25 09:52:32 dev1n2 crmd[22714]:    error: qb_ipcc_shm_connect: connection failed: No such file or directory (2)
Jun 25 09:52:32 dev1n2 crmd[22714]:  warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times

Greets,
Jacek



Re: [Pacemaker] [corosync] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-25 Thread Jacek Konieczny
On Tue, 25 Jun 2013 16:43:54 +1000
Andrew Beekhof  wrote:
> 
> Ok, I was just checking Pacemaker was built for the running version
> of libqb.

Yes it was. corosync 2.2.0 and libqb 0.14.0 both on the build system and
on the cluster systems.

Hmm… I forgot libqb is a separate package… I guess I should try
upgrading libqb now…

> What are the permissions on /dev/shm/ itself?

[root@dev1n2 ~]# ls -ld /dev/shm
drwxrwxrwt 2 root root 800 Jun 24 13:31 /dev/shm


Jacek



Re: [Pacemaker] [corosync] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-24 Thread Jacek Konieczny
On Tue, 25 Jun 2013 10:10:13 +1000
Andrew Beekhof  wrote:
> On 24/06/2013, at 9:31 PM, Jacek Konieczny  wrote:
> 
> > 
> > After I have upgraded Pacemaker from 1.1.8 to 1.1.9 on a node I get
> > the following errors in my syslog and Pacemaker doesn't seem to be
> > able to start services on this node.
> 
> What else did you upgrade?  libqb too?

Only the Pacemaker.

> > Any ideas what is going wrong here?
> > 
> > crmd is running with uid 17 ('hacluster'). I have tried to add it
> > to the 'uidgid' section of corosync conf or set uidgid.uid.17 with
> > corosync-cmapctl, but it didn't help.
> 
> Which distro is this?

PLD-Linux. And I am the packager of Corosync and Pacemaker in PLD-Linux,
so you can assume it is a custom build from sources.

Jacek



[Pacemaker] pacemaker/corosync: error: qb_sys_mmap_file_open: couldn't open file

2013-06-24 Thread Jacek Konieczny

After upgrading Pacemaker from 1.1.8 to 1.1.9 on a node I get the following
errors in my syslog, and Pacemaker doesn't seem to be able to start services
on this node.

Jun 24 13:19:44 dev1n2 crmd[5994]:    error: qb_sys_mmap_file_open: couldn't open file /dev/shm/qb-lrmd-request-5991-5994-5-header: Permission denied (13)
Jun 24 13:19:44 dev1n2 crmd[5994]:    error: qb_sys_mmap_file_open: couldn't open file /var/run/qb-lrmd-request-5991-5994-5-header: No such file or directory (2)
Jun 24 13:19:44 dev1n2 crmd[5994]:    error: qb_rb_open: couldn't create file for mmap
Jun 24 13:19:44 dev1n2 crmd[5994]:    error: qb_ipcc_shm_connect: qb_rb_open:REQUEST: No such file or directory (2)
Jun 24 13:19:44 dev1n2 crmd[5994]:    error: qb_ipcc_shm_connect: connection failed: No such file or directory (2)
Jun 24 13:19:44 dev1n2 crmd[5994]:  warning: do_lrm_control: Failed to sign on to the LRM 18 (30 max) times

I have googled for such messages and found nothing relevant.

Any ideas what is going wrong here?

crmd is running with uid 17 ('hacluster'). I have tried to add it to the
'uidgid' section of corosync conf or set uidgid.uid.17 with
corosync-cmapctl, but it didn't help.
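
For reference, this is roughly what was tried; the uidgid block goes into
corosync.conf, and the corosync-cmapctl invocation is from memory, so treat
both as a sketch rather than a verified recipe:

uidgid {
    uid: hacluster
    gid: haclient
}

# or, at runtime:
corosync-cmapctl -s uidgid.uid.17 u8 1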

Also:

# ls -l /dev/shm/qb-lrmd*
ls: cannot access /dev/shm/qb-lrmd*: No such file or directory

While on a working, Pacemaker 1.1.8 node:

# ls -l /dev/shm/qb-lrmd*
-rw------- 1 hacluster root 20480 Jun 21 10:36 /dev/shm/qb-lrmd-event-1182-1185-6-data
-rw------- 1 hacluster root  8248 Jun 21 10:36 /dev/shm/qb-lrmd-event-1182-1185-6-header
-rw------- 1 hacluster root 20480 Jun 21 10:36 /dev/shm/qb-lrmd-request-1182-1185-6-data
-rw------- 1 hacluster root  8252 Jun 21 10:36 /dev/shm/qb-lrmd-request-1182-1185-6-header
-rw------- 1 hacluster root 20480 Jun 21 10:36 /dev/shm/qb-lrmd-response-1182-1185-6-data
-rw------- 1 hacluster root  8248 Jun 21 10:36 /dev/shm/qb-lrmd-response-1182-1185-6-header

Greets,
Jacek



Re: [Pacemaker] Two resource nodes + one quorum node

2013-06-14 Thread Jacek Konieczny
On Thu, 13 Jun 2013 15:50:26 +0400
Andrey Groshev  wrote:
> 11.06.2013, 22:52, "Michael Schwartzkopff" :
>
> > On Tuesday, 11 June 2013, 22:33:32, Andrey Groshev wrote:
> >
> > > Hi,
> > >
> > > I want to make a Postgres cluster.
> > >
> > > As far as I understand, for the proper functioning of the cluster one
> > > must use a quorum (i.e. at least three nodes).
> >
> > No. Two nodes are enough. See: no-quorum-policy="ignore".
>
> Very big thanks, but I know it. :)

And what is wrong with a 2-node quorum provided by the corosync '2 node'
mode?

AFAIK it is better than no-quorum-policy="ignore", as it prevents a 'fence
loop': a node that has just booted up after being fenced won't fence the
other node or do anything else dangerous, because it cannot gain quorum
without the other node. But once the cluster has been booted properly with
both nodes, either of the two can fail and the cluster will continue in
degraded mode.
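
For reference, a minimal corosync.conf quorum section enabling that mode (as
far as I remember, two_node also implicitly enables wait_for_all, so the
cluster waits for both nodes on the first start):

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}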

I agree that using three nodes when two would be enough is overkill. When we
sell a 'high availability' solution, the customer expects two devices: one
'normal' and one 'backup'. It would be hard to sell them a third machine
(the same high-end server as the other two? or some 'stupid little box' just
for quorum-keeping?) that 'does nothing'. And it is not only the purchase
cost, but also power and rack space.

And even in a three-node cluster there are still many things that may go
wrong (with very little probability), so even if it is slightly more
reliable, it is probably not worth it. There is no fail-proof solution to
any IT problem. If a 2-node cluster gracefully handles 95% of possible
failures and a 3-node cluster would handle 98%, is it worth installing the
third machine?

Greets,
Jacek



Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-29 Thread Jacek Konieczny
On Fri, 29 Mar 2013 11:37:37 +1100
Andrew Beekhof  wrote:

> On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
>  wrote:
> > Hi John,
> > to get Corosync/Pacemaker running during anaconda installation, I
> > have created a configuration RPM package which does a few actions
> > before starting Corosync and Pacemaker.
> >
> > An excerpt of the post install of this RPM.
> > # mount /dev/shm if not already existing, otherwise openais cannot work
> > if [ ! -d /dev/shm ]; then
> >     mkdir /dev/shm
> >     mount /dev/shm
> > fi
> 
> Perhaps mention this to the corosync guys, it should probably go into
> their init script.

I don't think so. It is just a part of the modern Linux system environment.
corosync is not supposed to mount the root filesystem or /proc – mounting
/dev/shm is not its responsibility either.

BTW, the excerpt above assumes there is a /dev/shm entry in /etc/fstab.
Should that be added there by the corosync init script too?
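
(For the record, the kind of fstab entry that excerpt relies on is a standard
tmpfs line, for example:

tmpfs  /dev/shm  tmpfs  defaults  0  0

shown here only as an illustration.)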

Greets,
Jacek



Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Jacek Konieczny
On Mon, 25 Mar 2013 20:01:28 +0100
"Angel L. Mateo"  wrote:
> >quorum {
> > provider: corosync_votequorum
> > expected_votes: 2
> > two_node: 1
> >}
> >
> >Corosync will then manage quorum for the two-node cluster and
> >Pacemaker
> 
>   I'm using corosync 1.1 which is the one  provided with my
> distribution (ubuntu 12.04). I could also use cman.

I don't think corosync 1.1 can do that, but I guess in this case cman
should be able to provide this functionality.
 
> >can use that. You still need proper fencing to enforce the quorum
> >(both for pacemaker and the storage layer – dlm in case you use
> >clvmd), but no
> >extra quorum node is needed.
> >
>   I have configured a dlm resource used with clvm.
> 
>   One doubt... With this configuration, how is the split-brain problem
> handled?

The first node to notice that the other is unreachable will fence (kill) the
other, making sure it is the only one operating on the shared data. Even
though it is only half of the nodes, the cluster is considered quorate, as
the other node is known not to be running any cluster resources.

When the fenced node reboots, its cluster stack starts, but with no quorum
until it communicates with the surviving node again. So no cluster services
are started there until both nodes communicate properly and quorum is
recovered.
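
For the fencing part, a rough sketch of what it could look like in crmsh,
assuming IPMI-manageable servers; the agent, node names, addresses and
credentials below are placeholders, not a recommendation for any particular
hardware:

primitive fence-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.0.2.11 userid=admin passwd=secret \
        op monitor interval=60s
primitive fence-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.0.2.12 userid=admin passwd=secret \
        op monitor interval=60s
location l-fence-node1 fence-node1 -inf: node1
location l-fence-node2 fence-node2 -inf: node2
property stonith-enabled=true

The location constraints only make sure each fencing device does not run on
the node it is meant to kill.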

Greets,
Jacek



Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-25 Thread Jacek Konieczny
On Mon, 25 Mar 2013 13:54:22 +0100
>   My problem is how to avoid a split-brain situation with this
> configuration, without configuring a 3rd node. I have read about
> quorum disks, the external/sbd stonith plugin and other references, but
> I'm too confused by all this.
> 
>   For example, [1] mentions techniques to improve quorum with
> scsi reserve or a quorum daemon, but it doesn't explain how to do this in
> pacemaker. Or [2] talks about external/sbd.
> 
>   Any help?


With corosync 2.2 (2.1 too, I guess) you can use, in corosync.conf:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

Corosync will then manage quorum for the two-node cluster and Pacemaker can
use that. You still need proper fencing to enforce the quorum (both for
Pacemaker and for the storage layer – dlm, in case you use clvmd), but no
extra quorum node is needed.

There is one more thing, though: you need both nodes active to boot the
cluster, but after that, when one fails (and is fenced), the other may
continue, keeping quorum.
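
Once corosync is up, the votequorum state can be inspected at runtime with
the standard tool (the exact output differs between versions, but it shows
the vote counts, whether the cluster is quorate and whether the two-node flag
is active):

# corosync-quorumtool -s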

Greets,
Jacek



Re: [Pacemaker] Does LVM resouce agent conflict with UDEV rules?

2013-03-07 Thread Jacek Konieczny
On Wed, 06 Mar 2013 22:41:51 +0100
Sven Arnold  wrote:
> In fact, disabling the udev rule
> 
> SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="lvm*|LVM*", \
>     RUN+="watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y'"
> 
> seems to resolve the problem for me.

This rule looks like asking for problems in a clustered LVM environment.
It activates all volumes, non-exclusively, as soon as the PV becomes
available, which leaves no room for any management by the cluster software.
I don't think this came from LVM upstream; rather, it looks like some Ubuntu
invention.

Removing this rule and activating the volumes in a controlled manner
(e.g. via the LVM resource agent) seems the right way.
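
For illustration, controlled activation through the LVM resource agent could
look roughly like this in crmsh (the volume group name is a placeholder, and
exclusive activation assumes a working clvmd/cLVM setup):

primitive shared-vg ocf:heartbeat:LVM \
        params volgrpname=sharedvg exclusive=true \
        op monitor interval=30s timeout=30s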

Greets,
Jacek



[Pacemaker] Accessing CIB by user not 'root' and not 'hacluster'

2013-01-25 Thread Jacek Konieczny
Hi,

It used to be possible to access Pacemaker's CIB from any user in the
'haclient' group, but after one of the upgrades it stopped working (I didn't
care much about this issue back then, so I cannot recall the exact point).
Now I want to restore the cluster-state overview functionality in the UI of
my system, so I would like to fix this.

Currently I use Pacemaker 1.1.8 and Corosync 2.2.0. The problem is:

$ id
uid=993(sipgwui) gid=993(sipgwui) groups=993(sipgwui),60(haclient),109(lighttpd)
$ cibadmin -Q
Could not establish cib_rw connection: Permission denied (13)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

Strace shows this fails on:

open("/dev/shm/qb-cib_rw-control-12542-19960-19", O_RDWR) = -1 EACCES 
(Permission denied)

and:

$ ls -l /dev/shm/qb-cib_rw-control-12542-19960-19
-rw------- 1 hacluster root 24 Jan 25 10:31 /dev/shm/qb-cib_rw-control-12542-19960-19

I have googled around and found that the qb_ipcs_connection_auth_set()
function can be used to set the right permissions on the SHM file. I found
the right call in the Pacemaker sources (cib/callbacks.c), enclosed in an
'#if ENABLE_ACL' clause. My build was not compiled with ACL support, so I
have re-built it with ACL on.
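
(A quick way to confirm the rebuilt binaries really carry it, assuming this
Pacemaker version supports the switch, is to look for 'acls' in the feature
list printed by:

# pacemakerd --features
)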

Now the behaviour is the same, with one exception:

$ ls -l /dev/shm/qb-cib_rw-control-1488-5008-17
-rw-rw---- 1 hacluster root 24 Jan 25 10:19 /dev/shm/qb-cib_rw-control-1488-5008-17

The file is now group-accessible, but the group is still 'root' and not
'haclient', although confdefs.h contained:

#define CRM_DAEMON_GROUP "haclient"

The docs at http://clusterlabs.org/doc/acls.html state:

> The various tools for administering Pacemaker clusters (crm_mon, crm
> shell, cibadmin and friends, Python GUI, Hawk) can be used by the root
> user, or any user in the haclient group. By default, these users have
> full read/write access. 

This clearly is not the case.

Any ideas?

Greets,
Jacek



Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Jacek Konieczny
On Thu, 24 Jan 2013 09:04:14 +0100
Jacek Konieczny  wrote:
> I should probably upgrade my CIB somehow

Indeed. 'cibadmin --upgrade --force' solved my problem.
Thanks for all the hints.

Greets,
Jacek



Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-24 Thread Jacek Konieczny
Hi,

On Wed, 23 Jan 2013 18:52:20 +0100
Dejan Muhamedagic  wrote:
> > [quoted <node> XML stripped by the list archive]
> Not sure if id can start with a digit.

Corosync node id's are always digits-only.

> This should really work with versions >= v1.2.4

Yeah… I have looked into the crmsh code and it has explicit support for the
node 'type' attribute in Pacemaker 1.1.8. For some reason this does not
work for me on this cluster (no such problems on another cluster, which
was not upgraded, but set up on Pacemaker 1.1 from the beginning).

> Which schema do you validate against? Look for the validate-with
> attribute of the cib element. 

validate-with="pacemaker-1.0"

[schema fragment stripped by the list archive; in the pacemaker-1.0 schema
the node element has a mandatory 'type' attribute whose allowed values are
'normal', 'member' and 'ping']

So no, it is not optional here. But it is optional in the pacemaker-1.1 schema.
So the problem is crmsh uses the wrong schema for the XML it generates…

# cibadmin -Q | grep validate-with
[output stripped by the archive; it showed validate-with="pacemaker-1.0" on
the cib element]
So, the 'validate-with="pacemaker-1.0"' comes from the current CIB. crmsh keeps
that, but generates Pacemaker 1.1 XML, so the verification fails.

I should probably upgrade my CIB somehow, but still it seems there is a bug in
crmsh.

Greets,
Jacek



Re: [Pacemaker] CIB verification failure with any change via crmsh

2013-01-23 Thread Jacek Konieczny
On Wed, 23 Jan 2013 16:44:45 +0100
Lars Marowsky-Bree  wrote:

> On 2013-01-23T16:31:20, Jacek Konieczny  wrote:
> 
> > I have recently upgraded Pacemaker on one of my clusters from
> > 1.0.something to 1.1.8 and installed crmsh to manage it as I used
> > to.
> 
> It'd be helpful if you mentioned which crmsh version you installed.
> The errors you get suggest you need to update it.

You are right, I missed the information.

It was crmsh 1.2.1 and the first thing I tried was an upgrade to 1.2.4,
but this did not change a thing. So it is the same with crmsh 1.2.1 and
crmsh 1.2.4.

Greets,
Jacek




[Pacemaker] CIB verification failure with any change via crmsh

2013-01-23 Thread Jacek Konieczny
Hi,

I have recently upgraded Pacemaker on one of my clusters from
1.0.something to 1.1.8 and installed crmsh to manage it as I used to.

crmsh mostly works for me, until I try to change the configuration with
'crm configure'. Any change, even a trivial one, shows verification errors
and fails to commit:

> crm(live)configure# commit
> element instance_attributes: Relax-NG validity error : Expecting an element nvpair, got nothing
> element node: Relax-NG validity error : Expecting an element instance_attributes, got nothing
> element node: Relax-NG validity error : Element nodes has extra content: node
> element configuration: Relax-NG validity error : Invalid sequence in interleave
> element instance_attributes: Relax-NG validity error : Element node failed to validate attributes
> element cib: Relax-NG validity error : Element cib failed to validate content
>    error: main:   CIB did not pass DTD/schema validation
> Errors found during check: config not valid
>   -V may provide more details
> Do you still want to commit? no

It seems as if crmsh fails to parse the current configuration properly, as:

crm configure save xml /tmp/saved.xml ; crm_verify -V --xml-file /tmp/saved.xml

fails the same way:

> /tmp/saved.xml:19: element instance_attributes: Relax-NG validity error : Expecting an element nvpair, got nothing
> /tmp/saved.xml:18: element node: Relax-NG validity error : Expecting an element instance_attributes, got nothing
> /tmp/saved.xml:18: element node: Relax-NG validity error : Element nodes has extra content: node
> /tmp/saved.xml:3: element configuration: Relax-NG validity error : Invalid sequence in interleave
> /tmp/saved.xml:19: element instance_attributes: Relax-NG validity error : Element node failed to validate attributes
> /tmp/saved.xml:2: element cib: Relax-NG validity error : Element cib failed to validate content
>    error: main:   CIB did not pass DTD/schema validation
> Errors found during check: config not valid
>   -V may provide more details

But:

cibadmin -Q > /tmp/good.xml ; crm_verify --xml-file /tmp/good.xml

shows no error.

Any ideas?

Looking into the 'invalid' XML file gives me no hints, as line 18 is the
first <node> in:

[the <nodes> section of the saved XML, stripped by the list archive]
which looks quite right too me.

Oh… now I see the difference from the current CIB. The <node> elements miss
the type="normal" attribute. After adding those to the crmsh-generated XML,
everything works. So it is a crmsh bug, right?

And the errors reported by crm_verify are very misleading.
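
For completeness, this is how the schema in use can be checked and bumped
(the upgrade command is the one that eventually solved it, as mentioned in
the follow-up above):

# cibadmin -Q | grep validate-with
# cibadmin --upgrade --force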

Greets,
Jacek



Re: [Pacemaker] stonithd crash on exit

2012-11-01 Thread Jacek Konieczny
On Thu, Nov 01, 2012 at 11:05:04AM +1100, Andrew Beekhof wrote:
> On Thu, Nov 1, 2012 at 7:40 AM, Jacek Konieczny  wrote:
> > On Wed, Oct 31, 2012 at 05:33:03PM +1100, Andrew Beekhof wrote:
> >> I havent seen that before. What version?
> >
> > Pacemaker 1.1.8, corosync 2.1.0, cluster-glue 1.0.11
> 
> I think you want these two patches:
> 
> https://github.com/beekhof/pacemaker/commit/7282066
> https://github.com/beekhof/pacemaker/commit/280926a
> 
> They came after the official release

These help, indeed. Thanks!



Re: [Pacemaker] stonithd crash on exit

2012-10-31 Thread Jacek Konieczny
On Wed, Oct 31, 2012 at 05:33:03PM +1100, Andrew Beekhof wrote:
> I haven't seen that before. What version?

Pacemaker 1.1.8, corosync 2.1.0, cluster-glue 1.0.11


> On Wed, Oct 31, 2012 at 12:42 AM, Jacek Konieczny  wrote:
> > Hello,
> >
> > Probably this is not a critical problem, but it became annoying during
> > my cluster setup/testing time:
> >
> > Whenever I restart corosync with 'systemctl restart corosync.service' I
> > get a message about stonithd crashing with SIGSEGV:
> >
> >> stonithd[3179]: segfault at 10 ip 00403144 sp 7fffe83d6370 error 4 in stonithd (deleted)[40+13000]
> >> stonithd/3179: potentially unexpected fatal signal 11.
> >
> > GDB shows this:
> >
> >> Program received signal SIGTERM, Terminated.
> >> 0x7fd6ec319c18 in poll () from /lib64/libc.so.6
> >> (gdb) signal SIGTERM
> >> Continuing with signal SIGTERM.
> >>
> >> Program received signal SIGSEGV, Segmentation fault.
> >> 0x00403144 in main (argc=<optimized out>, argv=0x7fff4648f318) at main.c:933
> >> 933 cluster.hb_conn->llc_ops->delete(cluster.hb_conn);
> >> (gdb) bt
> >> #0  0x00403144 in main (argc=<optimized out>, argv=0x7fff4648f318) at main.c:933




[Pacemaker] stonithd crash on exit

2012-10-30 Thread Jacek Konieczny
Hello,

Probably this is not a critical problem, but it became annoying during
my cluster setup/testing time:

Whenever I restart corosync with 'systemctl restart corosync.service' I
get a message about stonithd crashing with SIGSEGV:

> stonithd[3179]: segfault at 10 ip 00403144 sp 7fffe83d6370 error 4 in stonithd (deleted)[40+13000]
> stonithd/3179: potentially unexpected fatal signal 11.

GDB shows this:

> Program received signal SIGTERM, Terminated.
> 0x7fd6ec319c18 in poll () from /lib64/libc.so.6
> (gdb) signal SIGTERM
> Continuing with signal SIGTERM.
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x00403144 in main (argc=<optimized out>, argv=0x7fff4648f318) at main.c:933
> 933 cluster.hb_conn->llc_ops->delete(cluster.hb_conn);
> (gdb) bt
> #0  0x00403144 in main (argc=<optimized out>, argv=0x7fff4648f318) at main.c:933

Greets,
Jacek
