Re: [Pacemaker] Node fails to rejoin cluster

2013-02-14 Thread Proskurin Kirill

On 02/08/2013 04:59 AM, Andrew Beekhof wrote:


Suggests it's a bug that got fixed recently.  Keep an eye out for
1.1.9 in the next week or so (or you could try building from source if
you're in a hurry).


Will 1.1.9 be CentOS 5.x friendly?

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-17 Thread Proskurin Kirill

On 04/17/2012 03:46 PM, ruslan usifov wrote:

2012/4/17 Andreas Kurz andr...@hastexo.com

On 04/14/2012 11:14 PM, ruslan usifov wrote:
  Hello
 
  I removed 2 nodes from the cluster, with the following sequence:
 
  crm_node --force -R <id of node1>
  crm_node --force -R <id of node2>
  cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
  cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
  cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
  cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
 
 
  The nodes are deleted after this, but if, for example, I restart (reboot)
  one of the existing nodes in the working cluster, the deleted nodes appear
  again in OFFLINE state.


I had this problem some time ago.
I solved it with something like this:

crm node delete NODENAME
crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'
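
To double-check that the nodes are really gone afterwards (a hedged sketch,
not part of the original recipe; both commands exist in the crm_node and
cibadmin tools of that era):

# list the nodes the cluster layer still knows about
crm_node -l
# show what is left in the CIB nodes section
cibadmin --query --obj_type nodes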


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Questions about reasonable cluster size...

2011-10-20 Thread Proskurin Kirill

On 10/20/2011 03:15 AM, Steven Dake wrote:

On 10/19/2011 01:50 PM, Alan Robertson wrote:

Hi,

I have an application where having a 12-node cluster with about 250
resources would be desirable.

Is this reasonable?  Can Pacemaker+Corosync be expected to reliably
handle a cluster of this size?

If not, what is the current recommendation for maximum number of nodes
and resources?


I started to have problems with 10+ nodes. AFAIK it is heavily dependent
on the corosync configuration. You should test it.
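
For what it's worth, the knobs that usually matter at that scale are the
totem timers in corosync.conf - a rough sketch only, with illustrative
values that must be tested on your own network (the conf file later in this
archive uses token: 2500 and consensus: 3000):

totem {
        # give the token more time to circulate a larger ring
        token: 5000
        token_retransmits_before_loss_const: 10
        # membership agreement window; must be larger than token
        consensus: 6000
}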




--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-10-17 Thread Proskurin Kirill

Hello Beekhof.

First of all, I don't want to waste your time, but this problem is really 
important for me, I can't solve it by myself, and it looks like a bug or 
something. I think I failed at describing this problem, so I will try again 
and summarize the whole previous conversation.


I have a situation where Pacemaker thinks a resource is running but it is 
not. The agent, run from the console, says it is not running.

I have no fencing, and this resource failed to stop due to a timeout.
You said that this is the reason for the situation. But I ran an 
experiment and found that if Pacemaker can't stop a resource, it makes it unmanaged.


My resource was not unmanaged - the cluster just said it was running, 
and I had no indication of a problem.


We already fixed these non-stoppable scripts, but I want to be sure I 
will not run into this problem any more.


Below are some quotes from the previous conversation, if needed.

12.10.2011 6:11, Andrew Beekhof wrote:

On 10/03/2011 05:32 AM, Andrew Beekhof wrote:


corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1



2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is
started but it is not.


RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).


But as you can see below, the agent returns 7.


It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.


I made an experiment.

I created a script that does not die on SIGTERM:

#!/usr/bin/perl
$SIG{TERM} = 'IGNORE'; sleep 1 while 1;

And ran it under Pacemaker.
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
op monitor interval=20 timeout=5 on-fail=restart \
params binfile=/tmp/test-kill-15.pl external_pidfile=1

2) Same but on-fail=block

3) Same but with meatware stonith.

Each time I do:
crm resource stop test-kill-15.pl

And in cases 1 and 2, the resource becomes unmanaged.


Because you've not configured any fencing devices.
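
For reference, a minimal fencing setup like the meatware one used in test 3
might look like this in crm syntax (a sketch only; the hostlist values are
placeholders for your real node names):

primitive st-meat stonith:meatware \
        params hostlist="node1 node2"
clone fencing-clone st-meat
property stonith-enabled="true"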



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-10-05 Thread Proskurin Kirill

On 10/05/2011 04:19 AM, Andrew Beekhof wrote:

On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

On 10/03/2011 05:32 AM, Andrew Beekhof wrote:


corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1



2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is
started but it is not.


RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).


But as you can see below, the agent returns 7.


It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.


Hm, I think in this situation it must become unmanaged, no?

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-10-05 Thread Proskurin Kirill

On 10/05/2011 04:19 AM, Andrew Beekhof wrote:

On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

On 10/03/2011 05:32 AM, Andrew Beekhof wrote:


corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1



2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is
started but it is not.


RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).


But as you can see below, the agent returns 7.


It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.


I made an experiment.

I created a script that does not die on SIGTERM:

#!/usr/bin/perl
$SIG{TERM} = 'IGNORE'; sleep 1 while 1;

And ran it under Pacemaker.
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
op monitor interval=20 timeout=5 on-fail=restart \
params binfile=/tmp/test-kill-15.pl external_pidfile=1

2) Same but on-fail=block

3) Same but with meatware stonith.

Each time I do:
crm resource stop test-kill-15.pl

And in cases 1 and 2, the resource becomes unmanaged.
In case 3, I get a stonith.

From IRC:
(12:20:44 PM) beekhof: Oloremo: what the hell is the cluster supposed to 
do if stop fails and you dont want fencing?  it cant start it anywhere 
because its still active in the original location
(12:30:09 PM) Oloremo: I get the point, really.  But may be it should 
make it unmanaged?


And it does.

So can I assume that my problem with monitoring is still not clear? I 
don't get unmanaged - the cluster just thinks the resource is started but 
it's not.



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-10-03 Thread Proskurin Kirill

On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1



2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is
started but it is not.


RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).


But as you can see below, the agent returns 7.


We use a slightly modified version of the anything agent for our
scripts, but it is aware of OCF return codes and other stuff.

I run monitoring by our agent from the console:
# env -i OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
/usr/lib/ocf/resource.d/mail.ru/generic monitor
# generic[14992]: DEBUG: default monitor : 7

So our agent says it is not running, but Pacemaker still thinks it does.
It ran like this for 2 days until I was forced to clean it up - and then
it found out within seconds that it was not running.


Did you configure a recurring monitor operation?


Of course. I included my primitive configuration in the original letter;
it has:
op monitor interval=30 timeout=300 on-fail=restart \

This has happened a third time now, and this time I found this in the logs:
Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice:
unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2,
magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
mysender34.mail.ru

The resource name is different because these logs are from the third 
occurrence, but the problem is the same.




3)
This one is confusing and dangerous.

I use failure-timeout on most resources to wipe out temporary warning
messages from crm_verify -LV - I use it for monitoring the cluster. All
works well, but I found this:

1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any other node.
3) It gives up and stops. There are many errors about all this in crm_verify
-LV - and that is good.
4) failure-timeout kicks in and... wipes out all the errors.
5) We have a stopped resource and all errors are wiped. And we don't know
whether it was stopped by the hands of an admin or because of errors.



I think failure-timeout should not apply to a stopped resource.
Any chance to avoid this?



Not sure why you think this is dangerous; the cluster is doing exactly
what you told it to.
If you want resources to stay stopped, either set failure-timeout=0
(disabled) or set the target-role to Stopped.
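
In crm shell terms either variant is a one-liner (a sketch, using the
resource name from the original letter):

# disable failure expiry for this resource
crm resource meta dialogues_notify.pl set failure-timeout 0
# or pin it stopped explicitly (sets target-role=Stopped)
crm resource stop dialogues_notify.pl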


No, I want to use failure-timeout, but without wiping out errors when a 
resource has already been stopped by Pacemaker because of errors rather 
than by an admin's hands.


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Ignoring expired failure

2011-09-30 Thread Proskurin Kirill

Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

I ran into the monitoring failure again and still don't know why it
happens. Details are here:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09986.html

Some info:
Twice I have run into a situation where Pacemaker thinks a resource is 
started but it is not. We use a slightly modified version of the anything 
agent for our scripts, but it is aware of OCF return codes and other stuff.


I run monitoring by our agent from the console:

# env -i OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_binfile=/usr/local/mpop/bin/my/tranprocessor.pl \
/usr/lib/ocf/resource.d/mail.ru/generic monitor

# generic[14992]: DEBUG: default monitor : 7


But this time I see this in the logs:
Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: 
unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2, 
magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on 
mysender34.mail.ru


So Pacemaker knows the resource may be down but is ignoring it. Why?

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-09-29 Thread Proskurin Kirill

Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with ver: 1

I ran into some problems this week. I am not sure whether I should have 
written 3 separate letters - sorry if so.


1)
I set a node to standby and then back to online. And after this I get this:

2643 root RT 0 11424 2052 1744 R 100.9 0.0 657502:53 
/usr/lib/heartbeat/stonithd
2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43 
/usr/lib/heartbeat/cib
2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45 
/usr/lib/heartbeat/crmd
2646 hacluste RT 0 11764 2240 1904 R 99.9 0.0 657502:49 
/usr/lib/heartbeat/attrd


I was in a hurry and it's a production server, so I killed these processes 
and stopped pacemakerd and corosync, then started them again, and all was OK.
I suppose pacemakerd and corosync were still running while these problems 
occurred. I assume this because when I ran stop via their init scripts it 
took some time until they stopped.


Any hints?

2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is 
started but it is not. We use a slightly modified version of the anything 
agent for our scripts, but it is aware of OCF return codes and other stuff.


I run monitoring by our agent from the console:
# env -i OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
/usr/lib/ocf/resource.d/mail.ru/generic monitor

# generic[14992]: DEBUG: default monitor : 7

So our agent says it is not running, but Pacemaker still thinks it does. 
It ran like this for 2 days until I was forced to clean it up - and then 
it found out within seconds that it was not running.


This is a really scary situation. I can't reproduce it, but I have already 
hit it twice... maybe more times that I did not see, who knows.


I attach our agent script, and this is how we run it:

primitive dialogues_notify.pl ocf:mail.ru:generic \
op monitor interval=30 timeout=300 on-fail=restart \
op start interval=0 timeout=300 \
op stop interval=0 timeout=300 \
params binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
meta failure-timeout=120

3)
This one is confusing and dangerous.

I use failure-timeout on most resources to wipe out temporary warning 
messages from crm_verify -LV - I use it for monitoring the cluster. All 
works well, but I found this:


1) A resource can't start on a node and migrates to the next one.
2) It can't start there either, nor on any other node.
3) It gives up and stops. There are many errors about all this in 
crm_verify -LV - and that is good.

4) failure-timeout kicks in and... wipes out all the errors.
5) We have a stopped resource and all errors are wiped. And we don't know 
whether it was stopped by the hands of an admin or because of errors.


I think failure-timeout should not apply to a stopped resource.
Any chance to avoid this?

--
Best regards,
Proskurin Kirill
#!/bin/sh

###
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

if [ -n "$OCF_RESKEY_binfile" ]; then
basename=`basename ${OCF_RESKEY_binfile} .pl`
OCF_RESKEY_pidfile_default=/var/run/${basename}.pid
OCF_RESKEY_logfile_default=/var/log/${basename}.log
fi
OCF_RESKEY_external_pidfile_default=0
OCF_RESKEY_core_dump_default=0

: ${OCF_RESKEY_pidfile=$OCF_RESKEY_pidfile_default}
: ${OCF_RESKEY_logfile=$OCF_RESKEY_logfile_default}
: ${OCF_RESKEY_external_pidfile=$OCF_RESKEY_external_pidfile_default}
: ${OCF_RESKEY_core_dump=$OCF_RESKEY_core_dump_default}

###

generic_usage() {
cat <<END
usage: $0 {start|stop|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
}

generic_meta() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="generic">
<version>1.0</version>
<longdesc lang="en">
Resource agent for any script
</longdesc>
<shortdesc lang="en">Resource agent for any script</shortdesc>

<parameters>
<parameter name="binfile" required="1">
<longdesc lang="en">
The full name of the binary to be executed.
</longdesc>
<shortdesc lang="en">Full path name of the binary to be executed</shortdesc>
<content type="string" />
</parameter>
<parameter name="options" required="0">
<longdesc lang="en">
Command line options to pass to the binary
</longdesc>
<shortdesc lang="en">Command line options</shortdesc>
<content type="string" />
</parameter>
<parameter name="pidfile">
<longdesc lang="en">
Path to pidfile. Default is: /var/run/\${basename}.pid
</longdesc>
<shortdesc lang="en">Path to pidfile</shortdesc>
<content type="string" default="${OCF_RESKEY_pidfile_default}"/>
</parameter>
<parameter name="logfile">
<longdesc lang="en">
Path to logfile. Default is: /var/log/\${basename}.log
</longdesc>
<shortdesc lang="en">Path to logfile</shortdesc>
<content type="string" default="${OCF_RESKEY_logfile_default}"/>
</parameter>
<parameter name="external_pidfile">
<longdesc lang="en">
Write pidfile by ocf-agent, not running script.
</longdesc>
<shortdesc lang="en">Who writes pidfile</shortdesc>
<content type="boolean" default="${OCF_RESKEY_external_pidfile_default}"/>

Re: [Pacemaker] Cluster type is: corosync

2011-08-01 Thread Proskurin Kirill

01.08.2011 5:42, Andrew Beekhof wrote:

Finally, tell Corosync to load the Pacemaker plugin.


As I said before:
And I run pacemakerd after corosync starts.

Anyway, the problem is solved for me.

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster type is: corosync

2011-08-01 Thread Proskurin Kirill

02.08.2011 1:00, Andrew Beekhof wrote:

On Mon, Aug 1, 2011 at 10:23 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

01.08.2011 5:42, Andrew Beekhof wrote:


Finally, tell Corosync to load the Pacemaker plugin.


As I said before:
And I run pacemakerd after corosync starts.


The two are not mutually exclusive.
You need the plugin AND pacemakerd.


I have service.d/pcmk just like in the example.
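
For reference, that file normally contains just the plugin stanza - a sketch
of /etc/corosync/service.d/pcmk as in the Clusters from Scratch example
(ver: 1 means the daemons are started by pacemakerd, per the upgrade thread
below):

service {
        name: pacemaker
        ver: 1
}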


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster type is: corosync

2011-07-27 Thread Proskurin Kirill

27.07.2011 6:41, Andrew Beekhof wrote:


Ok. And did you add the pacemaker configuration options to corosync's
config file?



I attach our corosync.conf. It is the same on all nodes except the IP address.


You missed a step from:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/s-configure-corosync.html


Which one?
In a previous conversation Steven Dake said I can use an exact IP address 
if I wish (and I do, because some nodes may have more than one IP address 
on the same network).

And I run pacemakerd after corosync starts.

I can't say for sure, but it seems I fixed it by setting compatibility: 
none. After this it started to report Cluster type: 'openais'.



Pacemaker is blank now - no configuration at all.

Online nodes:
[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id=cib-bootstrap-options \
dc-version=1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
cluster-infrastructure=openais \
expected-quorum-votes=6


Offline nodes (Cluster type is: corosync)
[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#





pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have same rpms.


On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
k.prosku...@corp.example.com  wrote:


Hello again!

I hope I'm not flooding too much here, but I have another problem.

I installed the same RPMs of corosync, openais, pacemaker, and cluster-glue
on all nodes. I checked it twice.

And then I start some of them - they can't connect to the cluster and
stay offline. In the logs I see that they see the other nodes and
connectivity is OK. But I found this difference:

Online nodes in cluster have:
[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.example.com stonith-ng: [3499]: info:
get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com attrd: [3502]: info:
get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com cib: [3500]: info:
get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.example.com crmd: [3504]: info:
get_cluster_type:
Cluster type is: 'openais'.

Offline have:
[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.example.com stonith-ng: [9028]: info:
get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com attrd: [9031]: info:
get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com cib: [9029]: info:
get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.example.com crmd: [9033]: info:
get_cluster_type:
Cluster type is: 'corosync'.

What`s wrong and how can I fix it?



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Upgrading from 1.0 to 1.1

2011-07-27 Thread Proskurin Kirill

27.07.2011 5:56, Andrew Beekhof wrote:

On Tue, Jul 19, 2011 at 5:40 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

On 07/19/2011 03:22 AM, Andrew Beekhof wrote:


On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill
k.prosku...@corp.mail.ruwrote:


Hello all.

I found that I was using corosync with pacemaker ver: 0 with pacemaker
1.1.5 installed - i.e. without starting pacemakerd.

Sounds wrong. :-)
So I tried to upgrade.
I shut down one node and changed 0 to 1 in service.d/pcmk.
Started corosync and then started pacemakerd via the init script.

But this node stays online, and on the cluster's DC I see:
cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255)
from mysender10.example.com: not in our membership


That's odd.  The only thing you changed was ver: 0 to ver: 1?


Yes, only this. To make it more clear: I have 4 nodes with ver: 0, tried
to add one with ver: 1, and got this.

Well, I shut down all nodes, changed them all to 1, and started them up, 
and all was OK. Not a really good way to upgrade, but I didn't have time.


Do you still have the logs for the failure case?
I'd really like to see them.


No, I don't. But some time ago I got the same error in the vice-versa 
situation - when I tried to add a node with ver: 0 to a cluster where all 
nodes are ver: 1.

Anyway, my cluster is down now, so I can do some tests. I will send the 
logs to the mailing list if I reproduce this situation again.


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster type is: corosync

2011-07-26 Thread Proskurin Kirill

On 07/26/2011 11:00 AM, Andrew Beekhof wrote:

On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill
k.prosku...@corp.example.com  wrote:

25.07.2011 10:10, Andrew Beekhof wrote:


Which packages are you using?


They are built by me from your official source repository.


Ok. And did you add the pacemaker configuration options to corosync's
config file?



I attach our corosync.conf. It is the same on all nodes except the IP address.
Pacemaker is blank now - no configuration at all.

Online nodes:
[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id=cib-bootstrap-options \
dc-version=1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
cluster-infrastructure=openais \
expected-quorum-votes=6


Offline nodes (Cluster type is: corosync)
[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#





pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have same rpms.


On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
k.prosku...@corp.example.comwrote:


Hello again!

I hope I'm not flooding too much here, but I have another problem.

I installed the same RPMs of corosync, openais, pacemaker, and cluster-glue
on all nodes. I checked it twice.

And then I start some of them - they can't connect to the cluster and
stay offline. In the logs I see that they see the other nodes and
connectivity is OK. But I found this difference:

Online nodes in cluster have:
[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.example.com stonith-ng: [3499]: info:
get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com attrd: [3502]: info: get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.example.com cib: [3500]: info: get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.example.com crmd: [3504]: info: get_cluster_type:
Cluster type is: 'openais'.

Offline have:
[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.example.com stonith-ng: [9028]: info:
get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com attrd: [9031]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.example.com cib: [9029]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.example.com crmd: [9033]: info: get_cluster_type:
Cluster type is: 'corosync'.

What`s wrong and how can I fix it?


--
Best regards,
Proskurin Kirill
totem {
version: 2
token: 2500
token_retransmits_before_loss_const: 10
join: 100
consensus: 3000
vsftype: none
max_messages: 20
send_join: 45
secauth:off
fail_recv_const: 5000
 
interface {
ringnumber: 0
bindnetaddr: 10.6.1.155
mcastaddr: 239.255.1.1
mcastport: 5405
ttl: 31
}

}
 
logging {
fileline: off
to_syslog: no
to_stderr: no
to_logfile: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
}
 
amf {
mode: disabled
}
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster type is: corosync

2011-07-25 Thread Proskurin Kirill

25.07.2011 10:10, Andrew Beekhof wrote:

Which packages are you using?


They are built by me from your official source repository.
pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have same rpms.


On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

Hello again!

I hope I'm not flooding too much here, but I have another problem.

I installed the same RPMs of corosync, openais, pacemaker, and cluster-glue
on all nodes. I checked it twice.

And then I start some of them - they can't connect to the cluster and stay
offline. In the logs I see that they see the other nodes and connectivity
is OK. But I found this difference:

Online nodes in cluster have:
[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info:
get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type:
Cluster type is: 'openais'.

Offline have:
[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info:
get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type:
Cluster type is: 'corosync'.

What`s wrong and how can I fix it?


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster type is: corosync

2011-07-25 Thread Proskurin Kirill

Hello.

I updated openais to the latest 1.1.4, but this did not help at all.
Google knows nothing about it. I am running out of ideas.

25.07.2011 13:18, Proskurin Kirill wrote:

25.07.2011 10:10, Andrew Beekhof wrote:

Which packages are you using?


They are built by me from your official source repository.
pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have same rpms.


On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
k.prosku...@corp.mail.ru wrote:

Hello again!

I hope I'm not flooding too much here, but I have another problem.

I installed the same RPMs of corosync, openais, pacemaker, and cluster-glue
on all nodes. I checked it twice.

And then I start some of them - they can't connect to the cluster and
stay offline. In the logs I see that they see the other nodes and
connectivity is OK. But I found this difference:

Online nodes in cluster have:
[root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info:
get_cluster_type: Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info:
get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type:
Cluster type is: 'openais'.
Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type:
Cluster type is: 'openais'.

Offline have:
[root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info:
get_cluster_type: Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type:
Cluster type is: 'corosync'.
Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type:
Cluster type is: 'corosync'.

What`s wrong and how can I fix it?



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist

2011-07-22 Thread Proskurin Kirill

Hello all.


pacemaker-1.1.5
corosync-1.4.0

4 nodes in the cluster: 3 online, 1 not.
In logs:

Jul 22 11:50:23 my106.example.com crmd: [28030]: info: 
pcmk_quorum_notification: Membership 0: quorum retained (0)
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: 
Delaying start, no membership data (0010)
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: 
config_query_callback: Shutdown escalation occurs after: 120ms
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: 
config_query_callback: Checking for expired actions every 90ms
Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started: 
Delaying start, no membership data (0010)
Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: 
Connected to the CIB after 1 signon attempts
Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect: 
Sending full refresh
Jul 22 11:52:18 corosync [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Jul 22 11:52:18 corosync [CPG   ] chosen downlist: sender r(0) 
ip(10.3.1.107) ; members(old:4 left:1)
Jul 22 11:52:18 corosync [MAIN  ] Completed service synchronization, 
ready to provide service.
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: 
send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: 
send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR: 
send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist




DC:

Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 my107.example.com pacemakerd: [22388]: info: update_node_processes: Node my106.example.com now has process list: 0002 (was 0012)
Jul 22 11:50:07 my107.example.com attrd: [22397]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com cib: [22395]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com stonith-ng: [22394]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 my107.example.com crmd: [22399]: info: crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=0002 (new)
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee


Is this a problem?

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist

2011-07-22 Thread Proskurin Kirill

22.07.2011 20:30, Steven Dake wrote:

On 07/22/2011 01:15 AM, Proskurin Kirill wrote:

Hello all.


pacemaker-1.1.5
corosync-1.4.0
11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee


There is a problem?



Does your retransmit list continually display e4 e5 etc for rest of
cluster lifetime, or is this short lived?


Yes, it continually displays this.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Upgrading from 1.0 to 1.1

2011-07-19 Thread Proskurin Kirill

On 07/19/2011 03:22 AM, Andrew Beekhof wrote:

On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

Hello all.

I found that I was using corosync with pacemaker ver: 0 with pacemaker
1.1.5 installed - i.e. without starting pacemakerd.

Sounds wrong. :-)
So I tried to upgrade.
I shut down one node and changed 0 to 1 in service.d/pcmk.
Started corosync and then started pacemakerd via the init script.

But this node stays online, and on the cluster's DC I see:
cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message (255)
from mysender10.example.com: not in our membership


That's odd.  The only thing you changed was ver: 0 to ver: 1?


Yes, only this. To make it more clear: I have 4 nodes with ver: 0, tried 
to add one with ver: 1, and got this.

Well, I shut down all nodes, changed them all to 1, and started them up, 
and all was OK. Not a really good way to upgrade, but I didn't have time.



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Upgrading from 1.0 to 1.1

2011-07-15 Thread Proskurin Kirill

Hello all.

I found that I was using corosync with pacemaker ver: 0 with pacemaker 
1.1.5 installed - i.e. without starting pacemakerd.


Sounds wrong. :-)
So I tried to upgrade.
I shut down one node and changed 0 to 1 in service.d/pcmk.
Started corosync and then started pacemakerd via the init script.

But this node stays online, and on the cluster's DC I see:
cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message 
(255) from mysender10.example.com: not in our membership


Is there a way to upgrade all nodes one by one, without shutting down the 
whole cluster?


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Timeout, interval onfail questions

2011-07-11 Thread Proskurin Kirill

On 07/10/2011 02:53 PM, Lars Marowsky-Bree wrote:

2) I wish my resources would *never* go to failed status. I found the
on-fail=restart option, but it does not seem to work as I expected.

So, for example, if some node is under high LA and monitoring of a
resource fails, Pacemaker will try to run the stop action, but
because of the high LA it will time out too, and Pacemaker decides the
resource is unmanaged. How can I tune this behaviour? I wish
Pacemaker would not give up and would try again.


Repeating the same thing over and over again and expecting the result to
change is one of the clinical tests for irrational and insane behaviour.
So pacemaker doesn't do that. ;-) stop isn't supposed to fail, we
don't support retrying it, and will not.


:-)
Well, this is not quite true, because the environment can change - e.g. 
the LA starts to go down. Well, I think I will use some cron job for this.
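
Something like this is what I have in mind (a hedged sketch; the resource
name is a placeholder, and crm_resource --cleanup clears the failure state
so the cluster will try again):

# /etc/crontab entry: retry the failed resource every 5 minutes
*/5 * * * * root crm_resource --resource myresource --cleanup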



Fix it so that it doesn't fail; if it fails due to a too short timeout,
make the timeout longer.


The sad thing is that this host has huge LA from time to time and we 
can't fix that in the near future. A longer timeout does not really help 
here (3m by now)... well, I haven't really tried making it 10m or so.


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Timeout, interval onfail questions

2011-07-09 Thread Proskurin Kirill

Hello all!

I am trying to understand all the logic of Pacemaker and have some questions.
1) There is an interval and a timeout for monitoring a resource.
Situation:
The interval is 20s, the timeout is 60s.

A monitoring action is started, but the node is under load and it takes 
more than 20 seconds to get the result - will a second monitoring action 
start, or does Pacemaker understand that it already has one running?


2) I wish my resources would *never* go to failed status. I found the 
on-fail=restart option, but it does not seem to work as I expected.


So, for example, if some node is under high LA and monitoring of a resource 
fails, Pacemaker will try to run the stop action, but because of the high LA 
it will time out too, and Pacemaker decides the resource is unmanaged. 
How can I tune this behaviour? I wish Pacemaker would not give up and would 
try again.


--
Best regards,
Proskurin Kirill



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] SNMP monitoring

2011-07-06 Thread Proskurin Kirill

On 07/05/2011 12:05 PM, Raoul Bhatia [IPAX] wrote:

Proskurin, if you get snmp working, would you kindly post your
configuration to the mailinglist?

the snmp-topic has popped up several times and it would be nice if
we got a working config in the mailinglist archive - or better: in the
wiki - as a reference.


OK, I got it working.

You need:
snmptrapd
pacemaker with snmp support

snmptrapd.conf:
disableAuthorization yes

traphandle  SNMPv2-SMI::enterprises.32723.1.1   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.2   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.3   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.4   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.5   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.6   /tmp/trap.sh
traphandle  SNMPv2-SMI::enterprises.32723.1.7   /tmp/trap.sh

/tmp/trap.sh - any sh script to parse the result.
For example:
#!/bin/sh

# snmptrapd feeds the handler the host and ip on the first two lines
read host
read ip

# then one "oid value" pair per line
while read oid val
do
echo -e "$host $ip == $oid == $val\n" >> /tmp/trap.out
done

crm_mon --daemonize -S <snmptrapd-ip-addr> - to send the traps.

OR you can use your monitoring system and send traps directly to it.

P.S. This works for me on CentOS 5.x with pacemaker 1.1.5 and snmp-5.3.2.

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] SNMP monitoring

2011-07-04 Thread Proskurin Kirill

Hello all.

I'm trying to figure out how to monitor the cluster via SNMP.
I understand that I need to use crm_mon -S <snmptrapd-ip>, but I'm kind 
of new to SNMP and still can't understand how to get it working.


Could someone write a simple example, like an snmptrapd config?
Or maybe a more detailed one to put into the Pacemaker docs (the SNMP 
chapter is empty there)?


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Not connected to AIS

2011-06-28 Thread Proskurin Kirill

On 06/27/2011 09:15 AM, Andrew Beekhof wrote:

On Fri, Jun 24, 2011 at 6:56 PM, Proskurin Kirill
k.prosku...@corp.mail.ru  wrote:

Hello.

I have a strange problem.
One node in the cluster is not working right.


In logs:
Jun 23 20:25:25 mysender39.example.com lrmd: [10371]: WARN: For LSB init
script, no additional parameters are needed.
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output:
(onlineconf.init:3:stop:stdout) Stopping onlineconf_updater:
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output:
(onlineconf.init:3:stop:stdout) [
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output:
(onlineconf.init:3:stop:stdout)   OK
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output:
(onlineconf.init:3:stop:stdout) ]

Jun 23 20:25:25 mysender39.example.com crmd: [30682]: info:
process_lrm_event: LRM operation onlineconf.init:3_stop_0 (call=181, rc=0,
cib-update=683339, confirmed=true) ok
Jun 23 20:25:25 mysender39.example.com cib: [30678]: ERROR:
send_ais_message: Not connected to AIS

And then many errors, with this string repeating over and over.


Not enough information.
Please include a crm_report for the time between 20:20:00 and 20:30:00
on June 23.


I attached the logs to this mail. I hope it helps.


--
Best regards,
Proskurin Kirill


report.tar.bz2
Description: application/bzip
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Not connected to AIS

2011-06-24 Thread Proskurin Kirill

Hello.

I have a strange problem.
One node in the cluster is not working right.


In logs:
Jun 23 20:25:25 mysender39.example.com lrmd: [10371]: WARN: For LSB init 
script, no additional parameters are needed.
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: 
(onlineconf.init:3:stop:stdout) Stopping onlineconf_updater:
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: 
(onlineconf.init:3:stop:stdout) [
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: 
(onlineconf.init:3:stop:stdout)   OK
Jun 23 20:25:25 mysender39.example.com lrmd: [30679]: info: RA output: 
(onlineconf.init:3:stop:stdout) ]


Jun 23 20:25:25 mysender39.example.com crmd: [30682]: info: 
process_lrm_event: LRM operation onlineconf.init:3_stop_0 (call=181, 
rc=0, cib-update=683339, confirmed=true) ok
Jun 23 20:25:25 mysender39.example.com cib: [30678]: ERROR: 
send_ais_message: Not connected to AIS


And then many errors, with this string repeating over and over.
But in crm_mon all seems quiet:
Last updated: Fri Jun 24 12:35:05 2011
Stack: openais
Current DC: mysender6.example.com - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
4 Nodes configured, 4 expected votes
7 Resources configured.

Online: [ mysender6.example.com mysender31.example.com 
mysender38.example.com mysender39.example.com ]


And the clone resource on this node is unmanaged.

onlineconf.init:3  (lsb:onlineconf):   Started 
mysender39.example.com (unmanaged) FAILED


Failed actions:
onlineconf.init:3_monitor_5000 (node=mysender39.example.com, 
call=180, rc=7, status=complete): not running
onlineconf.init:3_stop_0 (node=mysender39.example.com, call=-1, 
rc=1, status=Timed Out): unknown error


In the logs:

Jun 24 12:43:15 mysender39.example.com attrd: [30680]: WARN: 
attrd_cib_callback: Update 333725 for 
fail-count-onlineconf.init:2=(null) failed: Remote node did not respond


But if I run it by hand, it answers immediately:
# /etc/init.d/onlineconf status
onlineconf_updater is stopped

I did /etc/init.d/corosync restart.
I waited for 5 minutes, but it was still "Waiting for corosync services to
unload", so I killed it with -9 and restarted.

And everything started normally again.
What was wrong?

Corosync-1.2.7
Pacemaker-1.0.11

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Resource monitor stop working

2011-06-24 Thread Proskurin Kirill

Hello all.

Another problem.
I just found out that one of my clone resources is not working and Pacemaker 
does not see this - it says all clones are started. If I run status 
from the console, all is OK.


I still can't understand how to fix it.


I attached a log from the DC showing really strange problems.

My config:
node mysender31.example.com
node mysender38.example.com
node mysender39.example.com
node mysender6.example.com
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip=10.6.1.214 cidr_netmask=32 nic=eth0:0 \
op monitor interval=15 timeout=30 on-fail=restart
primitive cleardb_delete_history_old.init lsb:cleardb_delete_history_old \
op monitor interval=15 timeout=30 on-fail=restart \
meta target-role=Started
primitive gettopupdated.init lsb:gettopupdate-my \
op monitor interval=15 timeout=30 on-fail=restart
primitive onlineconf.init lsb:onlineconf \
op monitor interval=15
primitive qm_manager.init lsb:qm_manager \
op monitor interval=15 timeout=30 on-fail=restart \
meta target-role=Started
primitive qm_master.init lsb:qm_master \
op monitor interval=15 timeout=30 on-fail=restart
primitive silverbox-stat.1.init lsb:silverbox-stat.1 \
op monitor interval=15 timeout=30 on-fail=restart \
meta target-role=Started
clone gettopupdated.clone gettopupdated.init
clone onlineconf.clone onlineconf.init
clone qm_master.clone qm_master.init \
meta clone-max=2
location CLEARDB_RUNS_ONLY_ON_MS6 cleardb_delete_history_old.init \
rule $id=CLEARDB_RUNS_ONLY_ON_MS6-rule -inf: #uname ne 
mysender6.example.com

location QM-PREFER-MS39 qm_manager.init 100: mysender39.example.com
location QM_MASTER_DENY_MS38 qm_master.clone -inf: mysender38.example.com
location QM_MASTER_DENY_MS39 qm_master.clone -inf: mysender39.example.com
location SILVERBOX-STAT_RUNS_ONLY_ON_MS38 silverbox-stat.1.init \
rule $id=SILVERBOX-STAT_RUNS_ONLY_ON_MS38-rule -inf: #uname 
ne mysender38.example.com

colocation QM-IP inf: ClusterIP qm_manager.init
order IP-Before-Qm inf: ClusterIP qm_manager.init
property $id=cib-bootstrap-options \
dc-version=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87 \
cluster-infrastructure=openais \
expected-quorum-votes=4 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1308909119


--
Best regards,
Proskurin Kirill
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender38.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender31.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender39.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: info: determine_online_status: Node mysender6.example.com is online
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: WARN: unpack_rsc_op: Processing failed op onlineconf.init:2_monitor_5000 on mysender38.example.com: not running (7)
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation ClusterIP_monitor_0 found resource ClusterIP active on mysender38.mail.ru
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation gettopupdated.init:3_monitor_0 found resource gettopupdated.init:3 active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation silverbox-stat.1.init_monitor_0 found resource silverbox-stat.1.init active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_master.init:0_monitor_0 found resource qm_master.init:0 active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation cleardb_delete_history_old.init_monitor_0 found resource cleardb_delete_history_old.init active on mysender38.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_master.init:1_monitor_0 found resource qm_master.init:1 active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation cleardb_delete_history_old.init_monitor_0 found resource cleardb_delete_history_old.init active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation onlineconf.init:1_monitor_0 found resource onlineconf.init:1 active on mysender31.example.com
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: WARN: unpack_rsc_op: Processing failed op onlineconf.init:1_monitor_5000 on mysender31.example.com: not running (7)
Jun 24 11:27:40 mysender6.example.com pengine: [23744]: notice: unpack_rsc_op: Operation qm_manager.init_monitor_0 found resource qm_manager.init active on mysender39.example.com
Jun 24 11:27:40 mysender6.example.com

[Pacemaker] Deleted nodes returns

2011-06-22 Thread Proskurin Kirill

Hello all.

I have a strange problem.
At the beginning of my cluster there were nodes called mysender38.i and 
mysender39.i.


Then I:
Stopped them
Deleted everything from /var/lib/heartbeat/crm/*
crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

Changed their hostnames
Started them

And they are gone, and the new ones are running.
But *any time* I make changes to the cluster configuration I get this:
OFFLINE: [ mysender39.i mysender38.i ]

And I need to run crm_node --force --remove and so on again to make them 
disappear. Is it a bug, or am I doing something wrong?


pacemaker-1.0.11
corosync-1.2.7


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Deleted nodes returns

2011-06-22 Thread Proskurin Kirill

On 06/22/2011 03:41 PM, Florian Haas wrote:

On 2011-06-22 12:41, Proskurin Kirill wrote:

Hello all.

I have a strange problem.
At the beginning of my cluster there were nodes called mysender38.i and
mysender39.i.

Then I:
Stopped them
Deleted everything from /var/lib/heartbeat/crm/*
crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'
Changed their hostnames
Started them

And they are gone, and the new ones are running.
But *any time* I make changes to the cluster configuration I get this:
OFFLINE: [ mysender39.i mysender38.i ]

And I need to run crm_node --force --remove and so on again to make them
disappear. Is it a bug, or am I doing something wrong?


Why do you do things the hard way rather than simply running "crm node
delete <node>"?


Well, I was following the docs, but I tried that too and it does not help at all.

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Hostname issues

2011-06-21 Thread Proskurin Kirill

Hello all

I have 4 nodes, each with two NICs in two networks. All of them have two
DNS names: one for the internal network and one for the external one.

These hosts *must* keep the external-network hostname (other software
depends on it), but corosync must run on the internal NIC.

The problem is that the node name is taken from uname -n, which returns
the external name.

How can I avoid this? I can't change the hostname to the internal one,
and I can't run corosync on the external network.
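
For what it's worth, binding corosync to the internal NIC itself is not
the hard part - that is just bindnetaddr in corosync.conf. A minimal
sketch, assuming 10.0.0.0 is the internal subnet (this only selects the
interface; the node name still comes from uname -n):

totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0   # network address of the internal NIC (assumption)
        mcastaddr: 226.94.1.1   # example multicast address
        mcastport: 5405
    }
}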


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Groups

2011-06-20 Thread Proskurin Kirill

Hello all!

I'm new to Pacemaker and have a small question.
I want my resource to run on all nodes except some.

For example, we have 10 nodes: node1-10.
I want it running on node1-5 but not on node6-10.
I can create five locations with -INFINITY: node6; -INFINITY: node7; and
so on.

But that is not the way I want to do this.
Is it possible to make some kind of group (not a Pacemaker resource
group) of nodes, resources and so on, and just add -INFINITY: groupname?

Or maybe there is an option to just list them in a row, like
-INFINITY: node6, node7, node8?

Or is there another way that I missed?
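
(The closest I can get with a single constraint - a sketch, untested: mark
the allowed nodes with a node attribute and keep the resource off every
unmarked node with one location rule. The attribute name "allowed" and
resource name "myresource" are made up for the example.)

crm node attribute node1 set allowed 1
crm node attribute node2 set allowed 1
# ... same for node3, node4, node5 ...
crm configure location myresource-on-allowed myresource \
    rule -inf: not_defined allowed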

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] FS mount error

2010-07-22 Thread Proskurin Kirill
]: notice: clone_print: 
Master/Slave Set: WebData
Jul 22 08:18:43 node01 pengine: [1813]: notice: short_print: 
Masters: [ node02.domain.org ]
Jul 22 08:18:43 node01 pengine: [1813]: notice: short_print: 
Slaves: [ node01.domain.org ]
Jul 22 08:18:43 node01 pengine: [1813]: notice: native_print: 
WebFS#011(ocf::heartbeat:Filesystem):#011Stopped
Jul 22 08:18:43 node01 pengine: [1813]: info: get_failcount: WebFS has 
failed 100 times on node01.domain.org
Jul 22 08:18:43 node01 pengine: [1813]: WARN: common_apply_stickiness: 
Forcing WebFS away from node01.domain.org after 100 failures 
(max=100)
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: 
WebData: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: 
wwwdrbd:0: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: native_merge_weights: 
WebData: Rolling back scores from WebFS
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: Promoting 
wwwdrbd:0 (Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: WebData: 
Promoted 1 instances of a possible 1 to master
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: Promoting 
wwwdrbd:0 (Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: master_color: WebData: 
Promoted 1 instances of a possible 1 to master
Jul 22 08:18:43 node01 pengine: [1813]: notice: RecurringOp:  Start 
recurring monitor (60s) for WebSite on node02.domain.org
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave 
resource ClusterIP#011(Started node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Start 
WebSite#011(node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave 
resource wwwdrbd:0#011(Master node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Leave 
resource wwwdrbd:1#011(Slave node01.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: notice: LogActions: Start 
WebFS#011(node02.domain.org)
Jul 22 08:18:43 node01 pengine: [1813]: info: process_pe_message: 
Transition 199: PEngine Input stored in: /var/lib/pengine/pe-input-243.bz2
Jul 22 08:18:44 node01 crmd: [1814]: ERROR: stonithd_signon: Can't 
initiate connection to stonithd

Jul 22 08:18:44 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:44 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in 
failed: triggered a retry
Jul 22 08:18:44 node01 crmd: [1814]: info: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 22 08:18:44 node01 crmd: [1814]: info: unpack_graph: Unpacked 
transition 199: 4 actions in 4 synapses
Jul 22 08:18:44 node01 crmd: [1814]: info: do_te_invoke: Processing 
graph 199 (ref=pe_calc-dc-1279783123-729) derived from 
/var/lib/pengine/pe-input-243.bz2
Jul 22 08:18:44 node01 crmd: [1814]: info: te_rsc_command: Initiating 
action 42: start WebFS_start_0 on node02.domain.org
Jul 22 08:18:44 node01 crmd: [1814]: info: te_rsc_command: Initiating 
action 5: probe_complete probe_complete on node02.domain.org - no waiting
Jul 22 08:18:44 node01 crmd: [1814]: info: te_connect_stonith: 
Attempting connection to fencing daemon...
Jul 22 08:18:45 node01 crmd: [1814]: ERROR: stonithd_signon: Can't 
initiate connection to stonithd

Jul 22 08:18:45 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:45 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in 
failed: triggered a retry
Jul 22 08:18:45 node01 crmd: [1814]: info: te_connect_stonith: 
Attempting connection to fencing daemon...
Jul 22 08:18:46 node01 crmd: [1814]: ERROR: stonithd_signon: Can't 
initiate connection to stonithd

Jul 22 08:18:46 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:46 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in 
failed: triggered a retry
Jul 22 08:18:46 node01 crmd: [1814]: info: te_connect_stonith: 
Attempting connection to fencing daemon...
Jul 22 08:18:47 node01 crmd: [1814]: ERROR: stonithd_signon: Can't 
initiate connection to stonithd

Jul 22 08:18:47 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 08:18:47 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in 
failed: triggered a retry
Jul 22 08:18:47 node01 crmd: [1814]: info: te_connect_stonith: 
Attempting connection to fencing daemon...
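
(Side note on the repeated stonithd errors above: crmd keeps retrying to
sign on to a fencing daemon it cannot reach. If fencing is intentionally
absent - an assumption, the logs do not say - the retries can be stopped
by disabling STONITH:

crm configure property stonith-enabled=false

With real fencing hardware available, configuring a STONITH resource
instead would be the proper fix.)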


--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] FS mount error

2010-07-22 Thread Proskurin Kirill

On 22/07/10 12:23, Michael Fung wrote:

crm resource cleanup WebFS


That does not help.

node01:~# crm resource cleanup WebFS
Cleaning up WebFS on mail02.fxclub.org
Cleaning up WebFS on mail01.fxclub.org

Jul 22 09:33:24 node01 crm_resource: [3442]: info: Invoked: crm_resource 
-C -r WebFS -H node01.domain.org
Jul 22 09:33:25 node01 crmd: [1814]: ERROR: stonithd_signon: Can't 
initiate connection to stonithd

Jul 22 09:33:25 node01 crmd: [1814]: notice: Not currently connected.
Jul 22 09:33:25 node01 crmd: [1814]: ERROR: te_connect_stonith: Sign-in 
failed: triggered a retry
Jul 22 09:33:25 node01 crmd: [1814]: info: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 22 09:33:25 node01 crmd: [1814]: info: unpack_graph: Unpacked 
transition 647: 6 actions in 6 synapses
Jul 22 09:33:25 node01 crmd: [1814]: info: do_te_invoke: Processing 
graph 647 (ref=pe_calc-dc-1279787604-2520) derived from 
/var/lib/pengine/pe-input-691.bz2
Jul 22 09:33:25 node01 crmd: [1814]: info: te_rsc_command: Initiating 
action 2: stop WebFS_stop_0 on node02.domain.org
Jul 22 09:33:25 node01 crmd: [1814]: info: te_rsc_command: Initiating 
action 6: probe_complete probe_complete on node02.domain.org - no waiting


...

Jul 22 09:33:32 node01 crmd: [1814]: WARN: status_from_rc: Action 43 
(WebFS_start_0) on node02.domain.org failed (target: 0 vs. rc: 1): Error
Jul 22 09:33:32 node01 crmd: [1814]: WARN: update_failcount: Updating 
failcount for WebFS on node02.domain.org after failed start: rc=1 
(update=INFINITY, time=1279787612)
Jul 22 09:33:32 node01 crmd: [1814]: info: abort_transition_graph: 
match_graph_event:272 - Triggered transition abort (complete=0, 
tag=lrm_rsc_op, id=WebFS_start_0, 
magic=0:1;43:647:0:882b3ca6-0496-4e26-9137-0a10d6ce57e4, cib=0.144.897) 
: Event failed
Jul 22 09:33:32 node01 crmd: [1814]: info: update_abort_priority: Abort 
priority upgraded from 0 to 1
Jul 22 09:33:32 node01 crmd: [1814]: info: update_abort_priority: Abort 
action done superceeded by restart
Jul 22 09:33:32 node01 crmd: [1814]: info: match_graph_event: Action 
WebFS_start_0 (43) confirmed on node02.domain.org (rc=4)
Jul 22 09:33:32 node01 crmd: [1814]: info: run_graph: 

Jul 22 09:33:32 node01 crmd: [1814]: notice: run_graph: Transition 647 
(Complete=4, Pending=0, Fired=0, Skipped=2, Incomplete=0, 
Source=/var/lib/pengine/pe-input-691.bz2): Stopped
Jul 22 09:33:32 node01 crmd: [1814]: info: te_graph_trigger: Transition 
647 is now complete
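
(Since the log shows the failcount being pushed to INFINITY, it may help
to inspect and reset the counter directly - a sketch using the crm shell,
with the node name taken from the log above:

crm resource failcount WebFS show node02.domain.org
crm resource failcount WebFS delete node02.domain.org

cleanup is supposed to reset this too, so if the counter is already 0 the
real problem is the start failure itself, not a stale counter.)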



--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Pacemaker see double node`s

2010-07-14 Thread Proskurin Kirill

On 14/07/10 16:48, Florian Haas wrote:

I take it you switched cluster stacks, otherwise you wouldn't be seeing
each node twice, once with the $id attribute and once without.

Take a look at
http://www.clusterlabs.org/wiki/Initial_Configuration#A_Special_Note_for_People_Switching_Cluster_Stacks


Thanks - it works like a charm.

--
Best regards,
Proskurin Kirill

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker