[Pacemaker] Broken links and broken bugzilla

2014-10-17 Thread Andrew Widdersheim
I noticed a few broken links and the bugzilla seems broken as well:

http://bugs.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
http://oss.clusterlabs.org/mailman/options/pacemaker

Pretty much a lot of stuff on the following seems to need some love:

http://clusterlabs.org/wiki/Mailing_lists   
  
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-23 Thread Andrew Widdersheim
After setting crmd-transition-delay to 2 * my ping monitor interval, the issues 
I was seeing before in testing have not recurred. Thanks again for the help.
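
For anyone finding this later, in crm shell that setting looks something like 
the following. The value here assumes a 5s ping monitor interval (so a 10s 
delay); adjust to your own interval:

```
# delay transition processing so both nodes' pingd updates arrive first
crm configure property crmd-transition-delay="10s"
```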


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-20 Thread Andrew Widdersheim
Have I just run into a shortcoming with pacemaker? Should I file a bug or RFE 
somewhere? It seems like there should be another parameter when setting up a 
pingd resource that tells the DC/policy engine to wait x seconds, so that all 
nodes have shared their connection state before it makes a decision about 
moving resources.


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread Andrew Widdersheim
The cluster has 3 connections total. The first connection is the outside 
interface where services can communicate and is also used for cluster 
communication using mcast. The second interface is a cross-over that is solely 
for cluster communication. The third connection is another cross-over solely 
for DRBD replication.

This issue happens when the first connection that is used for both the services 
and cluster communication is pulled on both nodes at the same time. 
  


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread Andrew Widdersheim
Thanks for the help. Adding another node to the ping host_list may help in some 
situations, but the root issue doesn't really get solved. Also, the location 
constraint you posted is very different from mine. Your constraint requires 
connectivity, whereas the one I am trying to use looks for the best 
connectivity.

I have used the location constraint you posted with success in the past, but I 
don't want my resource to be shut off in the event of a network outage that is 
happening across all nodes at the same time. Don't get me wrong, in some 
cluster configurations I do use the configuration you posted, but this setup is 
not one of them, for specific reasons.
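
To make the difference concrete, the two styles look roughly like this in crm 
shell (constraint names hypothetical; the "required" form follows the usual 
Pacemaker Explained example):

```
# require connectivity: ban the group from nodes that lost the ping target
location l_require_ping g_mysql \
        rule -inf: not_defined pingd or pingd lte 0
# prefer best connectivity: only score nodes by their pingd value
location l_prefer_ping g_mysql \
        rule pingd: defined pingd
```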


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
Just tried the patch you gave and it worked fine. Any plans on putting this 
patch in officially, or was this a one-off? Aside from this patch, I guess the 
only way to get things to work is to install things slightly differently and 
add a symlink from cluster-glue's lrmd to pacemaker's.
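
Roughly, that workaround looks like the following. The paths are taken from the 
ps output elsewhere in this thread; verify them on your own build before 
running anything:

```
# replace cluster-glue's lrmd with a symlink to pacemaker's lrmd
mv /usr/lib64/heartbeat/lrmd /usr/lib64/heartbeat/lrmd.bak
ln -s /usr/libexec/pacemaker/lrmd /usr/lib64/heartbeat/lrmd
```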

 Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the 
 LRM 7
 From: and...@beekhof.net
 Date: Thu, 16 May 2013 15:20:59 +1000
 CC: pacemaker@oss.clusterlabs.org
 To: awiddersh...@hotmail.com
 
 
 On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com 
 wrote:
 
  I'll look into moving over to the cman option since that is preferred for 
  RHEL6.4 now if I'm not mistaken.
 
 Correct
 
  I'll also try out the patch provided and see how that goes. So was the LRMD 
  not a part of pacemaker previously and later added? Was it originally a part 
  of heartbeat/cluster-glue? I'm just trying to figure out all of the pieces 
  so that I know how to fix things if I choose to go down that road.
 
 
 Originally everything was part of heartbeat.
 Then what was then called the crm became pacemaker and the lrmd v1 became 
 part of cluster-glue (because the theory was that someone might use it for a 
 pacemaker alternative).
 That never happened and we stopped using almost everything else from 
 cluster-glue, so when lrmd v2 was written, it was done so as part of 
 pacemaker.
 
 or, tl;dr - yes and yes :)


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
I'm attaching 3 patches I made fairly quickly to fix the installation issues, 
and also an issue I noticed with the ping ocf from the latest pacemaker.

One is for cluster-glue, to prevent lrmd from building and later installing. 
You may also want to modify this patch to take lrmd out of both spec files 
included with the source if you plan to build an rpm. I'm not sure what I did 
here is the best way to approach this problem, so if anyone has anything better 
please let me know.

One is for pacemaker, to create the lrmd symlink when building with heartbeat 
support. Note that the spec does not need anything changed here.

Finally, I saw the following errors in messages with the latest ping ocf, and 
the attached patch seems to fix the issue:

May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished: 
p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: 
[: : integer expression expected ]
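
For context, "[: : integer expression expected" is what the shell prints when 
an empty variable lands in a numeric test. A hypothetical illustration, not the 
actual ping RA code, of the failure mode and the usual guard:

```shell
#!/bin/sh
# Hypothetical sketch: variable names are invented, not from the ping RA.
# A numeric test like [ "$active" -gt 0 ] errors out when $active is empty;
# defaulting the expansion to 0 avoids the "integer expression expected" error.
active=""                          # e.g. a grep/awk pipeline matched nothing
if [ "${active:-0}" -gt 0 ]; then  # ":-0" substitutes 0 when empty or unset
    echo "host up"
else
    echo "host down"
fi
```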

cluster-glue-no-lrmd.patch
Description: Binary data


pacemaker-lrmd-hb.patch
Description: Binary data


pacemaker-ping-failure.patch
Description: Binary data


[Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
I am running the following versions:

pacemaker-1.1.10-rc2
cluster-glue-1.0.11
heartbeat-3.0.5

I was running pacemaker-1.1.6 and things were working fine but after updating 
to the latest I could not get pacemaker to start with the following message 
repeated in the logs:

crmd[8456]:  warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) 
times

Here is strace output from the crmd process:

0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.21 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.000574 socket(PF_FILE, SOCK_STREAM, 0) = 6
0.42 fcntl(6, F_GETFD)         = 0
0.25 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
0.21 fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
0.55 connect(6, {sa_family=AF_FILE, path=@lrmd}, 110) = -1 ECONNREFUSED 
(Connection refused)
0.50 close(6)                  = 0
0.31 shutdown(4294967295, 2 /* send and receive */) = -1 EBADF (Bad file 
descriptor)
0.24 close(4294967295)         = -1 EBADF (Bad file descriptor)
0.39 write(2, Could not establish lrmd connect..., 62) = 62
0.58 sendto(3, 28May 14 18:54:51 crmd[8456]: ..., 104, MSG_NOSIGNAL, 
NULL, 0) = 104
0.000327 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 
430616237
0.28 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.25 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.26 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)

I'm not quite sure what the issue is. At first I thought it might have been 
some type of permissions issue, but I'm not sure that is the case anymore. Any 
help would be appreciated. I can forward along any more details to help in 
troubleshooting.


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-15 Thread Andrew Widdersheim
Sorry to bring up an old issue, but I am having the exact same problem as the 
original poster. A simultaneous disconnect on my two-node cluster causes the 
resources to start to transition to the other node, but mid-flight the 
transition is aborted and resources are started again on the original node when 
the cluster realizes connectivity is the same between the two nodes.

I have tried various dampen settings without any luck. It seems like the nodes 
report the outages at slightly different times, which results in a partial 
transition of resources instead of waiting to know the connectivity of all of 
the nodes in the cluster before taking action, which is what I would have 
thought dampen would help solve.

Ideally the cluster wouldn't start the transition if another cluster node is 
having a connectivity issue as well, since connectivity status is shared 
between all cluster nodes. Find my configuration below. Let me know if there is 
something I can change to fix this or if this behavior is expected.

primitive p_drbd ocf:linbit:drbd \
        params drbd_resource=r1 \
        op monitor interval=30s role=Slave \
        op monitor interval=10s role=Master
primitive p_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd/by-res/r1 directory=/drbd/r1 fstype=ext4 options=noatime \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=180s \
        op monitor interval=30s timeout=40s
primitive p_mysql ocf:heartbeat:mysql \
        params binary=/usr/libexec/mysqld config=/drbd/r1/mysql/my.cnf datadir=/drbd/r1/mysql \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s \
        op monitor interval=30s \
        meta target-role=Started
primitive p_ping ocf:pacemaker:ping \
        params host_list=192.168.5.1 dampen=30s multiplier=1000 debug=true \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        op monitor interval=5s timeout=10s
group g_mysql_group p_fs p_mysql \
        meta target-role=Started
ms ms_drbd p_drbd \
        meta notify=true master-max=1 clone-max=2 target-role=Started
clone cl_ping p_ping
location l_connected g_mysql \
        rule $id=l_connected-rule pingd: defined pingd
colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master
order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start
property $id=cib-bootstrap-options \
        dc-version=1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2 \
        cluster-infrastructure=Heartbeat \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        cluster-recheck-interval=5m \
        last-lrm-refresh=1368632470
rsc_defaults $id=rsc-options \
        migration-threshold=5 \
        resource-stickiness=200
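
For reference, the dampen settings I tried were variations on the p_ping 
primitive above, along these lines (the numbers varied between tests and are 
only examples):

```
# raise dampen well past the 5s monitor interval, hoping both nodes'
# pingd changes get flushed together before the PE reacts
primitive p_ping ocf:pacemaker:ping \
        params host_list=192.168.5.1 dampen=60s multiplier=1000 debug=true \
        op monitor interval=5s timeout=10s
```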


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
These are the libqb versions:

libqb-devel-0.14.2-3.el6.x86_64
libqb-0.14.2-3.el6.x86_64

Here is a process listing where lrmd is running:

[root@node1 ~]# ps auxwww | egrep "heartbeat|pacemaker"
root      9553  0.1  0.7  52420  7424 ?  SLs  May14   1:39 heartbeat: master control process
root      9556  0.0  0.7  52260  7264 ?  SL   May14   0:10 heartbeat: FIFO reader
root      9557  0.0  0.7  52256  7260 ?  SL   May14   1:01 heartbeat: write: mcast eth0
root      9558  0.0  0.7  52256  7260 ?  SL   May14   0:14 heartbeat: read: mcast eth0
root      9559  0.0  0.7  52256  7260 ?  SL   May14   0:23 heartbeat: write: bcast eth1
root      9560  0.0  0.7  52256  7260 ?  SL   May14   0:13 heartbeat: read: bcast eth1
498       9563  0.0  0.2  36908  2392 ?  S    May14   0:10 /usr/lib64/heartbeat/ccm
498       9564  0.0  1.0  85084 10704 ?  S    May14   0:25 /usr/lib64/heartbeat/cib
root      9565  0.0  0.1  44588  1896 ?  S    May14   0:04 /usr/lib64/heartbeat/lrmd -r
root      9566  0.0  0.3  83544  3988 ?  S    May14   0:10 /usr/lib64/heartbeat/stonithd
498       9567  0.0  0.3  78668  3248 ?  S    May14   0:10 /usr/lib64/heartbeat/attrd
498      26534  0.0  0.3  92364  3748 ?  S    16:05   0:00 /usr/lib64/heartbeat/crmd
498      26535  0.0  0.2  72840  2708 ?  S    16:05   0:00 /usr/libexec/pacemaker/pengine

Here are the logs at startup, until the "Failed to sign on" message just starts 
to repeat over and over:

May 15 16:07:06 node1 crmd[26621]:   notice: main: CRM Git Version: b060cae
May 15 16:07:06 node1 attrd[26620]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 attrd[26620]:   notice: main: Starting mainloop...
May 15 16:07:06 node1 stonith-ng[26619]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 cib[26617]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 lrmd: [26618]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 lrmd: [26618]: info: max-children set to 4 (1 processors online)
May 15 16:07:06 node1 lrmd: [26618]: info: enabling coredumps
May 15 16:07:06 node1 lrmd: [26618]: info: Started.
May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Connection failed 1 times (30 max)
May 15 16:07:06 node1 ccm: [26616]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 ccm: [26616]: info: Hostname: node1
May 15 16:07:07 node1 crmd[26621]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Connection failed 2 times (30 max)
May 15 16:07:10 node1 crmd[26621]:  warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
May 15 16:07:13 node1 crmd[26621]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: crm_update_ccm_node: Node node2[1] - state is now member (was (null))
May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: crm_update_ccm_node: Node node1[0] - state is now member (was (null))
May 15 16:07:15 node1 crmd[26621]:  warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times

Here are the repeating message pieces:

May 15 16:06:09 node1 crmd[26534]:    error: do_lrm_control: Failed to sign on to the LRM 30 (max) times
May 15 16:06:09 node1 crmd[26534]:    error: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
May 15 16:06:09 node1 crmd[26534]:  warning: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
May 15 16:06:09 node1 crmd[26534]:  warning: do_recover: Fast-tracking shutdown in response to errors
May 15 16:06:09 node1 crmd[26534]:    error: do_started: Start cancelled... S_RECOVERY
May 15 16:06:09 node1 crmd[26534]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
May 15 16:06:09 node1 crmd[26534]:   notice: do_lrm_control: Disconnected from the LRM
May 15 16:06:09 node1 ccm: [9563]: info: client (pid=26534) removed from ccm
May 15 16:06:09 node1 crmd[26534]:    error: do_exit: Could not recover from internal error
May 15 16:06:09 node1 crmd[26534]:    error: crm_abort: crm_glib_handler: Forked child 26540 to record non-fatal assert at logging.c:63 : g_hash_table_size: assertion `hash_table != NULL' failed
May 15 16:06:09 node1 crmd[26534]:    error: crm_abort: crm_glib_handler: Forked child 26541 to record non-fatal assert at logging.c:63 : g_hash_table_destroy: assertion `hash_table 

Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-15 Thread Andrew Widdersheim
I attached logs from both nodes. Yes, we compiled 1.1.6 with heartbeat support 
for RHEL6.4. I tried 1.1.10 but had issues. I have another thread open on the 
mailing list for that issue as well. I'm not opposed to doing CMAN or corosync 
if those fix the problem.

We have been using this setup, or one very similar, for about 2-3 years. 
Florian Haas actually came to our company to give a training when he was still 
at Linbit, and this is how we set it up then and have continued to do so since.

We have never had an issue up until this point, because all of our clusters in 
the past were set up so that connectivity was required and it was expected that 
resources would shut down during an event like this.
 May 15 13:42:00 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
logd is not running
May 15 13:42:00 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
2013/05/15_13:42:00 WARNING: 192.168.5.1 is inactive
May 15 13:42:09 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
logd is not running
May 15 13:42:09 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
2013/05/15_13:42:09 WARNING: 192.168.5.1 is inactive
May 15 13:42:18 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
logd is not running
May 15 13:42:18 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
2013/05/15_13:42:18 WARNING: 192.168.5.1 is inactive
May 15 13:42:27 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
logd is not running
May 15 13:42:27 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 
2013/05/15_13:42:27 WARNING: 192.168.5.1 is inactive
May 15 13:42:30 node1 attrd: [27348]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: pingd (0)
May 15 13:42:30 node1 attrd: [27348]: notice: attrd_perform_update: Sent update 
238: pingd=0
May 15 13:42:30 node1 crmd: [27349]: info: abort_transition_graph: 
te_update_diff:164 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-f5a576b5-003b-447d-8029-19202823bbfa-pingd, name=pingd, value=0, 
magic=NA, cib=0.75.78) : Transient attribute: update
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
May 15 13:42:30 node1 crmd: [27349]: info: do_pe_invoke: Query 362: Requesting 
the current CIB: S_POLICY_ENGINE
May 15 13:42:30 node1 crmd: [27349]: info: do_pe_invoke_callback: Invoking the 
PE: query=362, ref=pe_calc-dc-1368639750-564, seq=8, quorate=1
May 15 13:42:30 node1 pengine: [6643]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
May 15 13:42:30 node1 pengine: [6643]: notice: unpack_rsc_op: Operation 
p_drbd:1_last_failure_0 found resource p_drbd:1 active on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (30s) for p_fs on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (30s) for p_mysql on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (30s) for p_drbd:0 on node1
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (10s) for p_drbd:1 on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (30s) for p_drbd:0 on node1
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp:  Start recurring 
monitor (10s) for p_drbd:1 on node2
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Move    
p_fs#011(Started node1 -> node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Move    
p_mysql#011(Started node1 -> node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Demote  
p_drbd:0#011(Master -> Slave node1)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Promote 
p_drbd:1#011(Slave -> Master node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Leave   
p_ping:0#011(Started node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Leave   
p_ping:1#011(Started node1)
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
May 15 13:42:30 node1 crmd: [27349]: info: unpack_graph: Unpacked transition 
56: 40 actions in 40 synapses
May 15 13:42:30 node1 crmd: [27349]: info: do_te_invoke: Processing graph 56 
(ref=pe_calc-dc-1368639750-564) derived from /var/lib/pengine/pe-input-64.bz2
May 15 13:42:30 node1 crmd: [27349]: info: te_pseudo_action: Pseudo action 23 
fired and confirmed
May 15 13:42:30 node1 crmd: [27349]: info: te_rsc_command: Initiating action 7: 
cancel p_drbd:0_monitor_1 on node1 (local)
May 15 13:42:30 node1 lrmd: [18185]: WARN: For LSB init script, no additional 

Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
There are quite a few symlinks of heartbeat pieces back to pacemaker pieces, 
crmd for example, but lrmd was not one of them:

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/crmd
lrwxrwxrwx 1 root root 27 May 14 17:31 /usr/lib64/heartbeat/crmd -> /usr/libexec/pacemaker/crmd
[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
-rwxr-xr-x 1 root root 85K May 14 17:19 /usr/lib64/heartbeat/lrmd

I just tried to create the symlink by hand, but when I started heartbeat the 
logs had nothing about lrmd starting or trying to start, nor did lrmd show up 
in the process list anymore. Just more failure messages.

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
lrwxrwxrwx 1 root root 27 May 15 19:38 /usr/lib64/heartbeat/lrmd -> /usr/libexec/pacemaker/lrmd

I then started lrmd manually as root with the verbose option turned on, and it 
looks like things started to connect; the cluster on node1, where I started 
lrmd manually, began coming online and working a bit. I noticed that 
pacemaker's lrmd no longer has a -r option, which, looking at my old ps output, 
was how it was being started:

[root@node1 ~]# /usr/libexec/pacemaker/lrmd --help
lrmd - Pacemaker Remote daemon for extending pacemaker functionality to remote nodes.
Usage: lrmd [options]
Options:
 -?, --help             This text
 -$, --version          Version information
 -V, --verbose          Increase debug output
 -l, --logfile=value    Send logs to the additional named logfile

This is what heartbeat's lrmd looks like:

[root@node1 ~]# /usr/lib64/heartbeat/lrmd.bak --help
/usr/lib64/heartbeat/lrmd.bak: invalid option -- '-'
usage: lrmd [-srkhv]
  s: status
  r: restart
  k: kill
  m: register to apphbd
  i: the interval of apphb
  h: help
  v: debug

Previous ps output:

root      9565  0.0  0.1  44588  1896 ?  S  May14  0:04 /usr/lib64/heartbeat/lrmd -r

I'm not sure what initially tries to spawn lrmd, but it is likely that will 
need to change as well. Is all of this the result of a bad installation, did I 
need to compile things differently, or is pacemaker too new and heartbeat too 
old? Basically, what do I need to do to fix it?


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
I'll look into moving over to the cman option since that is preferred for 
RHEL6.4 now, if I'm not mistaken. I'll also try out the patch provided and see 
how that goes. So was the LRMD not a part of pacemaker previously and later 
added? Was it originally a part of heartbeat/cluster-glue? I'm just trying to 
figure out all of the pieces so that I know how to fix things if I choose to go 
down that road.


[Pacemaker] Resource fails to stop

2012-07-26 Thread Andrew Widdersheim

One of my resources failed to stop because it hit the timeout setting. The 
resource went into a failed state and froze the cluster until I manually fixed 
the problem. My question is: what is pacemaker's default action when it 
encounters a stop failure and STONITH is not enabled? Is it what I saw, where 
the resource goes into a failed state and nothing tries to start it anywhere 
until manual intervention, or does it continually try to stop it?

The reason I ask is I found the following link which suggests to me that after 
the failure timeout is reached when stopping a resource and STONITH is not 
enabled pacemaker will continually try to stop the resource until it succeeds:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html

If STONITH is not enabled, then the cluster has no way to continue and 
will not try to start the resource elsewhere, but will try to stop it 
again after the failure timeout.


I am using pacemaker 1.1.5.
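
For later readers, the failure handling can be made explicit per operation. A 
hedged crm-shell sketch (resource name hypothetical, not from my config):

```
# on-fail=block makes the "freeze and wait for a human" behavior explicit
primitive p_example ocf:heartbeat:Dummy \
        op stop interval=0 timeout=60s on-fail=block
# with fencing configured, the cluster can recover on its own instead
property stonith-enabled=true
```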


Re: [Pacemaker] Resource fails to stop

2012-07-26 Thread Andrew Widdersheim

Ah, that makes sense. Thanks for helping me wrap my head around it. 
Working on setting up STONITH now to avoid this in the future.  


Re: [Pacemaker] Resource active on both nodes

2012-07-06 Thread Andrew Widdersheim

Update...

I did a crm resource cleanup of everything, and these messages started to look 
more sane. The logs now showed these resources active only on what is currently 
the active node.




[Pacemaker] Resource active on both nodes

2012-07-05 Thread Andrew Widdersheim

I'm seeing messages similar to the following:

Jul  5 14:34:06 server1 pengine: [423]: notice: unpack_rsc_op: Operation 
p_syslog-ng_monitor_0 found resource p_syslog-ng active on server2
Jul  5 14:34:06 server1 pengine: [423]: notice: unpack_rsc_op: Operation 
p_bacula_monitor_0 found resource p_bacula active on server2

The cluster is currently running fine, and none of these resources are started 
on server2 (currently the passive node), so I'm not clear on why I'm seeing 
these messages.

All of the start and stop scripts are LSB compliant per: 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

No resource failures when doing crm_mon -rf.

Is it possible these resources were active on both nodes at some point and it 
was corrected, but these messages persist? Also, none of these resources are 
set to start at boot or anything of that nature.

OS: RHEL6
Pacemaker: 1.1.5
Heartbeat: 3.0.5