Re: [ClusterLabs] pacemaker remote configuration on ubuntu 14.04

2016-03-07 Thread Сергей Филатов
Thanks for the answer. It turned out the problem was not with IPv6.
The remote node is listening on port 3121 and its name resolves fine.
The authkey file is in place at /etc/pacemaker on both the remote and the cluster nodes.
What else can I check? Is there a walkthrough for Ubuntu?
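
(For reference, this is roughly how I verified the port and the key; "remote1" just stands in for the actual remote node name:)

# on the remote node: pacemaker_remoted listening on the default port 3121
netstat -tlnp | grep 3121

# from a cluster node: the port is reachable over the network
nc -zv remote1 3121

# on both sides: the authkey exists and is identical
ls -l /etc/pacemaker/authkey
md5sum /etc/pacemaker/authkey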


> On 07 Mar 2016, at 09:40, Ken Gaillot  wrote:
> 
> On 03/06/2016 07:43 PM, Сергей Филатов wrote:
>> Hi,
>> I’m trying to set up a pacemaker_remote resource on Ubuntu 14.04.
>> I followed the "remote node walkthrough" guide 
>> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/#idm140473081667280)
>> After creating the ocf:pacemaker:remote resource on a cluster node, the remote 
>> node doesn’t show up as online.
>> I guess I need to configure the remote agent to listen on IPv4; where can I 
>> configure that?
>> Or are there any other steps to set up a remote node besides the ones 
>> mentioned in the guide?
>> tcp6   0  0 :::3121   :::*   LISTEN   21620/pacemaker_rem   off (0.00/0/0)
>> 
>> pacemaker and pacemaker_remote are version 1.12
> 
> 
> pacemaker_remote will try to bind to IPv6 addresses first, and only if
> that fails, will it bind to IPv4. There is no way to configure this
> behavior currently, though it obviously would be nice to have.
> 
> The only workarounds I can think of are to make IPv6 connections work
> between the cluster and the remote node, or disable IPv6 on the remote
> node. Using IPv6, there could be an issue if your name resolution
> returns both IPv4 and IPv6 addresses for the remote host; you could
> potentially work around that by adding an IPv6-only name for it, and
> using that as the server option to the remote resource.
> 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Regular pengine warnings after a transient failure

2016-03-07 Thread Ken Gaillot
On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
>> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
>>
>>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by 
>>> vm-eiffel_monitor_6 'create' on vhbl05: Foreign event 
>>> (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, 
>>> source=process_graph_event:600, 0)
>>
>> That means the action was initiated by a different node (the previous DC
>> presumably), so the new DC wants to recalculate everything.
> 
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible.  The main
> thing is that it wasn't mishandled.
> 
>>> recovery actions turned into start actions for the resources stopped
>>> during the previous transition.  However, almost all other recovery
>>> actions just disappeared without any comment.  This was actually
>>> correct, but I really wonder why the cluster decided to paper over
>>> the previous monitor operation timeouts.  Maybe the operations
>>> finished meanwhile and got accounted somehow, just not logged?
>>
>> I'm not sure why the PE decided recovery was not necessary. Operation
>> results wouldn't be accepted without being logged.
> 
> At which logging level?  I can't see recurring monitor operation logs in
> syslog (at default logging level: notice) nor in /var/log/pacemaker.log
> (which contains info level messages as well).
> 
> However, the info level logs contain more "Transition aborted" lines, as
> if only the first of them got logged with notice level.  This would make
> sense, since the later ones don't make any difference on an already
> aborted transition, so they aren't that important.  And in fact such
> lines were suppressed from the syslog I checked first, for example:
> 
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: --- 
> 0.613.120 2
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: +++ 
> 0.613.121 (null)
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: +  /cib:  
> @num_updates=121
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: ++ 
> /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:
>operation="monitor" crm-debug-origin="do_update_resource" 
> crm_feature_set="3.0.10" 
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0, 
> origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07   crmd: info: abort_transition_graph: 
> Transition aborted by vm-elm_monitor_6 'create' on vhbl05: Foreign event 
> (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, 
> source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07   crmd: info: process_graph_event:
> Detected action (0.473) vm-elm_monitor_6.645=ok: initiated by a different 
> node
> 
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
> 
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++   
>  type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++   
>operation_key="vm-elm_monitor_6" operation="monitor" 
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" 
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="6" 
> last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++   
>operation_key="vm-elm_start_0" operation="start" 
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" 
> transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" 
> on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" 
> last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++   
> 
> 
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously?  How did it reach vhbl07, if the DC at that time was vhbl03?
> 
>> The pe-input files from the transitions around here should help.
> 
> They are available.  What shall I look for?

It's not the most user-friendly of tools, but crm_simulate can show how
the cluster would react to each transition.
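
For example, something along these lines (a rough sketch; point it at whichever
pe-input file from the log you want to inspect):

# replay a saved transition and print the actions the cluster would take
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-262.bz2
# add -s to show the allocation scores, or -VV for more detail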

Re: [ClusterLabs] Regular pengine warnings after a transient failure

2016-03-07 Thread Andrew Beekhof
On Tue, Mar 8, 2016 at 7:03 AM, Ferenc Wágner  wrote:

> Ken Gaillot  writes:
>
> > On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
> >
> >> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by
> vm-eiffel_monitor_6 'create' on vhbl05: Foreign event
> (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98,
> source=process_graph_event:600, 0)
> >
> > That means the action was initiated by a different node (the previous DC
> > presumably),


I suspect s/previous/other/

With a stuck machine it's entirely possible that the other nodes elected a
new leader.
Would I be right in guessing that fencing is disabled?


> so the new DC wants to recalculate everything.
>
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible.  The main
> thing is that it wasn't mishandled.
>
> >> recovery actions turned into start actions for the resources stopped
> >> during the previous transition.  However, almost all other recovery
> >> actions just disappeared without any comment.  This was actually
> >> correct, but I really wonder why the cluster decided to paper over
> >> the previous monitor operation timeouts.  Maybe the operations
> >> finished meanwhile and got accounted somehow, just not logged?
> >
> > I'm not sure why the PE decided recovery was not necessary. Operation
> > results wouldn't be accepted without being logged.
>
> At which logging level?  I can't see recurring monitor operation logs in
> syslog (at default logging level: notice) nor in /var/log/pacemaker.log
> (which contains info level messages as well).
>

The DC will log that the recurring monitor was successfully started, but
to avoid log noise it doesn't log subsequent successes.


>
> However, the info level logs contain more "Transition aborted" lines, as
> if only the first of them got logged with notice level.  This would make
> sense, since the later ones don't make any difference on an already
> aborted transition, so they aren't that important.  And in fact such
> lines were suppressed from the syslog I checked first, for example:
>
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: ---
> 0.613.120 2
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: +++
> 0.613.121 (null)
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: +  /cib:
> @num_updates=121
> 12:55:39 [8479] vhbl07cib: info: cib_perform_op: ++
> /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:
>  operation="monitor" crm-debug-origin="do_update_resource"
> crm_feature_set="3.0.10"
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07   crmd: info: abort_transition_graph:
>  Transition aborted by vm-elm_monitor_6 'create' on vhbl05: Foreign
> event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7,
> cib=0.613.121, source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07   crmd: info: process_graph_event:
> Detected action (0.473) vm-elm_monitor_6.645=ok: initiated by a
> different node
>
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
>
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++
> type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++
>   id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_6"
> operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.10"
> transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="6"
> last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++
>   operation_key="vm-elm_start_0" operation="start"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7"
> on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0"
> last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++
>
>
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously?  How did it reach vhbl07, if the 

Re: [ClusterLabs] Pacemaker issue lsb service

2016-03-07 Thread Kristoffer Grönlund
Thorsten Stremetzne  writes:

> Hello all,
>
>
> I have built an HA setup for an OpenVPN server.
> In my setup there are two hosts running Ubuntu Linux, pacemaker & corosync. 
> Both hosts share a virtual IP which migrates to the active host when the 
> other fails. This works well, but I also configured a primitive for the 
> openvpn-server init script, via
>
>
> crm configure primitive failover-openvpnas lsb::openvpnas op monitor 
> interval=15s
>

Hi,

Unfortunately, most LSB init scripts are not cluster-compatible by
default: they often do not implement the monitor (status) action correctly
and may report an incorrect status when the resource is not running.

I would recommend using an OCF resource agent if possible, or, in the worst
case, wrapping the LSB init script in a custom OCF resource agent which
handles the corner cases. Another option, if you are running a system
with systemd, is to use a systemd service. I have heard reports that
there are some issues with using systemd services directly as well, but
the ones I have tried have worked out of the box.
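
As a rough sketch of the systemd variant with crmsh (the unit name openvpnas
is only a guess; check with systemctl list-unit-files, it may well be named
differently on your system):

# define the resource against the systemd unit instead of the LSB script
crm configure primitive failover-openvpnas systemd:openvpnas \
    op start timeout=100s op stop timeout=100s \
    op monitor interval=15s timeout=100s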

Cheers,
Kristoffer

>
> The service gets added, but it always fails; according to the syslog, the 
> init script is being called in the wrong way.
> I'm having trouble debugging how pacemaker tries to start/stop the service on 
> the hosts.
>
>
> Can someone please assist me with some ideas and suggestions?
>
>
> Thanks very much
>
>
> Thorsten
>
>
> This e-mail may contain confidential and/or legally protected information. 
> If you are not the intended recipient or have received this e-mail in 
> error, please notify the sender immediately by telephone or e-mail and 
> delete this e-mail from your system. Unauthorized copying and unauthorized 
> forwarding of this e-mail are not permitted.

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Regular pengine warnings after a transient failure

2016-03-07 Thread Ferenc Wágner
Ken Gaillot  writes:

> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
> 
>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by 
>> vm-eiffel_monitor_6 'create' on vhbl05: Foreign event 
>> (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, 
>> source=process_graph_event:600, 0)
>
> That means the action was initiated by a different node (the previous DC
> presumably), so the new DC wants to recalculate everything.

Time travel was sort of possible in that situation, and recurring
monitor operations are not logged, so this is indeed possible.  The main
thing is that it wasn't mishandled.

>> recovery actions turned into start actions for the resources stopped
>> during the previous transition.  However, almost all other recovery
>> actions just disappeared without any comment.  This was actually
>> correct, but I really wonder why the cluster decided to paper over
>> the previous monitor operation timeouts.  Maybe the operations
>> finished meanwhile and got accounted somehow, just not logged?
>
> I'm not sure why the PE decided recovery was not necessary. Operation
> results wouldn't be accepted without being logged.

At which logging level?  I can't see recurring monitor operation logs in
syslog (at default logging level: notice) nor in /var/log/pacemaker.log
(which contains info level messages as well).

However, the info level logs contain more "Transition aborted" lines, as
if only the first of them got logged with notice level.  This would make
sense, since the later ones don't make any difference on an already
aborted transition, so they aren't that important.  And in fact such
lines were suppressed from the syslog I checked first, for example:

12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: --- 
0.613.120 2
12:55:39 [8479] vhbl07cib: info: cib_perform_op: Diff: +++ 
0.613.121 (null)
12:55:39 [8479] vhbl07cib: info: cib_perform_op: +  /cib:  
@num_updates=121
12:55:39 [8479] vhbl07cib: info: cib_perform_op: ++ 
/cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:
  
12:54:52 [8479] vhbl07cib: info: cib_perform_op: ++ 


The transition-keys match, does this mean that the above is a late
result from the monitor operation which was considered timed-out
previously?  How did it reach vhbl07, if the DC at that time was vhbl03?

> The pe-input files from the transitions around here should help.

They are available.  What shall I look for?

>> Basically, the cluster responded beyond my expectations, sparing lots of
>> unnecessary recoveries or fencing.  I'm happy, thanks for this wonderful
>> software!  But I'm left with these "Processing failed op monitor"
>> warnings emitted every 15 minutes (timer pops).  Is it safe and clever
>> to cleanup the affected resources?  Would that get rid of them without
>> invoking other effects, like recoveries for example?
>
> That's normal; it's how the cluster maintains the effect of a failure
> that has not been cleared. The logs can be confusing, because it's not
> apparent from that message alone whether the failure is new or old.

Ah, do you mean that these are the same thing that appears after "Failed
Actions:" at the end of the crm_mon output?  They certainly match, and
the logs are confusing indeed.

> Cleaning up the resource will end the failure condition, so what happens
> next depends on the configuration and state of the cluster. If the
> failure was preventing a preferred node from running the resource, the
> resource could move, depending on other factors such as stickiness.

These resources are (still) running fine, suffered only monitor failures,
and are node-neutral, so it should be safe to clean them up, I suppose.
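
(For reference, the kind of per-resource cleanup I mean; vm-elm is just one of
the affected resources:)

# clear the stale failure record for one resource on the node that reported it
crm_resource --cleanup --resource vm-elm --node vhbl05
# then check that the corresponding "Failed Actions" entry is gone
crm_mon -1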
-- 
Thanks for your quick and enlightening answer!  I feared the mere length
of my message would scare everybody away...
Regards,
Feri

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Regular pengine warnings after a transient failure

2016-03-07 Thread Ken Gaillot
On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
> Hi,
> 
> A couple of days ago the nodes of our Pacemaker 1.1.14 cluster
> (vhbl0[3-7]) experienced a temporary storage outage, leading to processes 
> getting stuck randomly for a couple of minutes and big load spikes.  There 
> were 30 monitor operation timeouts altogether on vhbl05, and an internal 
> error on the DC.  What follows is my longish analysis of the logs, which 
> may be wrong; if so, I'd be glad to learn about it.  Knowledgeable people
> may skip to the end for the main question and a short mention of the
> side questions.  So, Pacemaker logs start as:
> 
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_6 process (PID 
> 1867) timed out
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_6 process (PID 
> 1868) timed out
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_6:1867 - timed 
> out after 2ms
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_6:1868 - timed 
> out after 2ms
> 12:53:51 vhbl05 crmd[9445]:error: Operation vm-niifdc_monitor_6: 
> Timed Out (node=vhbl05, call=720, timeout=2ms)
> 12:53:52 vhbl05 crmd[9445]:error: Operation vm-niiffs_monitor_6: 
> Timed Out (node=vhbl05, call=717, timeout=2ms)
> 
> (precise interleaving is impossible, as the vhbl05 logs arrived at the
> log server with a delay of 78 s -- probably the syslog daemon was stuck)
> 
> 12:53:51 vhbl03 crmd[8530]:   notice: State transition S_IDLE -> 
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> 12:53:52 vhbl03 pengine[8529]:  warning: Processing failed op monitor for 
> vm-niifdc on vhbl05: unknown error (1)
> 12:53:52 vhbl03 pengine[8529]:   notice: Recover vm-niifdc#011(Started vhbl05)
> 12:53:52 vhbl03 pengine[8529]:   notice: Calculated Transition 909: 
> /var/lib/pacemaker/pengine/pe-input-262.bz2
> 
> The other nodes report in:
> 
> 12:53:57 vhbl04 crmd[9031]:   notice: High CPU load detected: 74.949997
> 12:54:16 vhbl06 crmd[8676]:   notice: High CPU load detected: 93.540001
> 
> while monitor operations keep timing out on vhbl05:
> 
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_6 process 
> (PID 1976) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_6:1976 - 
> timed out after 2ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_6 process (PID 
> 1977) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_6:1977 - timed out 
> after 2ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_6 process (PID 
> 1978) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_6:1978 - timed 
> out after 2ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_6 process (PID 
> 2015) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_6:2015 - timed out 
> after 2ms
> 12:54:13 vhbl05 crmd[9445]:error: Operation 
> vm-FCcontroller_monitor_6: Timed Out (node=vhbl05, call=640, 
> timeout=2ms)
> 12:54:13 vhbl05 crmd[9445]:error: Operation vm-dwdm_monitor_6: Timed 
> Out (node=vhbl05, call=636, timeout=2ms)
> 12:54:13 vhbl05 crmd[9445]:error: Operation vm-eiffel_monitor_6: 
> Timed Out (node=vhbl05, call=633, timeout=2ms)
> 12:54:13 vhbl05 crmd[9445]:error: Operation vm-web7_monitor_6: Timed 
> Out (node=vhbl05, call=638, timeout=2ms)
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_6 process (PID 
> 2101) timed out
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_6:2101 - timed 
> out after 2ms
> 12:54:17 vhbl05 crmd[9445]:error: Operation vm-ftp.pws_monitor_6: 
> Timed Out (node=vhbl05, call=637, timeout=2ms)
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_6 process (PID 
> 2104) timed out
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_6:2104 - timed 
> out after 2ms
> 12:54:17 vhbl05 crmd[9445]:error: Operation vm-cirkusz_monitor_6: 
> Timed Out (node=vhbl05, call=650, timeout=2ms)
> 
> Back on the DC:
> 
> 12:54:22 vhbl03 crmd[8530]:  warning: Request 3308 to pengine 
> (0x7f88810214a0) failed: Resource temporarily unavailable (-11)
> 12:54:22 vhbl03 crmd[8530]:error: Could not contact the pengine: -11
> 12:54:22 vhbl03 crmd[8530]:error: FSA: Input I_ERROR from 
> do_pe_invoke_callback() received in state S_POLICY_ENGINE
> 12:54:22 vhbl03 crmd[8530]:  warning: State transition S_POLICY_ENGINE -> 
> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback ]
> 12:54:22 vhbl03 crmd[8530]:  warning: Fast-tracking shutdown in response to 
> errors
> 12:54:22 vhbl03 crmd[8530]:  warning: Not voting in election, we're in state 
> S_RECOVERY
> 12:54:22 vhbl03 crmd[8530]:error: FSA: Input I_TERMINATE from 
> do_recover() received in state S_RECOVERY
> 12:54:22 vhbl03 crmd[8530]:   notice: Stopped 0 recurring operations at 
> shutdown (32 ops remaining)
> 12:5

Re: [ClusterLabs] pacemaker remote configuration on ubuntu 14.04

2016-03-07 Thread Ken Gaillot
On 03/06/2016 07:43 PM, Сергей Филатов wrote:
> Hi,
> I’m trying to set up a pacemaker_remote resource on Ubuntu 14.04.
> I followed the "remote node walkthrough" guide 
> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/#idm140473081667280)
> After creating the ocf:pacemaker:remote resource on a cluster node, the remote 
> node doesn’t show up as online.
> I guess I need to configure the remote agent to listen on IPv4; where can I 
> configure that?
> Or are there any other steps to set up a remote node besides the ones 
> mentioned in the guide?
> tcp6   0  0 :::3121   :::*   LISTEN   21620/pacemaker_rem   off (0.00/0/0)
> 
> pacemaker and pacemaker_remote are version 1.12


pacemaker_remote will try to bind to IPv6 addresses first, and only if
that fails, will it bind to IPv4. There is no way to configure this
behavior currently, though it obviously would be nice to have.

The only workarounds I can think of are to make IPv6 connections work
between the cluster and the remote node, or disable IPv6 on the remote
node. Using IPv6, there could be an issue if your name resolution
returns both IPv4 and IPv6 addresses for the remote host; you could
potentially work around that by adding an IPv6-only name for it, and
using that as the server option to the remote resource.
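
A rough sketch of that last workaround with crmsh (remote1-v6 is a made-up
IPv6-only name; publish it in DNS or /etc/hosts and point the connection
resource's server parameter at it):

# /etc/hosts on the cluster nodes (address is only an example)
#   fd00::15   remote1-v6
crm configure primitive remote1 ocf:pacemaker:remote \
    params server=remote1-v6 \
    op monitor interval=30s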

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with crm_mon or ClusterMon resource agent

2016-03-07 Thread Ken Gaillot
On 03/06/2016 08:36 AM, Debabrata Pani wrote:
> Hi,
> 
> I would like to understand if anybody has got this working recently.
> 
> Looks like I have missed something in the description and hence the
> problem statement is not clear to the group.
> 
> Can I enable some logs in crm_mon to improve the description of the
> problem ?
> 
> Regards,
> Debabrata Pani
> 
> 
> 
> On 04/03/16 11:48, "Debabrata Pani"  wrote:
> 
>> Hi,
>>
>> I wanted to configure the ClusterMon resource agent so that I can get
>> information about events in the pacemaker cluster.
>> *Objective is to generate traps for some specific resource agents and/or
>> conditions*
>>
>>
>> My cluster installation details :
>> pacemakerd --version
>> Pacemaker 1.1.11
>> Written by Andrew Beekhof
>>
>>
>> corosync -v
>> Corosync Cluster Engine, version '1.4.7'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>>
>> crm_mon --version
>> Pacemaker 1.1.11
>> Written by Andrew Beekhof
>>
>>
>>
>> I followed the following documentation
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch07.html
>> http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/
>>
>> My current cluster status shows:
>>
>> MonitorClusterChange  (ocf::pacemaker:ClusterMon):  Started stvm4
>>
>> The resource agent is configured.
>>
>> The crm_mon is indeed running as a daemon on the node stvm4
>>
>> [root@stvm4 panidata]# ps -ef | grep crm_mon | grep -v grep
>> root  2908 1  0 10:59 ?00:00:00 /usr/sbin/crm_mon -p
>> /tmp/ClusterMon_MonitorClusterChange.pid -d -i 0 -E
>> /root/panidata/testscript.sh -e anonymous -h
>> /tmp/ClusterMon_MonitorClusterChange.html

The "-i 0" refers to the "update" property of your resource, which is
how often crm_mon should recheck the cluster. I'm not sure what zero
would do, but it would be better around 10-30 seconds.
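
For example, the resource could be defined roughly like this in crmsh (update
is in seconds here; the extra_options and htmlfile values are simply taken
from your ps output, and you can check the exact parameter semantics with
crm ra info ocf:pacemaker:ClusterMon):

crm configure primitive MonitorClusterChange ocf:pacemaker:ClusterMon \
    params update=30 \
           extra_options="-E /root/panidata/testscript.sh -e anonymous" \
           htmlfile="/tmp/ClusterMon_MonitorClusterChange.html" \
    op monitor interval=10s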

Is /root/panidata/testscript.sh executable? Does it work when run from
the command line?

>> My test script is the following
>> #!/bin/bash
>> echo "running" >> /root/running.log
>> echo "CRM_notify_recipient=$CRM_notify_recipient"
>> ..
>>
>>
>>
>> As I trigger events by shutting down one or the other service, I see the
>> html file "/tmp/ClusterMon_MonitorClusterChange.html" getting updated each
>> time an event is triggered.
>> So the timestamp of the file keeps on changing.
>>
>> But I am not sure if the script is getting executed, because I don't see
>> any "/root/running.log" file.
>>
>> Things I have tried:
>> * Using the "logger" command instead of echo.
>> * Running the crm_mon command with -d and other parameters manually to
>> check whether the problem is with the resource agent, etc.
>>
>> Queries:
>> * Is this a known issue?
>> * Am I doing something incorrectly?
>>
>>
>>
>>
>> Regards,
>> Debabrata Pani


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Regular pengine warnings after a transient failure

2016-03-07 Thread Ferenc Wágner
Hi,

A couple of days ago the nodes of our Pacemaker 1.1.14 cluster
(vhbl0[3-7]) experienced a temporary storage outage, leading to processes 
getting stuck randomly for a couple of minutes and big load spikes.  There 
were 30 monitor operation timeouts altogether on vhbl05, and an internal 
error on the DC.  What follows is my longish analysis of the logs, which 
may be wrong; if so, I'd be glad to learn about it.  Knowledgeable people
may skip to the end for the main question and a short mention of the
side questions.  So, Pacemaker logs start as:

12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_6 process (PID 
1867) timed out
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_6 process (PID 
1868) timed out
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_6:1867 - timed out 
after 2ms
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_6:1868 - timed out 
after 2ms
12:53:51 vhbl05 crmd[9445]:error: Operation vm-niifdc_monitor_6: Timed 
Out (node=vhbl05, call=720, timeout=2ms)
12:53:52 vhbl05 crmd[9445]:error: Operation vm-niiffs_monitor_6: Timed 
Out (node=vhbl05, call=717, timeout=2ms)

(precise interleaving is impossible, as the vhbl05 logs arrived at the
log server with a delay of 78 s -- probably the syslog daemon was stuck)

12:53:51 vhbl03 crmd[8530]:   notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
12:53:52 vhbl03 pengine[8529]:  warning: Processing failed op monitor for 
vm-niifdc on vhbl05: unknown error (1)
12:53:52 vhbl03 pengine[8529]:   notice: Recover vm-niifdc#011(Started vhbl05)
12:53:52 vhbl03 pengine[8529]:   notice: Calculated Transition 909: 
/var/lib/pacemaker/pengine/pe-input-262.bz2

The other nodes report in:

12:53:57 vhbl04 crmd[9031]:   notice: High CPU load detected: 74.949997
12:54:16 vhbl06 crmd[8676]:   notice: High CPU load detected: 93.540001

while monitor operations keep timing out on vhbl05:

12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_6 process 
(PID 1976) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_6:1976 - 
timed out after 2ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_6 process (PID 1977) 
timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_6:1977 - timed out 
after 2ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_6 process (PID 
1978) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_6:1978 - timed out 
after 2ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_6 process (PID 2015) 
timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_6:2015 - timed out 
after 2ms
12:54:13 vhbl05 crmd[9445]:error: Operation vm-FCcontroller_monitor_6: 
Timed Out (node=vhbl05, call=640, timeout=2ms)
12:54:13 vhbl05 crmd[9445]:error: Operation vm-dwdm_monitor_6: Timed 
Out (node=vhbl05, call=636, timeout=2ms)
12:54:13 vhbl05 crmd[9445]:error: Operation vm-eiffel_monitor_6: Timed 
Out (node=vhbl05, call=633, timeout=2ms)
12:54:13 vhbl05 crmd[9445]:error: Operation vm-web7_monitor_6: Timed 
Out (node=vhbl05, call=638, timeout=2ms)
12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_6 process (PID 
2101) timed out
12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_6:2101 - timed out 
after 2ms
12:54:17 vhbl05 crmd[9445]:error: Operation vm-ftp.pws_monitor_6: Timed 
Out (node=vhbl05, call=637, timeout=2ms)
12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_6 process (PID 
2104) timed out
12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_6:2104 - timed out 
after 2ms
12:54:17 vhbl05 crmd[9445]:error: Operation vm-cirkusz_monitor_6: Timed 
Out (node=vhbl05, call=650, timeout=2ms)

Back on the DC:

12:54:22 vhbl03 crmd[8530]:  warning: Request 3308 to pengine (0x7f88810214a0) 
failed: Resource temporarily unavailable (-11)
12:54:22 vhbl03 crmd[8530]:error: Could not contact the pengine: -11
12:54:22 vhbl03 crmd[8530]:error: FSA: Input I_ERROR from 
do_pe_invoke_callback() received in state S_POLICY_ENGINE
12:54:22 vhbl03 crmd[8530]:  warning: State transition S_POLICY_ENGINE -> 
S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback ]
12:54:22 vhbl03 crmd[8530]:  warning: Fast-tracking shutdown in response to 
errors
12:54:22 vhbl03 crmd[8530]:  warning: Not voting in election, we're in state 
S_RECOVERY
12:54:22 vhbl03 crmd[8530]:error: FSA: Input I_TERMINATE from do_recover() 
received in state S_RECOVERY
12:54:22 vhbl03 crmd[8530]:   notice: Stopped 0 recurring operations at 
shutdown (32 ops remaining)
12:54:22 vhbl03 crmd[8530]:   notice: Recurring action vm-phm6:654 
(vm-phm6_monitor_6) incomplete at shutdown
[31 more similar lines]
12:54:22 vhbl03 crmd[8530]:error: 32 resources were active at shutdown.
12:54:22 vhbl03 crmd[8530]:   not