Can you include a crm_report for your test scenario?

a) I need the pe files, but also
b) parsing line-wrapped logs is seriously painful

On 05/07/2013, at 7:09 PM, Martin Gazak <[email protected]> wrote:
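For anyone following along: assuming the stock tooling shipped with pacemaker 1.1.x, a report covering the failure window can be collected, and the stored policy-engine inputs replayed, roughly like this (the timestamps and pe-input path are taken from the log excerpt below; the destination name is arbitrary):

```shell
# Collect logs, the CIB and the PE input files from all nodes for the
# window around the Jul 04 23:45 incident (adjust times as needed).
crm_report -f "2013-07-04 23:30:00" -t "2013-07-05 00:30:00" /tmp/ims-failover

# Replay one of the stored PE inputs to see what the policy engine
# decided; -s additionally shows the allocation scores behind it.
crm_simulate -S -s -x /var/lib/pengine/pe-input-2819.bz2
```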
> Hello,
>
> we are facing a problem with a simple (I hope) cluster configuration
> with 2 nodes, ims0 and ims1, and 3 primitives (no shared storage or
> anything similar where data corruption would be a danger):
>
> - master/slave Java application ims (normally to run on both nodes
>   as master/slave, with our own OCF script) with an embedded web
>   server (to be accessed by clients)
>
> - ims-ip and ims-ip-src: shared IP address and outgoing source
>   address, to run only on the ims master
>
> Below are the software versions, the crm configuration and portions
> of the corosync log.
>
> The problem is that although the setup works most of the time (i.e.
> if the master ims application dies, the slave is promoted and the IP
> addresses are remapped), sometimes when the master ims application
> stops (fails or is killed) the failover does not occur - the slave
> ims application remains slave and the shared IP address remains
> mapped on the node with the dead ims.
>
> I even created a testbed of 2 servers, killing the ims application
> from cron every 15 minutes on the supposed MAIN server, to simulate
> the failure, observe the failover and replicate the problem
> (sometimes it works properly for hours/days).
>
> For example, today (July 4, 23:45 local time) the ims on ims0 was
> killed but remained Master - no failover of the IP addresses was
> performed and the ims on ims1 remained Slave:
>
> ============
> Last updated: Fri Jul 5 02:07:18 2013
> Last change: Thu Jul 4 23:33:46 2013
> Stack: openais
> Current DC: ims0 - partition with quorum
> Version: 1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Online: [ ims1 ims0 ]
>
> Master/Slave Set: ms-ims [ims]
>     Masters: [ ims0 ]
>     Slaves: [ ims1 ]
> Clone Set: clone-cluster-mon [cluster-mon]
>     Started: [ ims0 ims1 ]
> Resource Group: on-ims-master
>     ims-ip     (ocf::heartbeat:IPaddr2):   Started ims0
>     ims-ip-src (ocf::heartbeat:IPsrcaddr): Started ims0
>
> The command 'crm node standby' on ims0 did not fix things: ims0
> remained master (although standby):
>
> Node ims0: standby
> Online: [ ims1 ]
>
> Master/Slave Set: ms-ims [ims]
>     ims:0 (ocf::microstepmis:imsMS): Slave ims0 FAILED
>     Slaves: [ ims1 ]
> Clone Set: clone-cluster-mon [cluster-mon]
>     Started: [ ims1 ]
>     Stopped: [ cluster-mon:0 ]
>
> Failed actions:
>     ims:0_demote_0 (node=ims0, call=3179, rc=7, status=complete):
>     not running
>
> Stopping the openais service on ims0 completely did the trick.
>
> Could someone provide me with a hint what to do?
> - provide more information (logs, OCF script)?
> - change something in the configuration?
> - change the environment / versions?
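While gathering that information, the failed demote and any recorded fail counts can also be inspected and cleared directly. A sketch with standard pacemaker tools (resource and node names from your configuration):

```shell
# One-shot cluster status including per-resource fail counts.
crm_mon -1 --failcounts

# Clear the failure history for the ims resource once investigated;
# 'crm node standby' depends on a successful demote/stop, so a failed
# demote (rc=7 above) generally needs cleaning up first.
crm_resource --cleanup -r ims
```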
>
> Thanks a lot
>
> Martin Gazak
>
>
> Software versions:
> ------------------
> libpacemaker3-1.1.7-42.1
> pacemaker-1.1.7-42.1
> corosync-1.4.3-21.1
> libcorosync4-1.4.3-21.1
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 2
>
>
> Configuration:
> --------------
> node ims0 \
>         attributes standby="off"
> node ims1 \
>         attributes standby="off"
> primitive cluster-mon ocf:pacemaker:ClusterMon \
>         params htmlfile="/opt/ims/tomcat/webapps/ims/html/crm_status.html" \
>         op monitor interval="10"
> primitive ims ocf:microstepmis:imsMS \
>         op monitor interval="1" role="Master" timeout="20" \
>         op monitor interval="2" role="Slave" timeout="20" \
>         op start interval="0" timeout="1800s" \
>         op stop interval="0" timeout="120s" \
>         op promote interval="0" timeout="180s" \
>         meta failure-timeout="360s"
> primitive ims-ip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.141.13" nic="bond1" iflabel="ims" cidr_netmask="24" \
>         op monitor interval="15s" \
>         meta failure-timeout="60s"
> primitive ims-ip-src ocf:heartbeat:IPsrcaddr \
>         params ipaddress="192.168.141.13" cidr_netmask="24" \
>         op monitor interval="15s" \
>         meta failure-timeout="60s"
> group on-ims-master ims-ip ims-ip-src
> ms ms-ims ims \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true" target-role="Started" \
>         migration-threshold="1"
> clone clone-cluster-mon cluster-mon
> colocation ims_master inf: on-ims-master ms-ims:Master
> order ms-ims-before inf: ms-ims:promote on-ims-master:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         cluster-recheck-interval="1m" \
>         default-resource-stickiness="1000" \
>         last-lrm-refresh="1372951736" \
>         maintenance-mode="false"
>
>
> corosync.log from ims0:
> -----------------------
> Jul 04 23:45:02 ims0 crmd: [3935]: info:
> process_lrm_event: LRM operation ims:0_monitor_1000 (call=3046, rc=7, cib-update=6229, confirmed=false) not running
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_graph_event: Detected action ims:0_monitor_1000 from a different transition: 4024 vs. 4035
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ims:0_last_failure_0, magic=0:7;7:4024:8:e3f096a7-4eb5-4810-9310-eb144f595e20, cib=0.717.6) : Old event
> Jul 04 23:45:02 ims0 crmd: [3935]: WARN: update_failcount: Updating failcount for ims:0 on ims0 after failed monitor: rc=7 (update=value++, time=1372952702)
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ims:0 (1)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jul 04 23:45:02 ims0 pengine: [3933]: WARN: unpack_rsc_op: Processing failed op ims:0_last_failure_0 on ims0: not running (7)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Recover ims:0 (Master ims0)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart ims-ip (Started ims0)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart ims-ip-src (Started ims0)
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph 4036 (ref=pe_calc-dc-1372952702-11907) derived from /var/lib/pengine/pe-input-2819.bz2
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 51: stop ims-ip-src_stop_0 on ims0 (local)
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent update 4439: fail-count-ims:0=1
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ims:0 (1372952702)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: cancel_op: operation monitor[3049] on ims-ip-src for client 3935, its parameters: CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000] CRM_meta_interval=[15000] ipaddress=[192.168.141.13] cancelled
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent update 4441: last-failure-ims:0=1372952702
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src stop[3052] (pid 12111)
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-ims0-fail-count-ims.0, name=fail-count-ims:0, value=1, magic=NA, cib=0.717.7) : Transient attribute: update
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-ims0-last-failure-ims.0, name=last-failure-ims:0, value=1372952702, magic=NA, cib=0.717.8) : Transient attribute: update
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip-src_monitor_15000 (call=3049, status=1, cib-update=0, confirmed=true) Cancelled
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message: Transition 4036: PEngine Input stored in: /var/lib/pengine/pe-input-2819.bz2
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation stop[3052] on ims-ip-src for client 3935: pid 12111 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip-src_stop_0 (call=3052, rc=0, cib-update=6231, confirmed=true) ok
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition 4036 (Complete=3, Pending=0, Fired=0, Skipped=32, Incomplete=19, Source=/var/lib/pengine/pe-input-2819.bz2): Stopped
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount for ms-ims on ims0 has expired (limit was 360s)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_rsc_op: Clearing expired failcount for ims:0 on ims0
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount for ms-ims on ims0 has expired (limit was 360s)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_rsc_op: Clearing expired failcount for ims:0 on ims0
> Jul 04 23:45:02 ims0 pengine: [3933]: WARN: unpack_rsc_op: Processing failed op ims:0_last_failure_0 on ims0: not running (7)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount for ms-ims on ims0 has expired (limit was 360s)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: get_failcount: Failcount for ms-ims on ims0 has expired (limit was 360s)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Recover ims:0 (Master ims0)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Restart ims-ip (Started ims0)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start ims-ip-src (ims0)
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph 4037 (ref=pe_calc-dc-1372952702-11909) derived from /var/lib/pengine/pe-input-2820.bz2
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_crm_command: Executing crm-event (3): clear_failcount on ims0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 49: stop ims-ip_stop_0 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: cancel_op: operation monitor[3047] on ims-ip for client 3935, its parameters: cidr_netmask=[24] nic=[bond1] crm_feature_set=[3.0.6] ip=[192.168.141.13] iflabel=[ims] CRM_meta_name=[monitor] CRM_meta_timeout=[20000] CRM_meta_interval=[15000] cancelled
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip stop[3053] (pid 12154)
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip_monitor_15000 (call=3047, status=1, cib-update=0, confirmed=true) Cancelled
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 72: notify ims:0_pre_notify_demote_0 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims:0 notify[3054] (pid 12155)
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 74: notify ims:1_pre_notify_demote_0 on ims1
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation notify[3054] on ims:0 for client 3935: pid 12155 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims:0_notify_0 (call=3054, rc=0, cib-update=0, confirmed=true) ok
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message: Transition 4037: PEngine Input stored in: /var/lib/pengine/pe-input-2820.bz2
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output: (ims-ip:stop:stderr) 2013/07/04_23:45:02 INFO: IP status = ok, IP_CIP=
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation stop[3053] on ims-ip for client 3935: pid 12154 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip_stop_0 (call=3053, rc=0, cib-update=6233, confirmed=true) ok
> Jul 04 23:45:02 ims0 crmd: [3935]: info: handle_failcount_op: Removing failcount for ims:0
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ims:0 (<null>)
> Jul 04 23:45:02 ims0 cib: [3929]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='ims0']//lrm_resource[@id='ims:0']/lrm_rsc_op[@id='ims:0_last_failure_0'] (origin=local/crmd/6234, version=0.717.11): ok (rc=0)
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: te_update_diff:321 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=ims:0_last_failure_0, magic=0:7;7:4024:8:e3f096a7-4eb5-4810-9310-eb144f595e20, cib=0.717.11) : Resource op removal
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent delete 4443: node=ims0, attr=fail-count-ims:0, id=<n/a>, set=(null), section=status
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=0, tag=transient_attributes, id=ims0, magic=NA, cib=0.717.12) : Transient attribute: removal
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ims:0 (<null>)
> Jul 04 23:45:02 ims0 attrd: [3932]: notice: attrd_perform_update: Sent delete 4445: node=ims0, attr=last-failure-ims:0, id=<n/a>, set=(null), section=status
> Jul 04 23:45:02 ims0 crmd: [3935]: info: abort_transition_graph: te_update_diff:194 - Triggered transition abort (complete=0, tag=transient_attributes, id=ims0, magic=NA, cib=0.717.13) : Transient attribute: removal
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition 4037 (Complete=7, Pending=0, Fired=0, Skipped=28, Incomplete=19, Source=/var/lib/pengine/pe-input-2820.bz2): Stopped
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start ims-ip (ims0)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: LogActions: Start ims-ip-src (ims0)
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jul 04 23:45:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph 4038 (ref=pe_calc-dc-1372952702-11915) derived from /var/lib/pengine/pe-input-2821.bz2
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 47: start ims-ip_start_0 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip start[3055] (pid 12197)
> Jul 04 23:45:02 ims0 pengine: [3933]: notice: process_pe_message: Transition 4038: PEngine Input stored in: /var/lib/pengine/pe-input-2821.bz2
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output: (ims-ip:start:stderr) 2013/07/04_23:45:02 INFO: Adding IPv4 address 192.168.141.13/24 with broadcast address 192.168.141.255 to device bond1 (with label bond1:ims)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output: (ims-ip:start:stderr) 2013/07/04_23:45:02 INFO: Bringing device bond1 up
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: RA output: (ims-ip:start:stderr) 2013/07/04_23:45:02 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.141.13 bond1 192.168.141.13 auto not_used not_used
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation start[3055] on ims-ip for client 3935: pid 12197 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip_start_0 (call=3055, rc=0, cib-update=6236, confirmed=true) ok
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 48: monitor ims-ip_monitor_15000 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip monitor[3056] (pid 12255)
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 49: start ims-ip-src_start_0 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src start[3057] (pid 12256)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation monitor[3056] on ims-ip for client 3935: pid 12255 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip_monitor_15000 (call=3056, rc=0, cib-update=6237, confirmed=false) ok
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation start[3057] on ims-ip-src for client 3935: pid 12256 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip-src_start_0 (call=3057, rc=0, cib-update=6238, confirmed=true) ok
> Jul 04 23:45:02 ims0 crmd: [3935]: info: te_rsc_command: Initiating action 50: monitor ims-ip-src_monitor_15000 on ims0 (local)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: rsc:ims-ip-src monitor[3058] (pid 12336)
> Jul 04 23:45:02 ims0 lrmd: [3931]: info: operation monitor[3058] on ims-ip-src for client 3935: pid 12336 exited with return code 0
> Jul 04 23:45:02 ims0 crmd: [3935]: info: process_lrm_event: LRM operation ims-ip-src_monitor_15000 (call=3058, rc=0, cib-update=6239, confirmed=false) ok
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition 4038 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-2821.bz2): Complete
> Jul 04 23:45:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jul 04 23:46:02 ims0 crmd: [3935]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (60000ms)
> Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jul 04 23:46:02 ims0 crmd: [3935]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jul 04 23:46:02 ims0 pengine: [3933]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jul 04 23:46:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph 4039 (ref=pe_calc-dc-1372952762-11920) derived from /var/lib/pengine/pe-input-2822.bz2
> Jul 04 23:46:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition 4039 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-2822.bz2): Complete
> Jul 04 23:46:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jul 04 23:46:02 ims0 pengine: [3933]: notice: process_pe_message: Transition 4039: PEngine Input stored in: /var/lib/pengine/pe-input-2822.bz2
> Jul 04 23:47:02 ims0 crmd: [3935]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (60000ms)
> Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jul 04 23:47:02 ims0 crmd: [3935]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jul 04 23:47:02 ims0 pengine: [3933]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jul 04 23:47:02 ims0 pengine: [3933]: notice: process_pe_message: Transition 4040: PEngine Input stored in: /var/lib/pengine/pe-input-2822.bz2
> Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jul 04 23:47:02 ims0 crmd: [3935]: info: do_te_invoke: Processing graph 4040 (ref=pe_calc-dc-1372952822-11921) derived from /var/lib/pengine/pe-input-2822.bz2
> Jul 04 23:47:02 ims0 crmd: [3935]: notice: run_graph: ==== Transition 4040 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-2822.bz2): Complete
> Jul 04 23:47:02 ims0 crmd: [3935]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> corosync.log from ims1:
> -----------------------
> Jul 04 23:45:02 ims1 lrmd: [3913]: info: rsc:ims:1 notify[1424] (pid 25381)
> Jul 04 23:45:02 ims1 lrmd: [3913]: info: operation notify[1424] on ims:1 for client 3917: pid 25381 exited with return code 0
> Jul 04 23:45:02 ims1 crmd: [3917]: info: process_lrm_event: LRM operation ims:1_notify_0 (call=1424, rc=0, cib-update=0, confirmed=true) ok
> Jul 04 23:49:35 ims1 cib: [3911]: info: cib_stats: Processed 324 operations (92.00us average, 0% utilization) in the last 10min
> Jul 04 23:59:35 ims1 cib: [3911]: info: cib_stats: Processed 295 operations (67.00us average, 0% utilization) in the last 10min
> Jul 05 00:00:03 ims1 crmd: [3917]: info: process_lrm_event: LRM operation ims:1_monitor_2000 (call=1423, rc=7, cib-update=778, confirmed=false) not running
> Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_ais_dispatch: Update relayed from ims0
> Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ims:1 (1)
> Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_perform_update: Sent update 2037: fail-count-ims:1=1
> Jul 05 00:00:03 ims1 attrd: [3914]: notice: attrd_ais_dispatch: Update relayed from ims0
>
> --
> Regards,
> Martin Gazak
> MicroStep-MIS, spol. s r.o.
> System Development Manager
> Tel.: +421 2 602 00 128
> Fax: +421 2 602 00 180
> [email protected]
> http://www.microstep-mis.com
>
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
