Hi, thank you for replying, I'll post inline also.
Dan Frincu-2 wrote:
> Hi,
>
> On Thu, Jan 24, 2013 at 2:07 PM, radurad <radu....@gmail.com> wrote:
>>
>> Hi,
>>
>> Using the following installation under CentOS:
>>
>> corosync-1.4.1-7.el6_3.1.x86_64
>> resource-agents-3.9.2-12.el6.x86_64
>>
>> and the following configuration for a Master/Slave MySQL:
>>
>> primitive mysqld ocf:heartbeat:mysql \
>>     params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
>>         socket="/var/lib/mysql/mysql.sock" datadir="/var/lib/mysql" user="mysql" \
>>         replication_user="root" replication_passwd="testtest" \
>>     op monitor interval="5s" role="Slave" timeout="31s" \
>>     op monitor interval="6s" role="Master" timeout="30s"
>> ms ms_mysql mysqld \
>>     meta master-max="1" master-node-max="1" clone-max="2" \
>>         clone-node-max="1" notify="true"
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
>>     cluster-infrastructure="openais" \
>>     expected-quorum-votes="2" \
>>     no-quorum-policy="ignore" \
>>     stonith-enabled="false" \
>>     last-lrm-refresh="1359026356" \
>>     start-failure-is-fatal="false" \
>
> That should not be set to false.

I've tried with it set to both true and false and there is no difference,
as I'm not seeing any start failure.

>>     cluster-recheck-interval="60s"
>> rsc_defaults $id="rsc-options" \
>>     failure-timeout="50s"
>>
>> With only one node online (the Master; the problem also occurs with a
>> slave online, but for simplicity I've left only the Master online),
>> I run into the problem below:
>> - Stopping the mysql process once results in corosync restarting mysql
>>   and promoting it to Master.
>
> Pacemaker's lrmd restarts MySQL.

You are right, I'm using corosync as the process name because I'm not
entirely sure what each underlying process does. I'll try to use the exact
process names; I hope I won't get them mixed up.
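A quick aside on the knobs involved here: how many monitor failures Pacemaker
tolerates before moving a resource off a node is controlled by
migration-threshold, while failure-timeout (set to 50s in the configuration
above) controls when accumulated fail counts expire. A minimal sketch with the
crm shell follows; the threshold value 3 is only an illustrative assumption,
not something taken from this thread:

```
# Illustrative values; adjust migration-threshold to your environment.
crm configure rsc_defaults migration-threshold=3 failure-timeout=50s

# Query the accumulated fail count for an instance on a given node
# (the pengine's get_failcount log messages report the same information).
crm_failcount -G -r mysqld:0 -N imssmp1
```

These commands require a running cluster, so treat them as a configuration
sketch rather than something to paste blindly.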
>> - Stopping the mysql process again results in nothing; the failure is
>>   not detected, corosync takes no action and still sees the node as
>>   Master and mysql as running.
>
> How do you actually stop MySQL?

Either by using the service script or by killing the process. Again, it
doesn't make any difference; the behavior is the same.

>> - The monitor operation is not running after the first failure, as
>>   there are no entries in the log of the type: INFO: MySQL monitor
>>   succeeded (master).
>
> That sounds strange, could you try the most recent MySQL RA or the one
> from Percona and see if it still does this?

I'm using the latest resource agents (3.9.2), but I'll also try the
Percona agents.

>> - Changing something in the configuration results in corosync
>>   immediately detecting that mysql is not running and promoting it.
>>   The monitor operation will also run until the first failure, at
>>   which point the same problem occurs.
>>
>> If you need more information let me know. I could also attach the log
>> from the messages file.
>
> Are you sure it is even working with this configuration?

I've found no other issue except this one. What is working:
- the slave is always monitored and restarted
- failover is working (once I set migration-threshold and allow a
  migration)
- replication is working (after failover as well)

> HTH,
> Dan

I'll attach some log output, maybe it will help understand what is
happening:

Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:start process 16845 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_start_0 (call=92, rc=0, cib-update=5552, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 41: notify mysqld:0_post_notify_start_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:93: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 43: notify mysqld:1_post_notify_start_0 on imssmp2
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17853 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=93, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: run_graph: ==== Transition 5371 (Complete=11, Pending=0, Fired=0, Skipped=8, Incomplete=5, Source=/var/lib/pengine/pe-input-971.bz2) : Stopped
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_config: Startup probes: enabled
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_domains: Unpacking domains
Jan 28 10:03:31 imssmp1 pengine[2578]: info: determine_online_status: Node imssmp1 is online
Jan 28 10:03:31 imssmp1 pengine[2578]: info: determine_online_status: Node imssmp2 is online
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: warning: unpack_rsc_op: Processing failed op mysqld:0_last_failure_0 on imssmp1: not running (7)
Jan 28 10:03:31 imssmp1 pengine[2578]: warning: unpack_rsc_op: Forcing mysqld:0 to stop after a failed demote action
Jan 28 10:03:31 imssmp1 pengine[2578]: info: clone_print: Master/Slave Set: ms_mysql [mysqld]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: short_print: Slaves: [ imssmp1 imssmp2 ]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: common_apply_stickiness: ms_mysql can fail 999996 more times on imssmp1 before being forced off
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: common_apply_stickiness: ms_mysql can fail 999996 more times on imssmp1 before being forced off
Jan 28 10:03:31 imssmp1 pengine[2578]: info: master_color: Promoting mysqld:0 (Slave imssmp1)
Jan 28 10:03:31 imssmp1 pengine[2578]: info: master_color: ms_mysql: Promoted 1 instances of a possible 1 to master
Jan 28 10:03:31 imssmp1 pengine[2578]: info: RecurringOp: Start recurring monitor (6s) for mysqld:0 on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: info: RecurringOp: Start recurring monitor (6s) for mysqld:0 on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: LogActions: Promote mysqld:0#011(Slave -> Master imssmp1)
Jan 28 10:03:31 imssmp1 pengine[2578]: info: LogActions: Leave mysqld:1#011(Slave imssmp2)
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=15, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=16, needed=24
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: process_pe_message: Transition 5372: PEngine Input stored in: /var/lib/pengine/pe-input-972.bz2
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=15, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=16, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 28 10:03:31 imssmp1 crmd[2579]: info: do_te_invoke: Processing graph 5372 (ref=pe_calc-dc-1359363811-5625) derived from /var/lib/pengine/pe-input-972.bz2
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 49: notify mysqld:0_pre_notify_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:94: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 51: notify mysqld:1_pre_notify_promote_0 on imssmp2
Jan 28 10:03:31 imssmp1 mysql(mysqld:0)[17878]: INFO: This will be new master
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17878 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=94, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 9: promote mysqld:0_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:95: promote
Jan 28 10:03:31 imssmp1 attrd[2577]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-mysqld:0 (3601)
Jan 28 10:03:31 imssmp1 attrd[2577]: notice: attrd_perform_update: Sent update 225: master-mysqld:0=3601
Jan 28 10:03:31 imssmp1 crmd[2579]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-imssmp1-master-mysqld.0, name=master-mysqld:0, value=3601, magic=NA, cib=0.134.55) : Transient attribute: update
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:promote process 17911 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_promote_0 (call=95, rc=0, cib-update=5554, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 50: notify mysqld:0_post_notify_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:96: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 52: notify mysqld:1_post_notify_promote_0 on imssmp2
Jan 28 10:03:31 imssmp1 mysql(mysqld:0)[17946]: INFO: Ignoring post-promote notification for my own promotion.
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17946 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=96, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: run_graph: ==== Transition 5372 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-972.bz2) : Stopped
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]

What appears in the logs only in this case is the abort_transition_graph
entry after a "Sending flush op to all hosts".

Have a nice day,
Radu Rad.

--
View this message in context: http://old.nabble.com/Master-Slave---Master-node-not-monitored-after-a-failure-tp34939865p34953070.html
Sent from the Linux-HA mailing list archive at Nabble.com.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
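P.S. One way to make the missing-monitor symptom easy to spot is to count
completed monitor operations in the crmd log after the promote. The sketch
below runs against a two-line stand-in log file; the second line (its call
number, rc, and timestamp) is an illustrative assumption of what a healthy
recurring monitor result would look like. On a real node you would grep
/var/log/messages instead:

```shell
# Build a tiny stand-in log file (illustrative content, not from a real node).
cat > /tmp/ha-sample.log <<'EOF'
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_promote_0 (call=95, rc=0, cib-update=5554, confirmed=true) ok
Jan 28 10:03:37 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_monitor_6000 (call=97, rc=8, cib-update=5560, confirmed=false) master
EOF

# Count completed recurring monitor operations; a count that stops
# increasing after a promote matches the "monitor operation is not
# running" symptom described in this thread.
grep -c '_monitor_' /tmp/ha-sample.log
```

Here the count is 1 (only the second line matches); on the affected node the
count would stay flat after the first failure.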