Hi, thank you for replying, I'll post inline also.
Dan Frincu-2 wrote:
> Hi,
>
> On Thu, Jan 24, 2013 at 2:07 PM, radurad <radu....@gmail.com> wrote:
>>
>> Hi,
>>
>> Using the following installation under CentOS:
>>
>> corosync-1.4.1-7.el6_3.1.x86_64
>> resource-agents-3.9.2-12.el6.x86_64
>>
>> and the following configuration for a Master/Slave MySQL:
>>
>> primitive mysqld ocf:heartbeat:mysql \
>>     params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
>>         socket="/var/lib/mysql/mysql.sock" datadir="/var/lib/mysql" user="mysql" \
>>         replication_user="root" replication_passwd="testtest" \
>>     op monitor interval="5s" role="Slave" timeout="31s" \
>>     op monitor interval="6s" role="Master" timeout="30s"
>> ms ms_mysql mysqld \
>>     meta master-max="1" master-node-max="1" clone-max="2" \
>>         clone-node-max="1" notify="true"
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
>>     cluster-infrastructure="openais" \
>>     expected-quorum-votes="2" \
>>     no-quorum-policy="ignore" \
>>     stonith-enabled="false" \
>>     last-lrm-refresh="1359026356" \
>>     start-failure-is-fatal="false" \
>
> That should not be set to false.

I've tried with it set to both true and false and there is no difference,
as I'm not seeing any start failure.

>>     cluster-recheck-interval="60s"
>> rsc_defaults $id="rsc-options" \
>>     failure-timeout="50s"
>>
>> With only one node online (the Master; the problem also occurs with a
>> slave online, but for simplicity I've left only the Master online),
>> I run into the problem below:
>> - Stopping the mysql process once results in corosync restarting mysql
>>   and promoting it to Master.
>
> Pacemaker's lrmd restarts MySQL.

You are right, I'm using corosync as the process name because I'm not
entirely sure what each underlying process does. I'll try to use the exact
process names; I hope I won't get them mixed up.
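A quick aside on the knobs involved here: how many monitor failures Pacemaker
tolerates before moving a resource off a node is controlled by
migration-threshold, while failure-timeout (set to 50s in the configuration
above) controls when accumulated fail counts expire. A minimal sketch with the
crm shell follows; the threshold value 3 is only an illustrative assumption,
not something taken from this thread:

```
# Illustrative values; adjust migration-threshold to your environment.
crm configure rsc_defaults migration-threshold=3 failure-timeout=50s

# Query the accumulated fail count for an instance on a given node
# (the pengine's get_failcount log messages report the same information).
crm_failcount -G -r mysqld:0 -N imssmp1
```

These commands require a running cluster, so treat them as a configuration
sketch rather than something to paste blindly.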
>> - Stopping the mysql process again results in nothing; the failure is
>>   not detected, corosync takes no action and still sees the node as
>>   Master and mysql as running.
>
> How do you actually stop MySQL?

Either by using the service script or by killing the process. Again, it
doesn't make any difference; the behavior is the same.

>> - The monitor operation is not running after the first failure, as
>>   there are no entries in the log of the type: INFO: MySQL monitor
>>   succeeded (master).
>
> That sounds strange, could you try the most recent MySQL RA or the one
> from Percona and see if it still does this?

I'm using the latest resource agents (3.9.2), but I'll also try the
Percona agents.

>> - Changing something in the configuration results in corosync
>>   immediately detecting that mysql is not running and promoting it.
>>   The monitor operation will also run until the first failure, at
>>   which point the same problem occurs.
>>
>> If you need more information let me know. I could also attach the log
>> from the messages file.
>
> Are you sure it is even working with this configuration?

I've found no other issue except this one. What is working:
- the slave is always monitored and restarted
- failover is working (once I set migration-threshold and allow a
  migration)
- replication is working (after failover as well)

> HTH,
> Dan

I'll attach some log output, maybe it will help understand what is
happening:

Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:start process 16845 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_start_0 (call=92, rc=0, cib-update=5552, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 41: notify mysqld:0_post_notify_start_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:93: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 43: notify mysqld:1_post_notify_start_0 on imssmp2
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17853 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=93, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: run_graph: ==== Transition 5371 (Complete=11, Pending=0, Fired=0, Skipped=8, Incomplete=5, Source=/var/lib/pengine/pe-input-971.bz2) : Stopped
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_config: Startup probes: enabled
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 28 10:03:31 imssmp1 pengine[2578]: info: unpack_domains: Unpacking domains
Jan 28 10:03:31 imssmp1 pengine[2578]: info: determine_online_status: Node imssmp1 is online
Jan 28 10:03:31 imssmp1 pengine[2578]: info: determine_online_status: Node imssmp2 is online
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: warning: unpack_rsc_op: Processing failed op mysqld:0_last_failure_0 on imssmp1: not running (7)
Jan 28 10:03:31 imssmp1 pengine[2578]: warning: unpack_rsc_op: Forcing mysqld:0 to stop after a failed demote action
Jan 28 10:03:31 imssmp1 pengine[2578]: info: clone_print: Master/Slave Set: ms_mysql [mysqld]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: short_print: Slaves: [ imssmp1 imssmp2 ]
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: common_apply_stickiness: ms_mysql can fail 999996 more times on imssmp1 before being forced off
Jan 28 10:03:31 imssmp1 pengine[2578]: info: get_failcount: ms_mysql has failed 4 times on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: common_apply_stickiness: ms_mysql can fail 999996 more times on imssmp1 before being forced off
Jan 28 10:03:31 imssmp1 pengine[2578]: info: master_color: Promoting mysqld:0 (Slave imssmp1)
Jan 28 10:03:31 imssmp1 pengine[2578]: info: master_color: ms_mysql: Promoted 1 instances of a possible 1 to master
Jan 28 10:03:31 imssmp1 pengine[2578]: info: RecurringOp: Start recurring monitor (6s) for mysqld:0 on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: info: RecurringOp: Start recurring monitor (6s) for mysqld:0 on imssmp1
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: LogActions: Promote mysqld:0#011(Slave -> Master imssmp1)
Jan 28 10:03:31 imssmp1 pengine[2578]: info: LogActions: Leave mysqld:1#011(Slave imssmp2)
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=15, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=16, needed=24
Jan 28 10:03:31 imssmp1 pengine[2578]: notice: process_pe_message: Transition 5372: PEngine Input stored in: /var/lib/pengine/pe-input-972.bz2
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=15, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: error: log_data_element: Output truncated: available=16, needed=24
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 28 10:03:31 imssmp1 crmd[2579]: info: do_te_invoke: Processing graph 5372 (ref=pe_calc-dc-1359363811-5625) derived from /var/lib/pengine/pe-input-972.bz2
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 49: notify mysqld:0_pre_notify_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:94: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 51: notify mysqld:1_pre_notify_promote_0 on imssmp2
Jan 28 10:03:31 imssmp1 mysql(mysqld:0)[17878]: INFO: This will be new master
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17878 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=94, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 9: promote mysqld:0_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:95: promote
Jan 28 10:03:31 imssmp1 attrd[2577]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-mysqld:0 (3601)
Jan 28 10:03:31 imssmp1 attrd[2577]: notice: attrd_perform_update: Sent update 225: master-mysqld:0=3601
Jan 28 10:03:31 imssmp1 crmd[2579]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-imssmp1-master-mysqld.0, name=master-mysqld:0, value=3601, magic=NA, cib=0.134.55) : Transient attribute: update
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:promote process 17911 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_promote_0 (call=95, rc=0, cib-update=5554, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 50: notify mysqld:0_post_notify_promote_0 on imssmp1 (local)
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: rsc:mysqld:0:96: notify
Jan 28 10:03:31 imssmp1 crmd[2579]: info: te_rsc_command: Initiating action 52: notify mysqld:1_post_notify_promote_0 on imssmp2
Jan 28 10:03:31 imssmp1 mysql(mysqld:0)[17946]: INFO: Ignoring post-promote notification for my own promotion.
Jan 28 10:03:31 imssmp1 lrmd: [2576]: info: Managed mysqld:0:notify process 17946 exited with return code 0.
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_notify_0 (call=96, rc=0, cib-update=0, confirmed=true) ok
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: run_graph: ==== Transition 5372 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-972.bz2) : Stopped
Jan 28 10:03:31 imssmp1 crmd[2579]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]

What appears in the logs only in this case is the abort_transition_graph
entry after a "Sending flush op to all hosts".

Have a nice day,
Radu Rad.

--
View this message in context: http://old.nabble.com/Master-Slave---Master-node-not-monitored-after-a-failure-tp34939865p34953070.html
Sent from the Linux-HA mailing list archive at Nabble.com.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
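P.S. One way to make the missing-monitor symptom easy to spot is to count
completed monitor operations in the crmd log after the promote. The sketch
below runs against a two-line stand-in log file; the second line (its call
number, rc, and timestamp) is an illustrative assumption of what a healthy
recurring monitor result would look like. On a real node you would grep
/var/log/messages instead:

```shell
# Build a tiny stand-in log file (illustrative content, not from a real node).
cat > /tmp/ha-sample.log <<'EOF'
Jan 28 10:03:31 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_promote_0 (call=95, rc=0, cib-update=5554, confirmed=true) ok
Jan 28 10:03:37 imssmp1 crmd[2579]: info: process_lrm_event: LRM operation mysqld:0_monitor_6000 (call=97, rc=8, cib-update=5560, confirmed=false) master
EOF

# Count completed recurring monitor operations; a count that stops
# increasing after a promote matches the "monitor operation is not
# running" symptom described in this thread.
grep -c '_monitor_' /tmp/ha-sample.log
```

Here the count is 1 (only the second line matches); on the affected node the
count would stay flat after the first failure.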