On 11/03/11 19:05, Raoul Bhatia [IPAX] wrote:
> hi,
>
> disclaimer: it has been some time since i thoroughly studied
> the inner workings of mysql replication, but i think that my
> current knowledge is sufficient to diagnose this problem.
> perhaps i'm lacking skills to find the most decent solution ;)
>
>
> PROBLEM DESCRIPTION:
>
> i use the mysql ra rev 5ffc6ee2fae98adae8f2 from git [2]
> i have my mysql m/s up and running:
>> root@wdb01 ~ # crm status
> ...
>> Online: [ wdb01 wdb02 ]
> ...
>> Master/Slave Set: ms-wdb-mysql
>>     Masters: [ wdb01 ]
>>     Slaves: [ wdb02 ]
>
> my databases are in sync and i truncate the test table:
>> # ssh wdb01c "mysql nagiostest -Be 'truncate table test_table'"
>> # ssh wdb01c "mysql nagiostest -Be 'select * from test_table'"
>> # ssh wdb02c "mysql nagiostest -Be 'select * from test_table'"
>
> inserting some values and query from both nodes to verify that
> everything is in sync:
>> # ssh wdb01c "mysql nagiostest -Be 'insert into test_table values (1),(2)'"
>> # ssh wdb01c "mysql nagiostest -Be 'select * from test_table'"
>> id
>> 1
>> 2
>> # ssh wdb02c "mysql nagiostest -Be 'select * from test_table'"
>> id
>> 1
>> 2
>
> great, now i do a migration that affects ms-wdb-mysql
>> # crm resource migrate wdb-ip0 PT1M
>> Migration will take effect until: 2011-03-11 17:21:50Z
>> WARNING: Creating rsc_location constraint 'cli-standby-wdb-ip0' with a score
>> of -INFINITY for resource wdb-ip0 on wdb01.
>> This will prevent wdb-ip0 from running on wdb01 until the constraint
>> is removed using the 'crm_resource -U' command or manually with cibadmin
>> This will be the case even if wdb01 is the last node in the cluster
>> This message can be disabled with -Q
>
> checking the status:
>> # crm status
> ...
>> Online: [ wdb01 wdb02 ]
> ...
>> Master/Slave Set: ms-wdb-mysql
>>     Masters: [ wdb02 ]
>>     Slaves: [ wdb01 ]
>
> checking the data:
>> # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>> # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>
> -> *everything is fine*
> -> insert more values on the new master wdb02:
>> # ssh wdb02c "mysql nagiostest -Be 'insert into test_table values
>> (3),(4),(5)'"
>> # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>> 3
>> 4
>> 5
>> root@wdb01 ~ # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>> 3
>> 4
>> 5
>
> -> great!
> -> another migration:
>> root@wdb01 ~ # crm resource migrate wdb-ip0 PT1M
>> Migration will take effect until: 2011-03-11 17:27:13Z
>> WARNING: Creating rsc_location constraint 'cli-standby-wdb-ip0' with a score
>> of -INFINITY for resource wdb-ip0 on wdb02.
>> This will prevent wdb-ip0 from running on wdb02 until the constraint
>> is removed using the 'crm_resource -U' command or manually with cibadmin
>> This will be the case even if wdb02 is the last node in the cluster
>> This message can be disabled with -Q
>
> check status:
>> # crm status
> ...
>> Online: [ wdb01 wdb02 ]
> ...
>>     Masters: [ wdb01 ]
>>     Slaves: [ wdb02 ]
>
> -> fine
> -> insert another value on wdb01 and query:
>> # ssh wdb01c "mysql nagiostest -Be 'insert into test_table values (6)'"
>> # ssh wdb01c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>> 3
>> 4
>> 5
>> 6
> -> fine, now query wdb02:
>
>> root@wdb01 ~ # ssh wdb02c "mysql nagiostest -e 'select * from test_table'"
>> id
>> 1
>> 2
>> 6
>
> *error*
>
> things get worse for the next failover, and so on :)
>
>
> MY CONCLUSION:
> the source and some thinking lead me to the following conclusion:
>
> 1. when a new master is promoted, its slave binlogs do not get purged.
>    this is no problem in case of *one* failover, because
>
>    a. the slave will (hopefully) be consistent
>    b. gets promoted to master and
>    c. the old master (with no slave replication logs) will happily start
>       the replication starting with an empty replication log.
>
>
> *but*
> 2. when the roles switch,
>    a. the consistent slave is promoted
>    b. the now new slave will use set_master() from the ra [1]
>    ->
>> ocf_run $MYSQL $MYSQL_OPTIONS_LOCAL $MYSQL_OPTIONS_REPL \
>>     -e "CHANGE MASTER TO MASTER_HOST='$master_host', \
>>         MASTER_USER='$OCF_RESKEY_replication_user', \
>>         MASTER_PASSWORD='$OCF_RESKEY_replication_passwd'"
>
>    c. and will then start the replication, starting with
>       the default RELAY_LOG_FILE and RELAY_LOG_POS=0, as well as
>       the MASTER_LOG_FILE and MASTER_LOG_POS=0 information from
>       master.info/relay-log.info, leading to:
>
>    d. truncate table, insert 1,2 and finally insert 6
>
>
> SOLVING THIS ISSUE?
> i thought about how to solve this problem, and i currently came up with
> the following ideas:
>
> 1. besides using reset master upon failover, also use reset slave to
>    purge any existing relay logs upon changing the master.
>
> 2. maybe it is possible to save and communicate the correct
>    RELAY_LOG_FILE, RELAY_LOG_POS and/or MASTER_LOG_FILE, MASTER_LOG_POS
>    information in the cib?
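Just to make idea 1 concrete: on the node that is demoted to slave, the sequence would roughly look like the following. This is only a sketch; the host, user, password and binlog coordinates are made-up placeholders, and in the RA they would come from the resource parameters and from SHOW MASTER STATUS on the newly promoted node:

    STOP SLAVE;
    RESET SLAVE;  -- throw away the old relay logs and relay-log.info
    CHANGE MASTER TO
        MASTER_HOST='wdb02',                  -- placeholder for $master_host
        MASTER_USER='repl',                   -- placeholder for $OCF_RESKEY_replication_user
        MASTER_PASSWORD='secret',             -- placeholder for $OCF_RESKEY_replication_passwd
        MASTER_LOG_FILE='mysql-bin.000042',   -- placeholder, taken from the new master
        MASTER_LOG_POS=107;                   -- placeholder, taken from the new master
    START SLAVE;

With explicit MASTER_LOG_FILE/MASTER_LOG_POS the slave no longer depends on whatever is left over in master.info/relay-log.info.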
I don't know if this is enough, but my version saves MASTER_LOG_FILE and
MASTER_LOG_POS. See http://marmarek.w.staszic.waw.pl/patches/ha-mysql-ra/
especially:
http://marmarek.w.staszic.waw.pl/patches/ha-mysql-ra/15_mysql-ra-replication-state-for-more-than-two-nodes.patch

--
Best Regards,
Marek Marczykowski          |  gg:2873965  |  RLU #390519
marmarek at staszic waw pl  |  xmpp:marmarek at staszic waw pl
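PS: regarding idea 2 (communicating the position through the cib), a rough sketch of how the coordinates could be published and read back with crm_attribute. The attribute name and the values are only placeholders, and this is not necessarily what the patch above does:

    # on the node that was just promoted, after reading SHOW MASTER STATUS
    crm_attribute --type crm_config --name mysql_repl_info \
        --update "wdb02|mysql-bin.000042|107"

    # on the node that is being reconfigured as slave
    crm_attribute --type crm_config --name mysql_repl_info --query --quiet

The slave side can then split the value and pass MASTER_LOG_FILE/MASTER_LOG_POS to CHANGE MASTER TO instead of relying on master.info.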