Re: [openstack-dev] [Fuel-dev] [Openstack-dev] New RA for Galera

Bogdan Dobrelya Thu, 29 May 2014 01:42:07 -0700

On 05/27/14 16:44, Bartosz Kupidura wrote:
> Hello,
> Responses inline.
> 
> 
> On 27 May 2014, at 15:12, Vladimir Kuklin <vkuk...@mirantis.com> wrote:
> 
>> Hi, Bartosz
>>
>> First of all, we are using openstack-dev for such discussions.
>>
>> Second, there is also Percona's RA for Percona XtraDB Cluster, which looks 
>> pretty similar, although it is written in Perl. Maybe we could derive 
>> something useful from it.
>>
>> Next, if you are working on this stuff, let's make it as open to the 
>> community as possible. There is a blueprint for the Galera OCF script: 
>> https://blueprints.launchpad.net/fuel/+spec/reliable-galera-ocf-script. It 
>> would be awesome if you wrote down the specification and sent the new Galera 
>> OCF code as a change request to fuel-library gerrit.
> 
> Sure, I will update this blueprint. 
> Change request in fuel-library: https://review.openstack.org/#/c/95764/


That is a really nice catch, Bartosz, thank you. I believe we should
review the new OCF script thoroughly and consider omitting
cs_commits/cs_shadows as well. What would be the downsides?

> 
>>
>> Speaking of the crm_attribute stuff: I am very surprised that you are saying 
>> that node attributes are altered by crm shadow commit. We are using a similar 
>> approach in our scripts and have never faced this issue.
> 
> This is probably because you update crm_attribute very rarely. With my 
> approach, the GTID attribute is updated every 60s on every node (3 updates 
> per 60s in a standard 3-node HA setup). 
> 
> You can update any attribute in a loop while the cluster is being deployed to 
> trigger the corosync diff failure.

It sounds reasonable and we should verify it.
I've updated the statuses for related bugs and attached them to the
aforementioned blueprint as well:
https://bugs.launchpad.net/fuel/+bug/1283062/comments/7
https://bugs.launchpad.net/fuel/+bug/1281592/comments/6
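To verify Bartosz's claim, a reproduction can hammer a node attribute while the deployment runs its cs_shadow/cs_commit manifests. The sketch below is hypothetical: it assumes the crm_attribute flags quoted in the original proposal (--node, --lifetime, --name, --update), a throwaway attribute name `repro_counter`, and a dry-run default that only prints the commands.

```python
import subprocess
import time


def build_update_cmd(node, name, value, lifetime="reboot"):
    """Build the crm_attribute invocation described in the proposal."""
    return ["crm_attribute", "--node", node, "--lifetime", lifetime,
            "--name", name, "--update", str(value)]


def hammer_attribute(node, iterations=100, interval=1.0, dry_run=True):
    """Update a throwaway node attribute in a loop.

    'repro_counter' is a hypothetical attribute name used only for this test.
    With dry_run=True nothing is executed; the commands are just returned.
    """
    commands = []
    for i in range(iterations):
        cmd = build_update_cmd(node, "repro_counter", i)
        commands.append(cmd)
        if not dry_run:
            # Real run: write to the live CIB, then wait before the next update.
            subprocess.run(cmd, check=True)
            time.sleep(interval)
    return commands


if __name__ == "__main__":
    import socket
    for cmd in hammer_attribute(socket.gethostname(), iterations=3):
        print(" ".join(cmd))
```

Running it with dry_run=False on one controller during a Fuel deployment should show whether cs_commit starts failing with the diff error.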


> 
>>
>> Corosync 2.x support is on our roadmap, but we are not sure that we will use 
>> Corosync 2.x before the 6.x release series starts.
> 
> Yeah, moreover, corosync CMAP is not synced between cluster nodes (or maybe I'm 
> doing something wrong?). So we need another solution for this...
> 

We should use CMAN for Corosync 1.x, perhaps.

>>
>>
>> On Tue, May 27, 2014 at 3:08 PM, Bartosz Kupidura <bkupid...@mirantis.com> 
>> wrote:
>> Hello guys!
>> I would like to start a discussion on a new resource agent for 
>> galera/pacemaker.
>>
>> Main features:
>> * Support cluster bootstrap
>> * Support rebooting any node in the cluster
>> * Support rebooting the whole cluster
>> * To determine which node has the latest DB version, use the Galera GTID 
>> (Global Transaction ID)
>> * The node with the latest GTID becomes the Galera PC (Primary Component) on 
>> reelection
>> * An administrator can manually set a node as PC
>>
>> GTID:
>> * get the GTID from mysqld --wsrep-recover or from the SQL queries "SHOW STATUS 
>> LIKE 'wsrep_local_state_uuid'" (cluster id) and "SHOW STATUS LIKE 
>> 'wsrep_last_committed'" (commit id)
>> * store the GTID as a crm_attribute for the node (crm_attribute --node $HOSTNAME 
>> --lifetime $LIFETIME --name gtid --update $GTID)
>> * update the GTID for the given node on every monitor/stop/start action
>> * the GTID can have 3 formats:
>>  - XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX:123 - standard cluster-id:commit-id
>>  - XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX:-1 - standard non-initialized 
>> cluster, e.g. 00000000-0000-0000-0000-000000000000:-1
>>  - XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX:INF - commit-id manually set to INF, 
>> forcing the RA to create a new cluster with the master on the given node
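The three GTID formats listed above can be parsed into a comparable (cluster id, commit id) pair. This is a minimal Python sketch, not the RA's actual shell code: the function name `parse_gtid` is hypothetical, and INF is mapped to `math.inf` so that a forced node compares higher than any real commit id.

```python
import math
import re

# cluster-id is 8-4-4-4-12 hex digits; commit-id is -1, a non-negative int, or INF.
_GTID_RE = re.compile(
    r"^(?P<uuid>[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
    r":(?P<seqno>-1|\d+|INF)$"
)


def parse_gtid(gtid):
    """Split a GTID string into (cluster_uuid, seqno).

    seqno is an int for committed (>= 0) and non-initialized (-1) nodes,
    and math.inf when an administrator has forced the node to become PC.
    """
    m = _GTID_RE.match(gtid.strip())
    if m is None:
        raise ValueError(f"unrecognised GTID: {gtid!r}")
    raw = m.group("seqno")
    seqno = math.inf if raw == "INF" else int(raw)
    return m.group("uuid").lower(), seqno
```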
>>
>> Check whether PC reelection is needed:
>> * (the node is located in a partition with quorum OR only 1 node is 
>> configured in the cluster) AND the galera resource is not running on any node
>> * the GTID is manually set to INF on the given node
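The reelection check above reduces to a small boolean expression. In this hypothetical sketch the parameter names are illustrative, and it assumes the two bullets are alternatives (either one triggers reelection), which the proposal implies but does not state outright.

```python
def needs_reelection(has_quorum, configured_nodes, galera_running_on, local_gtid_forced):
    """Return True if a new Primary Component should be elected.

    has_quorum: this node sits in a partition with quorum
    configured_nodes: number of nodes configured in the cluster
    galera_running_on: names of nodes where the galera resource is running
    local_gtid_forced: the GTID on this node was manually set to INF
    """
    partition_ok = has_quorum or configured_nodes == 1
    nothing_running = len(galera_running_on) == 0
    # Assumption: either condition on its own is enough.
    return (partition_ok and nothing_running) or local_gtid_forced
```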
>>
>> Check whether a given node is the PC:
>> * it has the highest GTID in the cluster; if more than one node has the 
>> "highest" GTID, CRC32 is used to choose the PC
>> * its GTID is manually set to INF
>> * if the node with the highest GTID does not come back after a cluster reboot 
>> (for example, due to a disk failure), the administrator should set the GTID to 
>> INF on another node
>>
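The PC selection rules above can be sketched as a single `max()` over the reported GTIDs. This is an illustration under stated assumptions, not the RA's code: `choose_pc` and the {node: gtid} input are hypothetical, and the proposal does not say what CRC32 is computed over, so this sketch assumes the CRC32 of the node name, with the highest value winning a tie.

```python
import math
import zlib


def _seqno(gtid):
    """Return the commit id of 'uuid:seqno' as a number (INF -> math.inf)."""
    raw = gtid.rsplit(":", 1)[1]
    return math.inf if raw == "INF" else int(raw)


def choose_pc(gtids):
    """Pick the Primary Component from a {node_name: gtid} mapping.

    A node forced to INF outranks every real commit id; otherwise the
    highest commit id wins. Ties use zlib.crc32 of the node name
    (an assumption about how the proposal applies CRC32).
    """
    if not gtids:
        raise ValueError("no GTIDs reported")

    def rank(node):
        return (_seqno(gtids[node]), zlib.crc32(node.encode()))

    return max(gtids, key=rank)
```

Because the tie-break depends only on data every node can see, each node evaluating this independently should reach the same answer.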
>> I have an almost-ready RA: http://zynzel.spof.pl/mysql-wss
>>
>> Tested with vanilla CentOS galera/pacemaker/corosync - OK
>> Tested with Fuel 4.1 - Fail
>>
>>
>> Fuel 4.1 will not deploy correctly with that RA, because the RA uses 
>> crm_attribute to store the GTID, while the manifests use cs_shadow/cs_commit 
>> for every pacemaker resource.
>> This leads to a cs_commit failure, because the configuration in the shadow copy 
>> differs from the running configuration (the running config was changed by the RA):
>> "Could not commit shadow instance [..] to the CIB: Application of an update 
>> diff failed"
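The failure described above is a classic stale-snapshot conflict. The toy model below illustrates that race only; it is not how Pacemaker versions or diffs the CIB internally. `ToyCib` and `StaleShadow` are hypothetical names: cs_shadow is modelled as a snapshot that remembers its base version, the RA's crm_attribute call as a direct write that bumps the version, and cs_commit as a commit that refuses a stale base.

```python
class StaleShadow(Exception):
    """Raised when the live config moved on after the shadow copy was taken."""


class ToyCib:
    def __init__(self):
        self.version = 0
        self.config = {}

    def update(self, key, value):
        # A direct write, like the RA's periodic crm_attribute call, bumps the version.
        self.config[key] = value
        self.version += 1

    def shadow(self):
        # cs_shadow: copy the config and remember which version it was taken from.
        return {"base_version": self.version, "config": dict(self.config)}

    def commit(self, shadow):
        # cs_commit: a stale shadow would silently drop the RA's write, so refuse it.
        if shadow["base_version"] != self.version:
            raise StaleShadow("Application of an update diff failed")
        self.config = dict(shadow["config"])
        self.version += 1
```

In this model, if no direct write happens between taking the shadow and committing it, the commit succeeds; with the RA updating the GTID every 60s, that window is regularly hit during a long deployment.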
>>
>> To solve this, we can go one of 2 ways:
>> 1) don't use cs_commit/cs_shadow in the manifests
>> 2) store the GTID some other way than crm_attribute
>>
>> IMHO 2) is better (less invasive), and we can store the GTID in the corosync 
>> CMAP (http://www.polarhome.com/service/man/generic.php?qf=corosync-cmapctl), but 
>> this requires corosync 2.x
>>
>>
>> --
>> Mailing list: https://launchpad.net/~fuel-dev
>> Post to     : fuel-...@lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~fuel-dev
>> More help   : https://help.launchpad.net/ListHelp
>>
>>
>>
>> -- 
>> Yours Faithfully,
>> Vladimir Kuklin,
>> Fuel Library Tech Lead,
>> Mirantis, Inc.
>> +7 (495) 640-49-04
>> +7 (926) 702-39-68
>> Skype kuklinvv
>> 45bk3, Vorontsovskaya Str.
>> Moscow, Russia,
>> www.mirantis.com
>> www.mirantis.ru
>> vkuk...@mirantis.com
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
