Re: [openstack-dev] [Fuel] HA cluster disk monitoring, failover and recovery

Andrew Beekhof Tue, 17 Nov 2015 13:51:02 -0800

> On 18 Nov 2015, at 4:52 AM, Alex Schultz <[email protected]> wrote:
> 
> On Tue, Nov 17, 2015 at 11:12 AM, Vladimir Kuklin <[email protected]> 
> wrote:
>> Bogdan
>> 
>> I think we should first check whether deleting the attribute leads to the
>> node starting its services. From what I read in the official Pacemaker
>> documentation, it should work out of the box without the need to restart the
>> node.
> 
> It does start up the services when the attribute is cleared. QA has a
> test to validate this as part of this change.
> 
>> By the way, the quote above says 'use ONE of the following methods',
>> meaning that we could actually use attribute deletion. The 2nd and 3rd
>> options do the same thing: they clear the short-lived node attribute. So we
>> need to figure out why the OCF script does not update the corresponding
>> attribute by itself.
>> 
> 
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/SysInfo#L215-L227
> 
> It doesn't have something that updates it to green because essentially
> when this condition hits, the sysinfo service is also stopped. It has
> no way of knowing when it is cleared because all the resources are
> stopped and there is no longer a service running to reset the
> attribute.


There needs to be a way to mark cluster resources (specifically the sysinfo
one) as being immune to the “red” condition.
Alas, it hasn’t bubbled up the priority list yet. Complaining definitely helps
make it more visible :)

In the short-term, a cron job that called the agent would probably do the trick.
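
As a rough illustration of that stopgap, here is a minimal sketch of a script cron could run. It is only a sketch under stated assumptions: rather than invoking the SysInfo agent itself, it takes the simpler route of deleting the attribute with the crm command quoted from the SUSE docs further down this thread. The watched path list, the use of the short hostname as the node name, and the --apply flag are all hypothetical choices, not anything Fuel or Pacemaker ships.

```python
#!/usr/bin/env python3
"""Hypothetical cron helper: clear #health_disk once disk space has recovered."""
import shutil
import socket
import subprocess
import sys

MIN_FREE_MB = 512      # threshold discussed in this thread
WATCHED_PATHS = ["/"]  # assumption: the relevant mount points are site-specific


def free_mb(path):
    """Return free space on the filesystem holding `path`, in MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def disks_ok(free_by_path, min_free_mb=MIN_FREE_MB):
    """True only if at least one path is watched and all exceed the threshold."""
    return bool(free_by_path) and all(
        free > min_free_mb for free in free_by_path.values()
    )


def clear_command(node):
    """Build the attribute-deletion command quoted from the SUSE docs.

    Returning an argument list (no shell involved) means the leading '#'
    cannot be mistaken for the start of a shell comment.
    """
    return ["crm", "node", "status-attr", node, "delete", "#health_disk"]


def main(argv):
    free_by_path = {path: free_mb(path) for path in WATCHED_PATHS}
    if not disks_ok(free_by_path):
        return  # still low on space: leave the red status alone
    # Assumption: the cluster node name equals the short hostname.
    node = socket.gethostname().split(".")[0]
    cmd = clear_command(node)
    if "--apply" in argv:
        subprocess.run(cmd, check=False)
    else:
        print("would run:", " ".join(cmd))  # dry run by default


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run without arguments it only prints the command it would execute; with --apply (again, a made-up flag for this sketch) it actually clears the attribute. Whether clearing it unconditionally every few minutes is acceptable depends on what else sets #health_disk on your nodes.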

> We would need something outside of Pacemaker to mark it OK,
> or perhaps write a custom health strategy[0][1] that would not stop
> the sysinfo task, and update the OCF script to set the status to
> green if all disks are OK.
> 
> -Alex
> 
> [0] 
> https://github.com/openstack/fuel-library/blob/master/deployment/puppet/cluster/manifests/sysinfo.pp#L50-L55
> [1] http://clusterlabs.org/wiki/SystemHealth
> 
>> 
>> 
>> On Tue, Nov 17, 2015 at 7:03 PM, Bogdan Dobrelya <[email protected]>
>> wrote:
>>> 
>>> On 17.11.2015 15:28, Kyrylo Galanov wrote:
>>>> Hi Team,
>>> 
>>> Hello
>>> 
>>>> 
>>>> I have been testing fail-over when free disk space drops below 512 MB
>>>> (https://review.openstack.org/#/c/240951/).
>>>> The affected node is stopped correctly and services migrate to a healthy
>>>> node.
>>>> 
>>>> However, once free disk space is above 512 MB again, the node does
>>>> not recover its operating state. Moreover, starting the resources
>>>> manually tends to fail. In a nutshell, the pacemaker service / node
>>>> has to be restarted. Detailed information is available
>>>> here:
>>>> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_configuration_basics_monitor_health.html
>>>> 
>>>> How do we address this issue?
>>> 
>>> According to the docs you provided,
>>> " After a node's health status has turned to red, solve the issue that
>>> led to the problem. Then clear the red status to make the node eligible
>>> again for running resources. Log in to the cluster node and use one of
>>> the following methods:
>>> 
>>>    Execute the following command:
>>> 
>>>    crm node status-attr NODE delete #health_disk
>>> 
>>>    Restart OpenAIS on that node.
>>> 
>>>    Reboot the node.
>>> 
>>> The node will be returned to service and can run resources again. "
>>> 
>>> So this looks like expected behaviour!
>>> 
>>> What else could be done:
>>> - We should check whether this nuance is documented and, if not, submit a
>>> bug to the fuel-docs team.
>>> - Submitting a bug and inspecting logs would be nice to do as well.
>>> I believe some optimizations may be possible, bearing in mind the Pacemaker
>>> cluster-recheck-interval and failure-timeout story [0].
>>> 
>>> [0]
>>> http://blog.kennyrasschaert.be/blog/2013/12/18/pacemaker-high-failability/
>>> 
>>>> 
>>>> 
>>>> Best regards,
>>>> Kyrylo
>>>> 
>>>> 
>>>> 
>>>> __________________________________________________________________________
>>>> OpenStack Development Mailing List (not for usage questions)
>>>> Unsubscribe:
>>>> [email protected]?subject:unsubscribe
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>> 
>>> 
>>> 
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Irc #bogdando
>>> 
>> 
>> 
>> 
>> 
>> --
>> Yours Faithfully,
>> Vladimir Kuklin,
>> Fuel Library Tech Lead,
>> Mirantis, Inc.
>> +7 (495) 640-49-04
>> +7 (926) 702-39-68
>> Skype kuklinvv
>> 35bk3, Vorontsovskaya Str.
>> Moscow, Russia,
>> www.mirantis.com
>> www.mirantis.ru
>> [email protected]
>> 
>> 
> 


