On Fri, Nov 9, 2012 at 10:03 AM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Thu, Nov 8, 2012 at 5:24 PM, Gao,Yan <y...@suse.com> wrote:
>> Hi Andrew,
>>
>> On 11/08/12 13:09, Andrew Beekhof wrote:
>>> On Tue, Nov 6, 2012 at 10:30 PM, Gao,Yan <y...@suse.com> wrote:
>>>> Hi,
>>>>
>>>> Currently, we can manage VMs via the VM agents. But the services running
>>>> within VMs are not easy to monitor. If we could use
>>>> nagios/icinga probes from the host to the guest, that would allow us to
>>>> achieve this.
>>>>
>>>> Lars, Dejan and I have been discussing this for some time. There have
>>>> been quite a few thoughts on how to implement it. Now we are inclined
>>>> towards a proposal from Lars. Please let me introduce the idea here, and
>>>> see what you think about it.
>>>>
>>>> First, we could add a resource agent class. The RAs belonging to this
>>>> class wrap around nagios/icinga probes. They can be configured as
>>>> special monitor operations for the VMs. The behavior should be like:
>>>>
>>>> 1. The special monitor operations start working after the VMs and the
>>>> services inside are started.
>>>>
>>>> 2. Any failure of the monitor operations is treated as a failure of
>>>> the VM, which triggers the recovery of the VM.
>>>>
>>>> Let me show an example:
>>>>
>>>> primitive db-vm ocf:heartbeat:VirtualDomain \
>>>>     params config="db-vm" hypervisor="xen:///" \
>>>>         ip="192.168.1.122" \
>>>>     op monitor nagios:ftp interval="30s" params user="test"
>>>>
>>>> The "nagios:ftp" specifies which monitor agent is used to monitor the
>>>> VM. It's an optional attribute group expressing "class/provider/type"
>>>> of the monitor agent, which defaults to "ocf:heartbeat:VirtualDomain"
>>>> for this VM (if so, the monitor would be a normal one like we usually
>>>> configure). We can add more monitors like the "nagios:www" type and so on.
>>>
>>> What do you propose the XML should look like?
>> Should be like:
>> ...
>> <op id="vm-monitor-30" name="monitor" class="nagios" type="ftp"
>>     interval="30s" ignore-first-failures="true">
>>   <instance_attributes id="vm-monitor-30-params">
>>     <nvpair id="vm-monitor-30-params-user" name="user" value="test"/>
>>   </instance_attributes>
>> </op>
>> ...
>>
>>>
>>>> We can specify particular "params" for a monitor. And the "ip" is
>>>> actually not a useful parameter for the VirtualDomain; we put it there
>>>> for its monitor operations to inherit, so that we don't have to specify
>>>> it for each monitor separately.
>>>
>>> You plan to add 'ip' to the VirtualDomain metadata?
>> It should be in the metadata of nagios:ftp and also other monitor
>> agents. We'd like parameter inheritance to avoid configuration repetition.
>
> That sounds overly complex (you now need to do two metadata lookups to
> determine the parameter lists).

Actually more - assuming a VM can contain multiple services, each one
being checked by a nagios script.

> I think I'd prefer to avoid that if
> possible.
>
>>
>>>
>>>>
>>>>
>>>> Other issues:
>>>> - As we can see, there's a time window after the VM is started but
>>>> before the monitored service has started. A solution is
>>>> adding a "first-failure" flag for the monitor operation, which could
>>>> allow us to ignore the *first* failures of a monitor until it has
>>>> returned healthy once, unless the time is out. Ideally, it could be
>>>> handled in LRM.
>>>
>>> What happens if there is never a first success?
>>> The cluster will never find out.
>> It'll reach the timeout and return.
>
> Which timeout? Not the one in <op...> since the whole operation might
> repeat many times over before succeeding.
>
>> We should give a reasonable monitor
>> timeout I think.
>>
>>>
>>>>
>>>> - A limitation is we would have to specify different monitor interval
>>>> values for the services within a VM. Probably we could fix it in some
>>>> way eventually.
>>>>
>>>>
>>>> Anyway, this is the most straightforward solution we can think of so far
>>>> (Please correct me if I'm missing anything). It's open for discussion.
>>>> Any comments and suggestions are welcome and appreciated.
>>>
>>> Doesn't look too bad. Some finer points to discuss but I'm sure we
>>> can reach agreement.
>> Nice, thanks!
>>
>> Regards,
>> Gao,Yan
>> --
>> Gao,Yan <y...@suse.com>
>> Software Engineer
>> China Server Team, SUSE.
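
To make the "which timeout?" question above concrete, here is a minimal
sketch of how the proposed "first-failure" behaviour could work. It is
not taken from LRM or Pacemaker. It assumes a separate startup grace
period (the hypothetical grace_seconds, counted from when the VM start
completed) rather than the per-operation timeout in <op...>, which
applies to each individual run. The class name, field names and the
"ok"/"ignored"/"failed" verdicts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StartupGrace:
    """Hypothetical filter for monitor results during VM startup."""

    grace_seconds: float        # assumed startup deadline, NOT the <op> timeout
    started_at: float           # time at which the VM start completed
    seen_success: bool = False  # has the probe ever returned healthy?

    def report(self, healthy: bool, now: float) -> str:
        """Decide what a single probe result would mean for the cluster."""
        if healthy:
            # After the first success, later failures are real failures.
            self.seen_success = True
            return "ok"
        in_grace = (now - self.started_at) < self.grace_seconds
        if not self.seen_success and in_grace:
            # Service inside the VM may still be starting: swallow the failure.
            return "ignored"
        # Either the service was healthy before, or the deadline has passed.
        return "failed"


grace = StartupGrace(grace_seconds=120.0, started_at=0.0)
print(grace.report(False, now=30.0))  # ignored: still starting up
print(grace.report(True, now=60.0))   # ok: first success
print(grace.report(False, now=90.0))  # failed: grace no longer applies
```

With a deadline like this, a service that never comes up is reported
as failed once grace_seconds has elapsed, so the cluster does find out
even if there is never a first success.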
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org