Hi,

In our production setups we have seen some crashes of the KVM agent. This could happen for all kinds of reasons, but that's not what I wanted to discuss.

Also see this issue: https://issues.apache.org/jira/browse/CLOUDSTACK-3954

For a PoC in our company I've been writing a small helper in Python which listens on port 8251.

The Investigator can query this webservice (attached) which will simply tell it which VMs are running on that host.

It's online here: http://stack01.ceph.widodh.nl:8251/

You can also do a query like this: http://stack01.ceph.widodh.nl:8251/ping/i-2-6570-VM
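
For illustration, a client on the Investigator side could consume the two endpoints roughly like this. It is only a sketch: the function names are made up for the example, and it assumes the helper answers with well-formed JSON of the form {"instances": [...]} and {"alive": true}.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def parse_instances(body):
    """Return the list of instance names from a '/' response body."""
    return json.loads(body).get('instances', [])


def parse_alive(body):
    """Return True only if a '/ping/<name>' response body reports alive."""
    return json.loads(body).get('alive') is True


def fetch(url, timeout=5):
    """Fetch a URL and return the decoded response body."""
    response = urlopen(url, timeout=timeout)
    try:
        return response.read().decode('utf-8')
    finally:
        response.close()
```

A call such as parse_alive(fetch('http://stack01.ceph.widodh.nl:8251/ping/i-2-6570-VM')) would then give a plain boolean for that instance.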

This way we can more reliably verify whether a specific VM is still running if the Agent stops responding for some reason. An ICMP echo request isn't a safe check, since Security Groups could prevent ICMP from getting through.
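
One way to read the answer on the Investigator side is as three outcomes rather than two: the VM is running, it is not running, or the helper itself could not be reached, in which case nothing can be concluded. A rough sketch, assuming a /ping/ reply of the form {"alive": true}; the names vm_state and get_body are hypothetical, and get_body stands for whatever HTTP client the Investigator would use:

```python
import json


def vm_state(get_body, host, vm_name, port=8251):
    """Return 'running', 'stopped' or 'unknown' for vm_name on host.

    get_body is any callable that takes a URL and returns the response
    body as a string, raising an exception on failure.
    """
    url = 'http://%s:%d/ping/%s' % (host, port, vm_name)
    try:
        reply = json.loads(get_body(url))
    except Exception:
        # Helper unreachable or reply not parseable: draw no conclusion.
        return 'unknown'
    if not isinstance(reply, dict) or 'alive' not in reply:
        return 'unknown'
    return 'running' if reply['alive'] is True else 'stopped'
```

Treating an unreachable helper as "unknown" rather than "stopped" keeps a second failed daemon from being mistaken for a dead VM.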

I'd rather not have the management server query libvirt directly, since that would open a potential security hole. This webservice is read-only and on my production setups I have libvirt listening on the private bridge only.

What do you think?

Wido
#!/usr/bin/python

'''
    This is a helper for the CloudStack Agent for High Availability checks

    It will run on port 8251 and it can tell the Management server which
    instances are running and if a particular instance is still running here.

    This is for cases where the main Agent crashes or becomes unresponsive and
    the HA Investigators start doing their work

    It provides an alternative way to see which Instances are still running on this host
'''
from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer
import json
import libvirt

tcp_port = 8251

class RequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        content_type = 'application/json'

        conn = libvirt.openReadOnly('qemu:///system')
        if conn is None:
            self.send_response(503)
            self.end_headers()
            return

        virtdomains = map(conn.lookupByID, conn.listDomainsID())
        domains = []
        for domain in virtdomains:
            domains.append(domain.name())

        if self.path == '/':
            self.send_response(200)
            self.send_header('Content-type', content_type)
            self.end_headers()
            self.wfile.write('{ "instances": ' + json.dumps(domains) + '}')

        elif self.path.startswith('/ping/'):
            if self.path.count('/') == 2:
                alive, domain = self.path.lstrip('/').split('/')
                self.send_response(200)
                self.send_header('Content-type', content_type)
                self.end_headers()

                # Membership test replaces the manual search loop
                running = domain in domains

                # json.dumps emits valid JSON: {"alive": true} or {"alive": false}
                self.wfile.write(json.dumps({'alive': running}))
            else:
                self.send_response(405)
                # Terminate the header block so the client gets a complete response
                self.end_headers()

        else:
            self.send_response(405)
            # Terminate the header block so the client gets a complete response
            self.end_headers()

        conn.close()

try:
    server = HTTPServer(('', tcp_port), RequestHandler)
    print 'Started httpserver on port ' , tcp_port

    server.serve_forever()

except (KeyboardInterrupt, SystemExit):
    print '^C received, shutting down the web server'
    server.socket.close()
