That is an excellent point. I do think that catching this in functional testing at the gate would be a great idea.

On 07/30/2013 08:55 PM, Sean Dague wrote:
I would definitely encourage you to think about how we could apply a tool like this in the OpenStack gate itself as you go through the process of opening it up. If we could catch those kinds of corruptions before the commits land, we would move the cost of finding those problems way down.

It obviously won't be able to do the scale you guys are doing, but I'd bet a large number of these corruptions are findable in the gate.

On 07/30/2013 10:48 PM, Jacob Bushman wrote:
I haven't opened it because currently it is too tied to our proprietary
platform.  I have actually submitted a talk for the summit and planned
on having an open version ready for this.

It is good to hear that I am not the only one out there dealing with
these sorts of issues and trying to find solutions.

On 07/30/2013 05:37 PM, Joshua Harlow wrote:
I would love that tool, is it opened??

I've thought about such a tool myself actually. Something that keeps
enough info on the compute node to be able to analyze the actual state of
the cluster and find discrepancies with what the various OpenStack DBs
believe is the 'state' of the cluster.

Seems like a great analysis tool. What corrective actions does it do (if
any?), aka, DB says X instances, really Y, then?? (delete them??)

On 7/30/13 11:59 AM, "Jacob Bushman" <[email protected]> wrote:

In our deployment we have a custom solution for the orchestration of
Openstack through the API that connects with billing and other external
systems on the back end.

We have found that most of the corruption is introduced by messaging
issues in OpenStack. There are myriad edge cases where the status in
the database can become out of sync with what is actually running on
a compute node, for instance.

The basic concept of the auditing tools is to compare the information in
the database with the actual state of the compute node and identify
discrepancies.
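
A minimal sketch of that comparison, assuming each side has already been
reduced to a mapping of instance ID to state. The names `Discrepancy` and
`audit`, the state strings, and the three discrepancy kinds are hypothetical
illustrations, not the actual tool described in this thread:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Discrepancy:
    instance_id: str
    kind: str  # "missing_on_node", "unknown_to_db", or "state_mismatch"
    db_state: Optional[str]
    node_state: Optional[str]


def audit(db_view: Dict[str, str], node_view: Dict[str, str]) -> List[Discrepancy]:
    """Compare what the database believes with what a compute node reports."""
    found = []
    # Walk the union of IDs so instances known to only one side are caught.
    for instance_id in sorted(set(db_view) | set(node_view)):
        db_state = db_view.get(instance_id)
        node_state = node_view.get(instance_id)
        if node_state is None:
            kind = "missing_on_node"   # DB has a record, node has nothing
        elif db_state is None:
            kind = "unknown_to_db"     # node runs something the DB lost
        elif db_state != node_state:
            kind = "state_mismatch"    # both know it, but disagree
        else:
            continue                   # in sync, nothing to report
        found.append(Discrepancy(instance_id, kind, db_state, node_state))
    return found


if __name__ == "__main__":
    db = {"a": "active", "b": "active", "c": "stopped"}
    node = {"a": "active", "c": "active", "d": "active"}
    for item in audit(db, node):
        print(item)
```

The sketch only reports discrepancies; whether and how to correct them is
left out on purpose, since that is a policy decision.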

This is accomplished by parsing the instance XML and the external IDs of
the tap device, and by gathering other relevant data from the compute
node. That data is then passed through an API to our orchestration
system, which uses a combination of OpenStack API calls and DB queries
to audit the compute nodes and make sure the database and the compute
nodes are in sync.
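
As a rough illustration of the node-side gathering step, the sketch below
pulls the instance UUID and tap device names out of a libvirt-style domain
XML document using only the Python standard library. The sample XML, the
function name `summarize_domain`, and the returned dictionary layout are
assumptions made for this example; the tap device external IDs mentioned
above are not covered here:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down domain XML for illustration only.
SAMPLE_XML = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <uuid>11111111-2222-3333-4444-555555555555</uuid>
  <devices>
    <interface type='bridge'>
      <mac address='fa:16:3e:00:00:01'/>
      <target dev='tap0a1b2c3d-4e'/>
    </interface>
  </devices>
</domain>
"""


def summarize_domain(xml_text):
    """Extract the fields an audit would compare against the database."""
    root = ET.fromstring(xml_text)
    interfaces = []
    for iface in root.findall("./devices/interface"):
        mac = iface.find("mac")
        target = iface.find("target")
        interfaces.append({
            # Either element may be absent, so guard before reading attributes.
            "mac": mac.get("address") if mac is not None else None,
            "tap": target.get("dev") if target is not None else None,
        })
    return {
        "name": root.findtext("name"),
        "uuid": root.findtext("uuid"),
        "interfaces": interfaces,
    }


if __name__ == "__main__":
    print(summarize_domain(SAMPLE_XML))
```

A summary like this could then be sent to whatever system holds the
database view for comparison.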

On 07/30/2013 11:17 AM, Joshua Harlow wrote:
Do you have a writeup of the corruption issues you have seen?

I would most definitely appreciate said tools.

Any little overview of what they do/are??

On 7/30/13 9:44 AM, "Jacob Bushman" <[email protected]> wrote:

I have been working with various corruption issues within OpenStack:
issues like failed or partial provisions, Quantum port / IP corruption,
and database corruption. There are several edge cases that I have run
into where the existing periodic tasks to clean up corruption were
inadequate for our use case.

We really needed a more unified way to query through the entire stack.
To handle this on the scale that I am working with I have developed
out-of-band auditing tools.

I feel something like this belongs in Openstack and would be useful to
the community.  I am wondering what other tools are available and if
this is something that is of interest.

~ Jacob

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
