We upgraded our cloud from Essex -> Folsom yesterday and had a major loss of 
data that I thought I'd share.

In Essex the flag remove_unused_base_images defaulted to False; in Folsom the 
default was changed to True. We hadn't set it explicitly, so we picked up the 
new default.

After the upgrade, which went relatively smoothly (a lot easier than Diablo -> 
Essex), almost all our base images were deleted by the image cache cleanup.
I can't explain how this happened. We lost about 70 images in total, which 
affected ~200 running instances.

We have since disabled this flag until we can find out what went wrong. I can 
see this in the logs: with the flag enabled it would still delete a lot of 
in-use base files.

We have an NFS-mounted /var/lib/nova/instances directory, which is where the 
_base dir is located, so I'm wondering if this had something to do with it.
Is the image cache cleanup meant to work in a shared instance storage 
environment?
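
To make the question concrete, here is a minimal sketch of how a per-host 
cleanup could go wrong on shared storage. This is not nova's code, and whether 
nova actually behaves this way is exactly what I'm asking; the function name, 
image names and instance records are all made up for illustration. The 
assumption is that each compute node only knows about its own instances but 
sees the whole shared _base directory.

```python
def unused_base_files(base_files, known_instances):
    """Return base files not referenced by any instance in known_instances."""
    in_use = {inst["base"] for inst in known_instances}
    return sorted(set(base_files) - in_use)


# Hypothetical shared _base directory, seen identically by every compute node.
shared_base = ["img-a", "img-b", "img-c"]

# Hypothetical per-host views of running instances.
host1_instances = [{"name": "vm1", "base": "img-a"}]
host2_instances = [{"name": "vm2", "base": "img-b"}]

# host1 alone considers img-b unused, although vm2 on host2 depends on it.
print(unused_base_files(shared_base, host1_instances))

# Only a view across all hosts gives the safe answer.
print(unused_base_files(shared_base, host1_instances + host2_instances))
```

If the cleanup works anything like the first call, every node sharing the 
mount would remove base files that other nodes still need.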


We also came across an issue where some compute nodes were reporting bogus 
resource stats, e.g.:

2012-11-13 05:04:38 INFO nova.compute.manager [-] Updating host status
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free ram (MB): -739665
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 12654
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free VCPUS: -188
2012-11-13 05:06:14 INFO nova.compute.resource_tracker [-] Compute_service record updated for np-rcc6

This turned out to be covered by the following bug: the DB filter does a regex 
match on the hostname.
https://bugs.launchpad.net/nova/+bug/1060363

So a compute node named np-rcc5 would also pull in np-rcc50, np-rcc51, and so 
on.
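
A small sketch of the effect, using Python's re module rather than the actual 
DB query (the host list and helper names are just for illustration): an 
unanchored regex match on the hostname behaves like a substring match, while 
an exact comparison returns only the intended node.

```python
import re

# Example hostnames following the naming pattern above.
hosts = ["np-rcc5", "np-rcc50", "np-rcc51", "np-rcc6"]


def regex_filter(pattern, names):
    """Unanchored regex filter: keeps every name that contains the pattern."""
    return [name for name in names if re.search(pattern, name)]


def exact_filter(wanted, names):
    """Exact filter: keeps only the name that equals the wanted host."""
    return [name for name in names if name == wanted]


print(regex_filter("np-rcc5", hosts))  # ['np-rcc5', 'np-rcc50', 'np-rcc51']
print(exact_filter("np-rcc5", hosts))  # ['np-rcc5']
```

With the regex behaviour, np-rcc5 gets charged for the instances of np-rcc50, 
np-rcc51 and friends as well, which would explain negative free RAM and VCPUS.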


All in all, apart from our huge data loss, the upgrade went pretty well. 

The main issues we have now are usability issues with the dashboard:
- Pagination doesn't work.
- The green notification boxes that appear top right get in the way of the 
  links behind them.
- The new containers view is confusing, and you can no longer see how much data 
  is in a specific container like you used to.
- The launch instance box sometimes gets the bottom cut off, making it useless.
- The same happens in the launch instance box if you have lots of security 
  groups.

I should also add that we have moved to using nova cells. This went pretty 
smoothly, and we're eagerly waiting for the cells code to hit trunk so we can 
contribute our enhancements to cells.


Cheers,
Sam





_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp