Attachments are stripped. Can you paste the log somewhere (say, at http://apaste.info/) and send the link?
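If it helps, here is a minimal sketch in Python of one way to cut a focused snippet out of management-server.log before pasting it. It is not part of CloudStack; the function names and the context sizes are assumptions made for illustration, and the search string is just the exception name quoted in the message below.

```python
from collections import deque
from typing import Iterable, Iterator


def extract_context(lines: Iterable[str], needle: str,
                    before: int = 5, after: int = 5) -> Iterator[str]:
    """Yield every line containing `needle`, plus up to `before` lines
    preceding it and `after` lines following it, without duplicates."""
    recent = deque(maxlen=before)  # rolling buffer of lines not yet emitted
    remaining = 0                  # trailing-context lines still to emit
    for line in lines:
        if needle in line:
            # Emit the buffered leading context, then the match itself.
            yield from recent
            recent.clear()
            yield line
            remaining = after
        elif remaining > 0:
            yield line
            remaining -= 1
        else:
            recent.append(line)


def snippet_from_file(path: str, needle: str,
                      before: int = 5, after: int = 5) -> str:
    """Read the log at `path` (hypothetical helper) and return the
    matching lines with their context as one string, ready to paste."""
    with open(path, "r", errors="replace") as handle:
        return "".join(extract_context(handle, needle, before, after))
```

For example, calling snippet_from_file() on the management server log with needle "InsufficientServerCapacityException" would return only the lines around each occurrence of that exception. The log path has to be supplied by the caller, since it depends on the installation.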
From: Matt Foley <mfo...@hortonworks.com>
Reply-To: "users@cloudstack.apache.org" <users@cloudstack.apache.org>
Date: Monday, September 16, 2013 4:58 PM
To: "users@cloudstack.apache.org" <users@cloudstack.apache.org>
Subject: Help! After network outage, can't start System VMs; focused debug info attached

We had a planned network outage this weekend, which inadvertently made the NFS Shared Primary Storage (used by System VMs) unavailable for a day and a half. (Guest VMs use local storage only, but System VMs use shared storage only.) CloudStack was not brought down prior to the outage.

After the network came back, we gracefully brought down all services, including cloudstack-management, mysql, and NFS, then rebooted all servers in the cluster and the NFS server (to make sure there were no stale file handles), then brought the services up in the appropriate order. We also checked MySQL for table corruption and found none. We confirmed that the NFS volumes are mountable from all hosts, and in fact Shared Primary Storage is being mounted by CloudStack on the hosts as usual, under /mnt/<uuid>.

Nevertheless, when we try to bring up the cluster, the system VMs fail to start, with errors "InsufficientServerCapacityException: Unable to create a deployment for VM". The cause is not really insufficient capacity, as actual resource usage is tiny; these error messages are misleading explanations of the failure to create the primary storage volume for the System VMs.

Digging into management-server.log, the core issue seems to be in the ~160 line snippet from the log attached to this message as cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool is pool 201, named "cs-primary". It is mounted on all hosts as /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, where the directory name is the pool's uuid.
The log shows the management server correctly identifying a particular host as being able to access pool 201, then trying to allocate a primary storage volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465. It fails, but I cannot tell why. I suspect its claim that "Template 3 has already been downloaded to pool 201" is false, but I don't know how to check this (or how to fix it if it is wrong).

Any guidance for further debugging or fixing this would be GREATLY appreciated.

Thanks,
--Matt