Re: Help! After network outage, can't start System VMs; focused debug info attached

Chiradeep Vittal Mon, 16 Sep 2013 17:20:14 -0700

Attachments are stripped by the list. Can you paste the log somewhere (say, at 
http://apaste.info/)?

From: Matt Foley <mfo...@hortonworks.com>
Reply-To: users@cloudstack.apache.org
Date: Monday, September 16, 2013 4:58 PM
To: users@cloudstack.apache.org
Subject: Help! After network outage, can't start System VMs; focused debug info 
attached

We had a planned network outage this weekend, which inadvertently made the NFS 
Shared Primary Storage (used by System VMs) unavailable for a day and a half.  
(Guest VMs use local storage only, but System VMs use shared storage only.)  
CloudStack was not brought down prior to the outage.

After the network came back, we gracefully brought down all services, including 
cloudstack-management, mysql, and NFS, then rebooted all servers in the cluster 
and the NFS server (to make sure there were no stale file handles), then 
brought the services back up in the appropriate order.  We also checked mysql 
for table corruption and found none.  We confirmed that the NFS volumes are 
mountable from all hosts, and in fact Shared Primary Storage is being mounted 
by CloudStack on the hosts as usual, under /mnt/<uuid>.
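
The per-host mount check can be sketched roughly as follows.  This is only an 
illustration, not the exact procedure that was run: the function name 
find_nfs_mount is made up, and it assumes a Linux host where /proc/mounts 
lists one mount per line as "device mountpoint fstype options ...".

```python
def find_nfs_mount(mounts_text, mount_point):
    """Return the mount entry for mount_point if it is an NFS mount, else None.

    mounts_text is the content of /proc/mounts (assumed Linux format:
    whitespace-separated device, mount point, filesystem type, options, ...).
    """
    wanted = mount_point.rstrip("/")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, path, fstype = fields[0], fields[1], fields[2]
        # fstype is "nfs" or "nfs4" for NFS mounts
        if path == wanted and fstype.startswith("nfs"):
            return {"device": device, "path": path, "fstype": fstype}
    return None


def check_host(mount_point, mounts_file="/proc/mounts"):
    """Read the local mount table and report whether mount_point is on NFS."""
    with open(mounts_file) as f:
        return find_nfs_mount(f.read(), mount_point)
```

Calling check_host("/mnt/<uuid>") on a host, with <uuid> replaced by the pool's 
uuid, returns the matching entry or None.  It only shows that the mount exists, 
not that the files on it are intact.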

Nevertheless, when we try to bring up the cluster, the System VMs fail to 
start, with errors "InsufficientServerCapacityException: Unable to create a 
deployment for VM".  The cause is not really insufficient capacity, as actual 
resource usage is tiny; these error messages misreport what is really a 
failure to create the primary storage volume for the System VMs.

Digging into management-server.log, the core issue seems to be captured in the 
~160-line snippet attached to this message as 
cloudstack_debug_2013.09.16.log.  The only Shared Primary Storage pool is pool 
201, named "cs-primary".  It is mounted on all hosts as 
/mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log shows 
the management server correctly identifying a particular host as being able to 
access pool 201, then trying to allocate a primary storage volume using the 
template with uuid f23a16e7-b628-429e-83e1-698935588465.  It fails, but I 
cannot tell why.  I suspect its claim that "Template 3 has already been 
downloaded to pool 201" is false, but I don't know how to check this (or how 
to fix it if it is wrong).

Any guidance for further debugging or fixing this would be GREATLY appreciated.
Thanks,
--Matt

