Re: outage feedback and questions

Laurent Steff Tue, 09 Jul 2013 16:09:26 -0700

Hi Dean,

Thanks for your answer.


Yes, the network troubles led to issues with the main storage
on the clusters (iSCSI).

So is it a fact that if the main storage is lost on KVM, the VMs are stopped
and their domains destroyed?

It was a hypothesis, as I found traces of it in

apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java

which "kills -9 qemu processes" if the main storage is not found, but I was
not sure when the function is called.

It's in the function checkingMountPoint, which calls destroyVMs if the mount
point is not found.
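
To make the suspected behavior concrete, here is a minimal sketch of the logic
as I read it. It is written in Python for brevity and is not the actual
KVMHABase code: the function name check_mount_point, its parameters, and the
example path are hypothetical, and the "destroy" step is a callback so that
nothing is really killed.

```python
import os
from typing import Callable, List


def check_mount_point(
    mount_point: str,
    destroy_vms: Callable[[str], None],
    is_mounted: Callable[[str], bool] = os.path.ismount,
) -> bool:
    """Return True if the storage mount point is present.

    If it is missing, call destroy_vms(mount_point) and return False.
    This mirrors the described behavior only; it is an assumption about
    the flow, not a copy of the CloudStack implementation.
    """
    if is_mounted(mount_point):
        return True
    destroy_vms(mount_point)
    return False


# Dry run with stand-ins: the callback only records that it was invoked.
destroyed: List[str] = []
storage_ok = check_mount_point(
    "/mnt/primary-example",         # hypothetical mount point
    destroy_vms=destroyed.append,   # stand-in for killing the qemu processes
    is_mounted=lambda path: False,  # simulate lost primary storage
)
print(storage_ok, destroyed)
```

If the real code follows this shape, a storage mount that disappears during a
network outage would be enough to take down every VM on the host.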

Regards,

----- Original Message -----
> From: "Dean Kamali" <[email protected]>
> To: [email protected]
> Sent: Monday, 8 July 2013 16:34:04
> Subject: Re: outage feedback and questions
> 
> Surviving VMs are on the same KVM/GFS2 cluster.
> The SSVM is one of them. Messages on the console indicate it was
> temporarily in read-only mode
> 
> Do you have an issue with storage?
> 
> I wouldn't expect a switch failure to cause all of this; it will
> cause loss of network connectivity, but it shouldn't cause your VMs
> to go down.
> 
> This behavior usually happens when you lose your primary storage.
> 
> 
> 
> 
> On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
> <[email protected]> wrote:
> 
> > Hello,
> >
> > CloudStack is used in our company as a core component of a
> > "Continuous Integration" service.
> >
> > We are mostly happy with it, for many reasons too long to
> > describe here. :)
> >
> > We recently encountered a major service outage on CloudStack, mainly
> > linked to bad practices on our side, and the aims of this post are to:
> >
> > - ask questions about things we don't understand yet
> > - gather some practical best practices we missed
> > - if the problems detected are still present in CloudStack 4.x, help
> > make CloudStack more robust with our feedback
> >
> > We know that the 3.x version is not supported, and we plan to move to
> > the 4.x version ASAP.
> >
> > It's quite a long mail, and it may be misdirected (dev mailing
> > list? multiple bugs?)
> >
> > Any response is appreciated ;)
> >
> > Regards,
> >
> >
> > --------------------long part----------------------------------------
> >
> > Architecture :
> > --------------
> >
> > Old, non-Apache CloudStack 3.0.2 release
> > 1 Zone, 1 physical network, 1 pod
> > 1 Virtual Router VM, 1 SSVM
> > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> > Management Server on a VMware virtual machine
> >
> >
> >
> > Incidents :
> > -----------
> >
> > Day 1 : Management Server DoSed by internal synchronization scripts
> > (LDAP to CloudStack)
> > Day 3 : DoS corrected, Management Server RAM and CPU upgraded, and
> > server rebooted (it had not been rebooted in more than a year).
> > CloudStack is running normally again (VM creation/stop/start/console/...)
> > Day 4 : (weekend) Network outage on the core datacenter switch.
> > Network unstable for 2 days.
> >
> > Symptoms :
> > ----------
> >
> > Day 7 : The network is operational, but most of the VMs (250 of
> > 300) have been down since Day 4.
> > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) erased.
> >
> > The VirtualRouter VM was one of them. Filesystem corruption
> > prevented it from rebooting normally.
> >
> > Surviving VMs are on the same KVM/GFS2 cluster.
> > The SSVM is one of them. Messages on the console indicate it was
> > temporarily in read-only mode
> >
> > The hard way to recovery (actions):
> > -----------------------------------
> >
> > 1. VirtualRouter VM destroyed by an administrator, to let
> > CloudStack recreate it from the template.
> >
> > BUT :)
> >
> > the SystemVM KVM template is not available. Its status in the GUI is
> > "CONNECTION REFUSED".
> > The URL from which it was downloaded during install is no longer
> > valid (an old, now unavailable internal mirror server instead of
> > http://download.cloud.com)
> >
> > => we are unable to restart stopped VMs or create new ones
> >
> > 2. Manual download of the template on the Management Server, as in
> > a fresh install
> >
> > ---
> > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt \
> >   -m /mnt/secondary/ \
> >   -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 \
> >   -h kvm -F
> > ---
> >
> > This is not sufficient. The MySQL table template_host_ref does not
> > change, even after changing the URL in the MySQL tables.
> > We still have "CONNECTION REFUSED" as the template status in MySQL
> > and in the GUI
> >
> > 3. After analysis, we needed to alter the MySQL tables manually
> > (template_id of the systemVM KVM template was x) :
> >
> > ---
> > update template_host_ref set download_state='DOWNLOADED' where template_id=x;
> > update template_host_ref set job_id='NULL' where template_id=x; <= may be useless
> > update template_host_ref set job_id='NULL' where template_id=x; <= may be useless
> > ---
> >
> > 4. The status in the GUI is now DOWNLOADED, as in MySQL
> >
> > 5. On power-on of a stopped VM, CloudStack builds a new VirtualRouter
> > VM, and we can let users manually start their stopped VMs
> >
> >
> > Questions :
> > -----------
> >
> > 1. What stopped and destroyed the libvirt domains of our VMs? There
> > is some part of the code that could do this, but I'm not sure
> >
> > 2. Is it possible that CloudStack autonomously triggered the
> > re-download of the systemVM template? Or does it require human
> > interaction?
> >
> > 3. In 4.x, is the risk of a corrupted systemVM template, or one with
> > a bad status, still present? Is there any warning beyond a simple
> > "connection refused", which is not really visible as an alert?
> >
> > 4. Does CloudStack retry by default to restart VMs that should be
> > up, or do we need to configure this?
> >
> >
> > --------------------end of long part----------------------------------------
> >
> >
> > --
> > Laurent Steff
> >
> > DSI/SESI
> > http://www.inria.fr/
> >
> 

-- 
Laurent Steff

DSI/SESI
INRIA
Tél.  : +33 1 39 63 50 81
Port. : +33 6 87 66 77 85
http://www.inria.fr/
