Hi Dean,

Thanks for your answer.

Yes, the network troubles led to issues with the main (primary) storage on the clusters (iSCSI).

So is it a fact that if the primary storage is lost on KVM, the VMs are stopped and their domains destroyed?

That was my hypothesis, as I found traces in

apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java

which "kills -9 qemu processes" if the primary storage is not found, but I was not sure when that function was called.

It is in the function checkingMountPoint, which calls destroyVMs if the mount point is not found.

Regards,

----- Original Message -----
> From: "Dean Kamali" <dean.kam...@gmail.com>
> To: users@cloudstack.apache.org
> Sent: Monday, 8 July 2013 16:34:04
> Subject: Re: outage feedback and questions
>
> Survivors VMs are on the same KVM/GFS2 Cluster.
> SSVM is one of them. Messages on the console indicate it was
> temporarily in read-only mode
>
> Do you have an issue with storage?
>
> I wouldn't expect a switch failure to cause all of this; it will
> cause loss of network connectivity, but it shouldn't cause your VMs
> to go down.
>
> This behavior usually happens when you lose your primary storage.
>
> On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <laurent.st...@inria.fr> wrote:
>
> > Hello,
> >
> > CloudStack is used in our company as a core component of a
> > "Continuous Integration" service.
> >
> > We are mainly happy with it, for a lot of reasons too long to
> > describe. :)
> >
> > We recently encountered a major service outage on CloudStack, mainly
> > linked to bad practices on our side, and the aim of this post is to:
> >
> > - ask questions about things we don't understand yet
> > - gather some practical best practices we missed
> > - if the problems detected are still present in CloudStack 4.x, help
> >   make CloudStack more robust with our feedback
> >
> > We know that the 3.x version is not supported, and we plan to move to
> > a 4.x version ASAP.
> >
> > It's quite a long mail, and it may be badly directed (dev mailing
> > list? multiple bugs?)
> >
> > Any response is appreciated ;)
> >
> > Regards,
> >
> >
> > --------------------long part----------------------------------------
> >
> > Architecture:
> > -------------
> >
> > Old, non-Apache CloudStack 3.0.2 release
> > 1 zone, 1 physical network, 1 pod
> > 1 Virtual Router VM, 1 SSVM
> > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> > Management Server on a VMware virtual machine
> >
> >
> > Incidents:
> > ----------
> >
> > Day 1: Management Server DoSed by internal synchronization scripts
> > (LDAP to CloudStack)
> > Day 3: DoS corrected, Management Server RAM and CPU upgraded, and
> > server rebooted (it had not been rebooted in more than a year).
> > CloudStack is running normally again (VM creation/stop/start/console/...)
> > Day 4: (weekend) Network outage on the core datacenter switch.
> > Network unstable for 2 days.
> >
> > Symptoms:
> > ---------
> >
> > Day 7: The network is operational, but most of the VMs (250 of 300)
> > have been down since Day 4.
> > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) erased.
> >
> > The VirtualRouter VM was one of them. Filesystem corruption
> > prevented it from rebooting normally.
> >
> > The surviving VMs are on the same KVM/GFS2 cluster.
> > The SSVM is one of them. Messages on the console indicate it was
> > temporarily in read-only mode.
> >
> > Hard way to revival (actions):
> > ------------------------------
> >
> > 1. VirtualRouter VM destroyed by an administrator, to let CloudStack
> > recreate it from the template.
> >
> > BUT :)
> >
> > the SystemVM KVM template is not available. Its status in the GUI is
> > "CONNECTION REFUSED".
> > The URL from which it was downloaded during install is no longer
> > valid (an old and unavailable internal mirror server instead of
> > http://download.cloud.com)
> >
> > => we are unable to restart stopped VMs or create new ones
> >
> > 2. Manual download of the template on the Management Server, as in a
> > fresh install
> >
> > ---
> > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt -m /mnt/secondary/ -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 -h kvm -F
> > ---
> >
> > This is not sufficient. The MySQL table template_host_ref does not
> > change, even when changing the URL in the MySQL tables.
> > We still have "CONNECTION REFUSED" as the template status in MySQL
> > and in the GUI.
> >
> > 3. After analysis, we needed to alter the MySQL tables manually
> > (the template_id of the systemVM KVM template was x):
> >
> > ---
> > update template_host_ref set download_state='DOWNLOADED' where template_id=x;
> > update template_host_ref set job_id='NULL' where template_id=x;  <= may be useless
> > ---
> >
> > 4. The status in the GUI is now DOWNLOADED, as in MySQL
> >
> > 5. On powering on a stopped VM, CloudStack builds a new VirtualRouter
> > VM, and we can let users manually start their stopped VMs
> >
> >
> > Questions:
> > ----------
> >
> > 1. What stopped and destroyed the libvirt domains of our VMs?
> > There is some part of the code that could do this, but I'm not sure.
> >
> > 2. Is it possible that CloudStack autonomously triggered the
> > re-download of the systemVM template? Or does it require human
> > interaction?
> >
> > 3. In 4.x, is the risk of a corrupted systemVM template, or one with
> > a bad status, still present? Is there any warning beyond a simple
> > "connection refused", which is not really visible as an alert?
> >
> > 4. Does CloudStack retry by default to restart VMs that should be
> > up, or do we need configuration for this?
> >
> >
> > --------------------end of long part----------------------------------------
> >
> >
> > --
> > Laurent Steff
> >
> > DSI/SESI
> > http://www.inria.fr/
> >
>

--
Laurent Steff

DSI/SESI
INRIA
Tel.: +33 1 39 63 50 81
Mobile: +33 6 87 66 77 85
http://www.inria.fr/