Courtesy to geoff.higginbottom@shapeblue.com for answering this question first.
On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <dean.kam...@gmail.com> wrote:

> Well, I asked on the mailing list some time ago about CloudStack's
> behaviour when connectivity to primary storage is lost and the
> hypervisors start rebooting randomly.
>
> I believe this is very similar to what happened in your case.
>
> This is actually 'by design'. The logic is that if the storage goes
> offline, then all VMs must have also failed, and a 'forced' reboot of
> the host 'might' automatically fix things.
>
> This is great if you only have one Primary Storage, but typically you
> have more than one, so whilst the reboot might fix the failed storage,
> it will also kill off all the perfectly good VMs which were still
> happily running.
>
> The answer I got was for XenServer, not KVM; it involved removing the
> 'reboot -f' calls from a script.
>
> The fix for XenServer hosts is to:
>
> 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your hosts,
> commenting out the two entries which have "reboot -f"
>
> 2. Identify the PID of the script - pidof -x xenheartbeat.sh
>
> 3. Restart the script - kill <pid>
>
> 4. Force reconnect the host from the UI; the script will then
> re-launch on reconnect
>
>
> On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <laurent.st...@inria.fr> wrote:
>
>> Hi Dean,
>>
>> And thanks for your answer.
>>
>> Yes, the network troubles led to issues with the primary storage on
>> the clusters (iSCSI).
>>
>> So is it a fact that if the primary storage is lost on KVM, VMs are
>> stopped and their domains destroyed?
>>
>> It was a hypothesis, as I found traces in
>>
>> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
>>
>> which "kills -9 qemu processes" if the primary storage is not found,
>> but I was not sure when the function was called.
>>
>> It's in the function checkingMountPoint, which calls destroyVMs if
>> the mount point is not found.
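[Editor's note] Step 1 of the XenServer fix quoted above can be demonstrated with a sed one-liner. This is a sketch against a stand-in file: on a real host the target is /opt/xensource/bin/xenheartbeat.sh, and the sample body below is only an assumption about that script's contents, which vary by XenServer version.

```shell
# Stand-in for /opt/xensource/bin/xenheartbeat.sh (contents assumed)
HB=/tmp/xenheartbeat-demo.sh
cat > "$HB" <<'EOF'
#!/bin/sh
echo "heartbeat lost, fencing host"
reboot -f
EOF

# Step 1: comment out every line that invokes "reboot -f".
# Steps 2-3 on a real host would then be: kill "$(pidof -x xenheartbeat.sh)"
sed -i 's/^\([[:space:]]*\)reboot -f/\1# reboot -f/' "$HB"

# Show the neutralized line
grep -n '# reboot -f' "$HB"
```

After step 4 (force reconnect from the UI) the heartbeat script relaunches with the fencing reboot disabled.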
>>
>> Regards,
>>
>> ----- Original Message -----
>> > From: "Dean Kamali" <dean.kam...@gmail.com>
>> > To: users@cloudstack.apache.org
>> > Sent: Monday, July 8, 2013 16:34:04
>> > Subject: Re: outage feedback and questions
>> >
>> > Surviving VMs are all on the same KVM/GFS2 cluster.
>> > The SSVM is one of them. Messages on its console indicate it was
>> > temporarily in read-only mode.
>> >
>> > Do you have an issue with storage?
>> >
>> > I wouldn't expect a switch failure to cause all of this; it will
>> > cause loss of network connectivity, but it shouldn't cause your
>> > VMs to go down.
>> >
>> > This behaviour usually happens when you lose your primary storage.
>> >
>> >
>> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <laurent.st...@inria.fr> wrote:
>> >
>> > > Hello,
>> > >
>> > > CloudStack is used in our company as a core component of a
>> > > "Continuous Integration" service.
>> > >
>> > > We are mainly happy with it, for a lot of reasons too long to
>> > > describe. :)
>> > >
>> > > We recently encountered a major service outage on CloudStack,
>> > > mainly linked to bad practices on our side, and the aim of this
>> > > post is to:
>> > >
>> > > - ask questions about things we haven't understood yet
>> > > - gather some practical best practices we missed
>> > > - if the problems detected are still present in CloudStack 4.x,
>> > >   help make CloudStack more robust with our feedback
>> > >
>> > > We know that the 3.x version is not supported and plan to move
>> > > to 4.x ASAP.
>> > >
>> > > It's quite a long mail, and it may be badly directed (dev
>> > > mailing list? multiple bugs?)
>> > >
>> > > Any response is appreciated ;)
>> > >
>> > > Regards,
>> > >
>> > >
>> > > --------------------long part----------------------------------------
>> > >
>> > > Architecture :
>> > > --------------
>> > >
>> > > Old and non-Apache CloudStack 3.0.2 release
>> > > 1 zone, 1 physical network, 1 pod
>> > > 1 Virtual Router VM, 1 SSVM
>> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
>> > > Management Server on a VMware virtual machine
>> > >
>> > >
>> > > Incidents :
>> > > -----------
>> > >
>> > > Day 1 : Management Server DoSed by internal synchronization
>> > > scripts (LDAP to CloudStack)
>> > > Day 3 : DoS corrected, Management Server RAM and CPU upgraded,
>> > > and rebooted (it had never been rebooted in more than a year).
>> > > CloudStack is running normally again (VM
>> > > creation/stop/start/console/...)
>> > > Day 4 : (week-end) Network outage on the core datacenter switch.
>> > > Network unstable for 2 days.
>> > >
>> > > Symptoms :
>> > > ----------
>> > >
>> > > Day 7 : The network is operational but most VMs (250 of 300)
>> > > have been down since Day 4.
>> > > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) erased.
>> > >
>> > > The VirtualRouter VM was one of them. Filesystem corruption
>> > > prevented it from rebooting normally.
>> > >
>> > > Surviving VMs are all on the same KVM/GFS2 cluster.
>> > > The SSVM is one of them. Messages on its console indicate it was
>> > > temporarily in read-only mode.
>> > >
>> > > Hard way back to life (actions) :
>> > > ---------------------------------
>> > >
>> > > 1. VirtualRouter VM destroyed by an administrator, to let
>> > > CloudStack recreate it from the template.
>> > >
>> > > BUT :)
>> > >
>> > > The SystemVM KVM template is not available. Its status in the
>> > > GUI is "CONNECTION REFUSED".
>> > > The URL it was downloaded from during install is no longer
>> > > valid (an old and unavailable internal mirror server instead of
>> > > http://download.cloud.com)
>> > >
>> > > => we are unable to restart the stopped VMs or create new ones
>> > >
>> > > 2. Manual download of the template on the Management Server, as
>> > > in a fresh install:
>> > >
>> > > ---
>> > > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
>> > > -m /mnt/secondary/ -u
>> > > http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2
>> > > -h kvm -F
>> > > ---
>> > >
>> > > It's not sufficient. The mysql table template_host_ref does not
>> > > change, even when changing the URL in the mysql tables.
>> > > We still have "CONNECTION REFUSED" as the template status in
>> > > mysql and in the GUI.
>> > >
>> > > 3. After analysis, we needed to alter the mysql tables manually
>> > > (the template_id of the KVM systemVM was x) :
>> > >
>> > > ---
>> > > update template_host_ref set download_state='DOWNLOADED' where
>> > > template_id=x;
>> > > update template_host_ref set job_id='NULL' where template_id=x;
>> > > <= may be useless
>> > > ---
>> > >
>> > > 4. As in MySQL, the status in the GUI is now DOWNLOADED.
>> > >
>> > > 5. On power-on of a stopped VM, CloudStack builds a new
>> > > VirtualRouter VM, and we can let users manually start their
>> > > stopped VMs.
>> > >
>> > >
>> > > Questions :
>> > > -----------
>> > >
>> > > 1. What stopped and destroyed the libvirt domains of our VMs?
>> > > There is some code that could do this, but I'm not sure.
>> > >
>> > > 2. Is it possible that CloudStack autonomously triggered the
>> > > re-download of the systemVM template? Or does it have to be a
>> > > human interaction?
>> > >
>> > > 3.
In 4.x, is the risk of a corrupted systemVM
>> > > template, or one with a bad status, still present? Is there any
>> > > warning beyond a simple "connection refused", which is not
>> > > really visible as an alert?
>> > >
>> > > 4. Does CloudStack retry by default to restart VMs that should
>> > > be up, or do we need to configure this?
>> > >
>> > >
>> > > --------------------end of long part----------------------------------------
>> > >
>> > >
>> > > --
>> > > Laurent Steff
>> > >
>> > > DSI/SESI
>> > > http://www.inria.fr/
>> > >
>> >
>>
>> --
>> Laurent Steff
>>
>> DSI/SESI
>> INRIA
>> Tel.: +33 1 39 63 50 81
>> Mobile: +33 6 87 66 77 85
>> http://www.inria.fr/
>>
>
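[Editor's note] The manual template_host_ref fix from step 3 of Laurent's recovery can be collected into one SQL file. Two assumptions here: "x" stays a placeholder for the real template_id, exactly as in the mail, and SQL NULL (rather than the string 'NULL' used in the original statements) is likely what was intended for job_id.

```shell
# Sketch: write the recovery SQL from step 3 into one file.
cat > /tmp/fix-systemvm-template.sql <<'EOF'
-- Mark the KVM systemVM template as downloaded again
UPDATE template_host_ref SET download_state='DOWNLOADED' WHERE template_id=x;
-- Clear the stale download job (may be unnecessary, per the mail)
UPDATE template_host_ref SET job_id=NULL WHERE template_id=x;
EOF

# On a live system this would be fed to the cloud database, e.g.:
#   mysql cloud < /tmp/fix-systemvm-template.sql
grep -c '^UPDATE' /tmp/fix-systemvm-template.sql
```

As the thread notes, after these updates the GUI status matched MySQL ("DOWNLOADED") and system VMs could be recreated.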