On Jun 6, 2019 12:52, souvaliotima...@mail.com wrote:
>
> Hello,
>
> I came upon a problem last month that I figured would be good to
> discuss here. I'm sorry I didn't post earlier, but time got away from me.
>
> I have set up a hyperconverged oVirt environment on Gluster for experimental
> use, as a means to see its behaviour and get used to its management and
> performance before setting it up as a production environment for use in our
> organization. The environment has been up and running since October 2018.
> The three nodes are HP ProLiant DL380 G7 and have the following
> characteristics:
>
> Mem: 22GB
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx
> HDD: 5x 300GB
> Network: BCM5709C with dual-port Gigabit
> OS: Linux RedHat 7.5.1804 (Core 3.10.0-862.3.2.el7.x86_64 x86_64) - Ovirt Node
> 4.2.3.1
>
> As I was working on the environment, the engine stopped working.
> Shortly before the HE stopped, I was in the web interface managing
> my VMs, when the browser froze and the HE also stopped responding to ICMP
> requests.
>
> The first thing I did was to connect via ssh to all nodes and run the command
> #hosted-engine --vm-status
> which showed that the HE was down on nodes 1 and 2 and up on the 3rd node.
>
> After executing
> #virsh -r list
> the list showed two of the VMs I had previously created, both up;
> the HE was nowhere.
>
> I tried to restart the HE with
> #hosted-engine --vm-start
> but it didn't work.
>
> I then put all nodes in maintenance mode with the command
> #hosted-engine --set-maintenance --mode=global
> (I guess I should have done that earlier) and re-ran
> #hosted-engine --vm-start
> which had the same result as before.
>
> After checking the mails the system sent to the root user, I saw there were
> several mails on the 3rd node (where the HE had been) reporting the HE's
> state.
> The messages were changing between EngineDown-EngineStart,
> EngineStart-EngineStarting, EngineStarting-EngineMaybeAway,
> EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown,
> EngineDown-EngineStart and so forth.
>
> I continued by searching the following logs on all nodes:
> /var/log/libvirt/qemu/HostedEngine.log
> /var/log/libvirt/qemu/win10.log
> /var/log/libvirt/qemu/DNStest.log
> /var/log/vdsm/vdsm.log
> /var/log/ovirt-hosted-engine-ha/agent.log
>
> After that I spotted an error that had started appearing almost a month
> earlier on node #2:
> ERROR Internal server error
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>     res = method(**params)
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>     result = fn(*methodArgs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
>     return self._gluster.logicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
>     rv = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
>     status = self.svdsmProxy.glusterLogicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>     return callMethod()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
> AttributeError: 'AutoProxy[instance]' object has no attribute
> 'glusterLogicalVolumeList'
>
>
> I also checked the output of the following commands to see whether a
> required process was missing or had been killed, or whether a memory problem
> or a disk space shortage had led to the sudden death of a process:
> #ps -A
> #top
> #free -h
> #df -hT
>
> Finally, after some time delving in the logs, the output of
> #journalctl --dmesg
> showed the
> following message
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child.
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB,
> file-rss:2336kB, shmem-rss:12kB"
> after which ovirtmgmt stopped responding.

If you run out of memory, you should take that seriously. Dropping the cache seems like a workaround and not a fix.
Check if KSM is enabled - this will merge your VMs' identical memory pages in exchange for CPU cycles - still better than getting a VM killed.
Also, you can protect the HostedEngine from the OOM killer.
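A minimal sketch of both checks, under stated assumptions. The kernel exposes the KSM switch at /sys/kernel/mm/ksm/run (0 = stopped, 1 = running, 2 = stopped and unmerged) and the per-process OOM adjustment at /proc/<pid>/oom_score_adj (range -1000 to 1000). The function names ksm_state and set_oom_adj are made up for this example, and the pgrep pattern in the usage comment assumes that the qemu-kvm command line of the engine VM contains the string HostedEngine; verify that on your nodes first.

```shell
#!/bin/bash
# Sketch only. File locations are parameters so the functions can be
# tried against test files before touching a real node.

# Report the KSM state from the kernel's control file:
# 0 = stopped, 1 = running, 2 = stopped and merged pages unmerged.
ksm_state() {
    local run_file="${1:-/sys/kernel/mm/ksm/run}"
    case "$(cat "$run_file" 2>/dev/null)" in
        1) echo "running" ;;
        0) echo "stopped" ;;
        2) echo "stopped-unmerged" ;;
        *) echo "unknown" ;;
    esac
}

# Write an OOM score adjustment for one process. The valid range is
# -1000..1000; lower values make the OOM killer less likely to pick it.
set_oom_adj() {
    local pid="$1" adj="$2" proc_root="${3:-/proc}"
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "adjustment out of range: $adj" >&2
        return 1
    fi
    echo "$adj" > "$proc_root/$pid/oom_score_adj"
}

# Usage on a node, as root (the pgrep pattern is an assumption):
#   ksm_state
#   pid=$(pgrep -f 'qemu-kvm.*HostedEngine' | head -n1)
#   [ -n "$pid" ] && set_oom_adj "$pid" -500
```

Lowering the value needs root (CAP_SYS_RESOURCE), and it applies only to the running process, so it has to be set again after every restart of the engine VM. It also only moves the target: if the node is really short of memory, the OOM killer will pick another process instead.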
> I tried to restart vhostmd by executing
> #/etc/rc.d/init.d/vhostmd start
> but it didn't work.
>
> Finally, I decided to run the HE restart command on the other nodes as well
> (I'd figured that since the HE was last running on node #3, that's where
> I should try to restart it). So, I ran
> #hosted-engine --vm-start
> and the output was
> "Command VM.getStats with args {'vmID':'...<the HE's ID>....'} failed:
> (code=1,message=Virtual machine does not exist: {'vmID':'...<the HE's
> ID>....'})"
> And then I ran the command again and the output was
> "VM exists and its status is Powering Up."
>
> After that I executed
> #virsh -r list
> and the output was the following:
>  Id    Name                           State
> ----------------------------------------------------
>  2     HostedEngine                   running
>
> After the HE's restart two mails came that stated:
> ReinitializeFSM-EngineStarting and EngineStarting-EngineUp
>
> After that, and after checking that we had access to the web interface again,
> we executed
> hosted-engine --set-maintenance --mode=none
> to get out of maintenance mode.
>
> The thing is, I am still not 100% sure what the problem was that led to the
> shutdown of the hosted engine, and I think that maybe some of the steps I took
> were not needed. I believe it was because the qemu-kvm process was killed
> when there was not enough memory for it, but is this the real cause? I wasn't
> doing anything unusual before the shutdown that would make me think it was
> because of the new VM that was still in shutdown mode or anything of the
> sort. Also, I believe it may be because of a memory shortage, because I
> hadn't executed the
> #sync ; echo 3 > /proc/sys/vm/drop_caches
> command for a couple of weeks.
>
> What are your thoughts on this? Could you point me to where to search for
> more information on the topic, or tell me what the right process to follow is
> when something like this happens?
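As a first step when a VM disappears like this, it may help to ask the kernel log directly whether the OOM killer was involved. The sketch below assumes the kernel message format quoted earlier in this thread ("Killed process <pid> (<name>) ... anon-rss:<n>kB"); the function name oom_kills is made up for this example, and other kernel versions may word the line differently.

```shell
#!/bin/bash
# Sketch only: reads kernel log text on stdin and prints one line per
# OOM kill in the form "pid name anon_rss_kB". Lines that do not match
# the assumed "Killed process ..." format are ignored.
oom_kills() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p'
}

# Usage on a node (needs permission to read the kernel log):
#   journalctl --dmesg --no-pager | oom_kills
```

For the message quoted above this prints "5422 qemu-kvm 9310396", i.e. the killed qemu-kvm process held roughly 8.9 GiB of anonymous memory on a 22 GB node.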

Check sar (there is a graphical utility called 'ksar') and look at CPU, memory, swap, context switches, I/O and network usage.
Create a simple systemd service to monitor your nodes, or even better, set up real monitoring software so you can act proactively.

> Also, I have set up a few VMs but only three are up and they have no users
> yet; even so, the buffers fill almost to the brim when the usage is almost
> non-existent. If you have an environment that has some users or you use the
> VMs as virtual servers of some sort, what is the memory consumption?
> What's the optimal size for the memory?

What is your tuned profile? Any customizations there?

Best Regards,
Strahil Nikolov

> Thank you all very much.
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/PKRB26GSDQ5JVHD75HEPK346NTI7UQK2/
_______________________________________________
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/R6KLODYO4T5TKCSIULXQD2SEWGS74WTQ/