On Jan 28, 2014, at 19:18, Dafna Ron <d...@redhat.com> wrote:

> yes - engine lost communication with vdsm and it has no way of knowing if
> the host is down or if there was a network issue, so a network issue would
> cause the same errors that I see in the logs.
>
> The error you put on the iso is the reason the vm's have failed migration -
> if a vm is run with a cd and the cd is gone then the vm will not be able to
> be migrated.
which, as I learned last week, is not entirely correct. A pure libvirt VM
seems to work fine… so it must be something somewhere in oVirt :( Looking
into it, but just for future reference we want it to work :)

> after the engine restart, do you still see a problem with the size or did
> the report of size change?
>
> Dafna
>
> On 01/28/2014 01:02 PM, Neil wrote:
>> Hi Dafna,
>>
>> Thanks for coming back to me. I'll try to answer your queries one by one.
>>
>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <d...@redhat.com> wrote:
>>> you had a problem with your storage on the 14th of Jan and one of the
>>> hosts rebooted (if you have the vdsm log from that day then I can see
>>> what happened on the vdsm side).
>>> In engine, I could see a problem with the export domain, and this should
>>> not have caused a reboot.
>> 1.) Unfortunately I don't have logs going back that far. Looking at all
>> 3 hosts' uptime, the one with the least uptime is 21 days and the others
>> are all over 40 days, so there definitely wasn't a host that rebooted on
>> the 14th of Jan. Would a network or firewall issue also cause the error
>> you've seen to look as if a host rebooted? There was a bonding mode
>> change on the 14th of January, so perhaps this caused the issue?
>>
>>> Can you tell me if you had a problem with the data domain as well or was
>>> it just the export domain? Were you having any vm's exported/imported at
>>> that time?
>>> In any case - this is a bug.
>> 2.) I think this was the same day that the bonding mode was changed on
>> the host while the host was live (by mistake), and it had SPM running on
>> it. I haven't done any importing or exporting for a few years on this
>> oVirt setup.
>>
>>> As for the vm's - if the vm's are no longer in migrating state then
>>> please restart the ovirt-engine service (looks like a cache issue).
>> 3.) Restarted ovirt-engine; logging now appears to be normal without any
>> errors.
>>
>>> if they are in migrating state - there should have been a timeout a long
>>> time ago.
>>> Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>>> all hosts?
>> 4.) Ran on all hosts...
>>
>> node01.blabla.com
>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam     Up
>> 502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage  Up
>> ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports  Up
>> 2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux      Up
>> 0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle   Up
>>
>>  Id    Name      State
>> ----------------------------------------------------
>>  1     adam      running
>>  2     reports   running
>>  4     tux       running
>>  6     Moodle    running
>>  7     babbage   running
>>
>> node02.blabla.com
>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam      Up
>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02   Up
>> ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software  Up
>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra    Up
>>
>>  Id    Name      State
>> ----------------------------------------------------
>>  9     proxy02   running
>>  10    spam      running
>>  12    software  running
>>  13    zimbra    running
>>
>> node03.blabla.com
>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp    Up
>> 16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver  Up
>>
>>  Id    Name      State
>> ----------------------------------------------------
>>  13    oliver    running
>>  14    dhcp      running
>>
>>> Last thing is that your ISO domain seems to be having issues as well.
>>> This should not affect the host status, but if any of the vm's were
>>> booted from an iso or have an iso attached in the boot sequence, this
>>> will explain the migration issue.
>> There was an ISO domain issue a while back, but this was corrected about
>> 2 weeks ago after iptables re-enabled itself on boot after running
>> updates. I've checked now and the ISO domain appears to be fine and I can
>> see all the images stored within.
>>
>> I've stumbled across what appears to be another error; all three hosts
>> are showing this over and over in /var/log/messages, and I'm not sure if
>> it's related...
>>
>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>> <AdvancedStatsFunction _highWrite at 0x2ce0998>
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/sampling.py", line 351, in collect
>>     statsFunction()
>>   File "/usr/share/vdsm/sampling.py", line 226, in __call__
>>     retValue = self._function(*args, **kwargs)
>>   File "/usr/share/vdsm/vm.py", line 509, in _highWrite
>>     if not vmDrive.blockDev or vmDrive.format != 'cow':
>> AttributeError: 'Drive' object has no attribute 'format'
>>
>> I've attached the full vdsm log from node02 to this reply.
>>
>> Please shout if you need anything else.
>>
>> Thank you.
>>
>> Regards.
>>
>> Neil Wilson.
>>
>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>> Hi guys,
>>>>
>>>> Sorry for the very late reply, I've been out of the office doing
>>>> installations.
>>>> Unfortunately, due to the time delay, my oldest logs only go back as
>>>> far as the attached.
>>>>
>>>> I've only grep'd for Thread-286029 in the vdsm log. For the engine.log
>>>> I'm not sure what info is required, so the full log is attached.
>>>>
>>>> Please shout if you need any info or further details.
>>>>
>>>> Thank you very much.
>>>>
>>>> Regards.
>>>>
>>>> Neil Wilson.
>>>>
>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine
>>>> <mbour...@redhat.com> wrote:
>>>>> Could you please attach the engine.log from the same time?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Neil" <nwilson...@gmail.com>
>>>>>> To: d...@redhat.com
>>>>>> Cc: "users" <users@ovirt.org>
>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>
>>>>>> Hi Dafna,
>>>>>>
>>>>>> Thanks.
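[Editor's note: for readers hitting the same traceback - the failing check
in vm.py assumes every Drive object carries a `format` attribute. The sketch
below is illustrative only (the Drive class and helper names are made up,
not actual vdsm code); it reproduces the failure and shows a defensive
rewrite using getattr() with a default:]

```python
# Minimal illustration of the _highWrite AttributeError and a guarded fix.
# Drive here is a stand-in, not vdsm's real class.

class Drive:
    """Toy Drive object; 'format' may be absent, e.g. on a drive
    restored from older state that never set the attribute."""
    def __init__(self, blockDev, format=None):
        self.blockDev = blockDev
        if format is not None:
            self.format = format  # attribute missing when format is None

def needs_extension_unsafe(drive):
    # Mirrors the failing condition: raises AttributeError when
    # 'format' was never set on the drive.
    return drive.blockDev and drive.format == 'cow'

def needs_extension_safe(drive):
    # getattr() with a default tolerates drives missing the attribute.
    return drive.blockDev and getattr(drive, 'format', None) == 'cow'

cow = Drive(blockDev=True, format='cow')
legacy = Drive(blockDev=True)  # no 'format' attribute at all

print(needs_extension_safe(cow))     # True
print(needs_extension_safe(legacy))  # False
```

Calling `needs_extension_unsafe(legacy)` raises the same
`AttributeError: 'Drive' object has no attribute 'format'` seen in the log.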
>>>>>>
>>>>>> The vdsm logs are quite large, so I've only attached the logs for
>>>>>> the pause of the VM called Babbage on the 19th of Jan.
>>>>>>
>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has
>>>>>> two, from June and Oct 2013.
>>>>>>
>>>>>> I'm using FC storage, with 11 VM's and 3 nodes/hosts; 9 of the 11
>>>>>> VM's have thin provisioned disks.
>>>>>>
>>>>>> Please shout if you'd like any further info or logs.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Neil Wilson.
>>>>>>
>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <d...@redhat.com> wrote:
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> Can you please attach the vdsm logs?
>>>>>>> Also, as for the vm's, do they have any snapshots?
>>>>>>> From your suggestion to allocate more luns, are you using iscsi or
>>>>>>> FC?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dafna
>>>>>>>
>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>> Thanks for the replies guys,
>>>>>>>>
>>>>>>>> Looking at my two VM's that have paused so far, the following
>>>>>>>> sizes show under Disks in the oVirt GUI.
>>>>>>>>
>>>>>>>> VM Reports:
>>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>>> On the CentOS OS side, disk size is 33G and used is 12G, with 19G
>>>>>>>> available (40% usage).
>>>>>>>>
>>>>>>>> VM Babbage:
>>>>>>>> Virtual Size 40GB, Actual Size 53GB
>>>>>>>> On the Server 2003 OS side, disk size is 39.9GB and used is 16.3G,
>>>>>>>> so under 50% usage.
>>>>>>>>
>>>>>>>> Do you see any issues with the above stats?
>>>>>>>>
>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>
>>>>>>>> Size: 6887 GB
>>>>>>>> Available: 1948 GB
>>>>>>>> Used: 4939 GB
>>>>>>>> Allocated: 1196 GB
>>>>>>>> Over Allocation: 61%
>>>>>>>>
>>>>>>>> Could there be a problem here? I can allocate additional LUNs if
>>>>>>>> you feel the space isn't correctly allocated.
>>>>>>>>
>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>> important machine locks up.
>>>>>>>>
>>>>>>>> Thank you and much appreciated.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> Neil Wilson.
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <d...@redhat.com> wrote:
>>>>>>>>> the storage space threshold is configured in percentages, not
>>>>>>>>> physical size. So if 20G free is less than 10% (the default
>>>>>>>>> config) of your storage, it will pause the vms regardless of how
>>>>>>>>> many GB you still have.
>>>>>>>>> This is configurable though, so you can change it to less than
>>>>>>>>> 10% if you like.
>>>>>>>>>
>>>>>>>>> To answer the second question, vm's will not pause on an ENOSPC
>>>>>>>>> error if they run out of space internally, but only if the
>>>>>>>>> external storage cannot be consumed. So only if you run out of
>>>>>>>>> space on the storage, and not if a vm runs out of space on its
>>>>>>>>> own fs.
>>>>>>>>>
>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> Sorry, attached is engine.log; I've taken out the two sections
>>>>>>>>>> where each of the VM's were paused.
>>>>>>>>>>
>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space
>>>>>>>>>> error" mean the main storage domain has run out of storage, or
>>>>>>>>>> that the VM has run out?
>>>>>>>>>>
>>>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>>>> paused.
>>>>>>>>>> My vdsm versions are all...
>>>>>>>>>>
>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>
>>>>>>>>>> I currently have a 61% over-allocation ratio on my primary
>>>>>>>>>> storage domain, with 1948GB available.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> Neil Wilson.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>> The VM's are thin provisioned. The Server 2003 VM hasn't run
>>>>>>>>>>> out of disk space; there is about 20 gigs free, and the usage
>>>>>>>>>>> barely grows as the VM only shares printers. The other VM that
>>>>>>>>>>> paused is also on thin provisioned disks and also has plenty
>>>>>>>>>>> of space; this guest is running CentOS 6.3 64-bit and only
>>>>>>>>>>> runs basic reporting.
>>>>>>>>>>>
>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up
>>>>>>>>>>> as unplugged in oVirt, and we had to remove it and re-add it
>>>>>>>>>>> again in order to correct the issue. The CentOS VM did not have
>>>>>>>>>>> the same issue.
>>>>>>>>>>>
>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite
>>>>>>>>>>> critical; any thoughts or ideas?
>>>>>>>>>>>
>>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to
>>>>>>>>>>> the official CentOS repo and updating to 3.3.1-2. Prior to
>>>>>>>>>>> updating I hadn't had this issue.
>>>>>>>>>>>
>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson.
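[Editor's note: Dafna's point above - that the low-space check is a
percentage of the domain, not an absolute figure - can be sketched as
follows. The 10% default and the function name are assumptions for
illustration, not oVirt's actual config keys:]

```python
# Rough sketch of a percentage-based free-space check: the domain is
# considered low on space when free space drops below a percentage of
# total size, however large the absolute free figure is.

def low_on_space(free_gb, total_gb, critical_pct=10):
    """Return True when free space is under critical_pct of the domain."""
    return free_gb < total_gb * critical_pct / 100.0

# 20 GB free on a 500 GB domain is only 4% free -> below a 10% threshold,
# even though 20 GB sounds like plenty in absolute terms.
print(low_on_space(20, 500))     # True
print(low_on_space(20, 150))     # False (20/150 is about 13%)

# Neil's domain: 1948 GB free of 6887 GB is about 28% free, well clear
# of a 10% threshold.
print(low_on_space(1948, 6887))  # False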
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dya...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse
>>>>>>>>>>>> disks?
>>>>>>>>>>>>
>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of
>>>>>>>>>>>> space on the storage domain, and it is done intentionally, so
>>>>>>>>>>>> that the VM will not experience disk corruption. If you have
>>>>>>>>>>>> thin provisioned disks, and the VM writes to its disks faster
>>>>>>>>>>>> than the disks can grow, this is exactly what you will see.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've had two different VM's randomly pause this past week,
>>>>>>>>>>>>> and inside oVirt the error received is something like 'vm ran
>>>>>>>>>>>>> out of storage and was paused'.
>>>>>>>>>>>>> Resuming the vm's didn't work, and I had to force them off
>>>>>>>>>>>>> and then on, which resolved the issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I realise this is very vague, so please let me know which
>>>>>>>>>>>>> logs to send in.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>> Users@ovirt.org
>>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dafna Ron
>>>>>>>
>>>>>>> --
>>>>>>> Dafna Ron
>>>
>>> --
>>> Dafna Ron
>
> --
> Dafna Ron
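[Editor's note: Dan's explanation earlier in the thread - a thin
provisioned VM pauses when the guest writes faster than the volume can be
extended - can be illustrated with a toy model. All rates, chunk sizes,
and intervals below are made-up numbers for illustration, not vdsm's
actual extension parameters:]

```python
# Toy model of the thin-provisioning race: the volume is extended in
# fixed chunks at fixed intervals; if the guest's write rate outpaces
# the extension rate, a write lands beyond the allocation and the VM
# would be paused rather than risk corruption.

def simulate(write_rate_gb_s, extend_chunk_gb, extend_interval_s,
             allocated_gb, used_gb, seconds):
    """Step one second at a time; return True if the VM would pause."""
    for t in range(seconds):
        used_gb += write_rate_gb_s
        if used_gb > allocated_gb:
            return True  # write exceeded allocation -> pause
        if t % extend_interval_s == 0:
            allocated_gb += extend_chunk_gb  # periodic extension
    return False

# Slow writer: extension keeps up, the VM keeps running.
print(simulate(0.1, 1, 2, allocated_gb=1, used_gb=0, seconds=30))  # False
# Fast writer: outruns the 1 GB / 2 s extension and pauses.
print(simulate(1.5, 1, 2, allocated_gb=1, used_gb=0, seconds=30))  # True
```

This matches the symptom in the thread: plenty of free space inside the
guest filesystem, yet the VM pauses because the backing volume could not
grow fast enough.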