Sorry, more on this issue, I see my logs are rapidly filling up my disk space on node02 with this error in /var/log/messages...
Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8 not assigned to domain

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'

Not sure if this is related at all though?

Thanks.

Regards.

Neil Wilson.

On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson...@gmail.com> wrote:
> Hi Dafna,
>
> Thanks for clarifying that, I found the migration issue and this was
> resolved once I sorted out the ISO domain problem.
>
> I'm sorry I don't understand your last question?
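[Editor's note: the second traceback fails because some Drive objects in this vdsm build lack a 'format' attribute. Below is a hedged sketch of a defensive version of the failing check; the Drive stub and the needs_extension_check name are ours for illustration, not vdsm's actual code or official fix.]

```python
# Illustrative sketch only: a stand-in for vdsm's Drive object and a
# defensive rewrite of the check that raises in _highWrite (vm.py line 509).
class Drive(object):
    """Minimal stub; real vdsm Drive objects carry many more fields."""
    def __init__(self, blockDev, format=None):
        self.blockDev = blockDev
        if format is not None:
            # Mimic the bug: some drive definitions simply lack 'format'.
            self.format = format

def needs_extension_check(vmDrive):
    # Original line: if not vmDrive.blockDev or vmDrive.format != 'cow':
    # getattr() with a default avoids the AttributeError when the
    # attribute is missing, treating such drives as "not cow".
    return vmDrive.blockDev and getattr(vmDrive, 'format', None) == 'cow'

print(needs_extension_check(Drive(blockDev=True, format='cow')))  # True
print(needs_extension_check(Drive(blockDev=True)))                # False, no crash
```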
> "> after the engine restart, do you still see a problem with the size
> or did the report of the size change?"
>
> The migration issue was resolved; it's now just a matter of tracking down
> why the two VMs paused on their own, one on the 8th of Jan (I think)
> and one on the 19th of Jan.
>
> Thank you.
>
> Regards.
>
> Neil Wilson.
>
>
> On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <d...@redhat.com> wrote:
>> yes - engine lost communication with vdsm and it has no way of knowing if
>> the host is down or if there was a network issue, so a network issue would
>> cause the same errors that I see in the logs.
>>
>> The error you posted from the ISO domain is the reason the VMs have failed
>> migration - if a VM is run with a CD and the CD is gone then the VM will
>> not be able to be migrated.
>>
>> after the engine restart, do you still see a problem with the size or did
>> the report of the size change?
>>
>> Dafna
>>
>>
>> On 01/28/2014 01:02 PM, Neil wrote:
>>>
>>> Hi Dafna,
>>>
>>> Thanks for coming back to me. I'll try to answer your queries one by one.
>>>
>>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <d...@redhat.com> wrote:
>>>>
>>>> you had a problem with your storage on the 14th of Jan and one of the
>>>> hosts rebooted (if you have the vdsm log from that day then I can see
>>>> what happened on the vdsm side).
>>>> in engine, I could see a problem with the export domain, and this should
>>>> not have caused a reboot.
>>>
>>> 1.) I don't unfortunately have logs going back that far. Looking at
>>> all 3 hosts' uptime, the one with the least uptime is 21 days and the
>>> others are all over 40 days, so there definitely wasn't a host that
>>> rebooted on the 14th of Jan. Would a network issue or firewall issue
>>> also cause the error you've seen to look as if a host rebooted? There
>>> was a bonding mode change on the 14th of January, so perhaps this
>>> caused the issue?
>>>
>>>
>>>> Can you tell me if you had a problem with the data domain as well, or
>>>> was it just the export domain? Were you having any VMs exported/imported
>>>> at that time?
>>>> In any case - this is a bug.
>>>
>>> 2.) I think this was the same day that the bonding mode was changed on
>>> the host while the host was live (by mistake), and it had SPM running on
>>> it. I haven't done any importing or exporting for a few years on this
>>> oVirt setup.
>>>
>>>
>>>> As for the VMs - if the VMs are no longer in migrating state then please
>>>> restart the ovirt-engine service (looks like a cache issue).
>>>
>>> 3.) Restarted ovirt-engine, logging now appears to be normal without any
>>> errors.
>>>
>>>
>>>> if they are in migrating state - there should have been a timeout a long
>>>> time ago.
>>>> can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>>>> all hosts?
>>>
>>> 4.) Ran on all hosts...
>>>
>>> node01.blabla.com
>>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam     Up
>>> 502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage  Up
>>> ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports  Up
>>> 2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux      Up
>>> 0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle   Up
>>>
>>>  Id    Name       State
>>> ----------------------------------------------------
>>>  1     adam       running
>>>  2     reports    running
>>>  4     tux        running
>>>  6     Moodle     running
>>>  7     babbage    running
>>>
>>> node02.blabla.com
>>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b  2879   spam      Up
>>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02   Up
>>> ac2a3f99-a6db-4cae-955d-efdfb901abb7  5605   software  Up
>>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e  8870   zimbra    Up
>>>
>>>  Id    Name       State
>>> ----------------------------------------------------
>>>  9     proxy02    running
>>>  10    spam       running
>>>  12    software   running
>>>  13    zimbra     running
>>>
>>> node03.blabla.com
>>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp    Up
>>> 16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver  Up
>>>
>>>  Id    Name       State
>>> ----------------------------------------------------
>>>  13    oliver     running
>>>  14    dhcp       running
>>>
>>>
>>>> Last thing is that your ISO domain seems to be having issues as well.
>>>> This should not affect the host status, but if any of the VMs were
>>>> booted from an ISO or have an ISO attached in the boot sequence, this
>>>> will explain the migration issue.
>>>
>>> There was an ISO domain issue a while back, but this was corrected
>>> about 2 weeks ago after iptables re-enabled itself on boot after
>>> running updates. I've checked now and the ISO domain appears to be
>>> fine and I can see all the images stored within.
>>>
>>> I've stumbled across what appears to be another error; all three
>>> hosts are showing this over and over in /var/log/messages, and I'm not
>>> sure if it's related? ...
>>>
>>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>>> <AdvancedStatsFunction _highWrite at 0x2ce0998>
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/sampling.py", line 351, in collect
>>>     statsFunction()
>>>   File "/usr/share/vdsm/sampling.py", line 226, in __call__
>>>     retValue = self._function(*args, **kwargs)
>>>   File "/usr/share/vdsm/vm.py", line 509, in _highWrite
>>>     if not vmDrive.blockDev or vmDrive.format != 'cow':
>>> AttributeError: 'Drive' object has no attribute 'format'
>>>
>>> I've attached the full vdsm log from node02 to this reply.
>>>
>>> Please shout if you need anything else.
>>>
>>> Thank you.
>>>
>>> Regards.
>>>
>>> Neil Wilson.
>>>
>>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> Sorry for the very late reply, I've been out of the office doing
>>>>> installations.
>>>>> Unfortunately due to the time delay, my oldest logs only go as far
>>>>> back as the attached.
>>>>>
>>>>> I've only grep'd for Thread-286029 in the vdsm log.
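[Editor's note: the `vdsClient -s 0 list table` and `virsh -r list` listings earlier in the thread can be cross-checked mechanically. The sketch below is a hypothetical helper, not a vdsm or libvirt tool; the column positions are assumed from the output pasted above.]

```python
# Hypothetical helper to cross-check the two per-host listings:
# `vdsClient -s 0 list table` vs `virsh -r list`.
def vdsclient_names(text):
    # vdsClient rows look like: <vmId> <pid> <name> <status>
    return {line.split()[2] for line in text.strip().splitlines()}

def virsh_names(text):
    # virsh prints a two-line header ("Id Name State" plus a separator),
    # then rows like: <id> <name> <state>
    return {line.split()[1] for line in text.strip().splitlines()[2:]}

vds = """dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b  2879  spam      Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7  5605  software  Up"""
virsh = """ Id    Name      State
----------------------------------------------------
 10    spam      running
 12    software  running"""

# An empty symmetric difference means vdsm and libvirt agree on which
# VMs are running on the host, as they do in the listings above.
print(vdsclient_names(vds) ^ virsh_names(virsh))  # set()
```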
>>>>> The engine.log I'm not sure what info is required, so the full log
>>>>> is attached.
>>>>>
>>>>> Please shout if you need any info or further details.
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Neil Wilson.
>>>>>
>>>>>
>>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbour...@redhat.com> wrote:
>>>>>>
>>>>>> Could you please attach the engine.log from the same time?
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Neil" <nwilson...@gmail.com>
>>>>>>> To: d...@redhat.com
>>>>>>> Cc: "users" <users@ovirt.org>
>>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>>
>>>>>>> Hi Dafna,
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>>>>
>>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>>>> from June and Oct 2013.
>>>>>>>
>>>>>>> I'm using FC storage, with 11 VMs and 3 nodes/hosts; 9 of the 11 VMs
>>>>>>> have thin provisioned disks.
>>>>>>>
>>>>>>> Please shout if you'd like any further info or logs.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> Neil Wilson.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <d...@redhat.com> wrote:
>>>>>>>>
>>>>>>>> Hi Neil,
>>>>>>>>
>>>>>>>> Can you please attach the vdsm logs?
>>>>>>>> also, as for the vm's, do they have any snapshots?
>>>>>>>> from your suggestion to allocate more luns, are you using iscsi or FC?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dafna
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>>>
>>>>>>>>> Thanks for the replies guys,
>>>>>>>>>
>>>>>>>>> Looking at my two VMs that have paused so far through the oVirt GUI,
>>>>>>>>> the following sizes show under Disks.
>>>>>>>>>
>>>>>>>>> VM Reports:
>>>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>>>> Looking on the CentOS OS side, disk size is 33G and used is 12G with
>>>>>>>>> 19G available (40% usage).
>>>>>>>>>
>>>>>>>>> VM Babbage:
>>>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>>>> On the Server 2003 OS side, disk size is 39.9GB and used is 16.3GB,
>>>>>>>>> so under 50% usage.
>>>>>>>>>
>>>>>>>>> Do you see any issues with the above stats?
>>>>>>>>>
>>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>>
>>>>>>>>> Size: 6887 GB
>>>>>>>>> Available: 1948 GB
>>>>>>>>> Used: 4939 GB
>>>>>>>>> Allocated: 1196 GB
>>>>>>>>> Over Allocation: 61%
>>>>>>>>>
>>>>>>>>> Could there be a problem here? I can allocate additional LUNs if you
>>>>>>>>> feel the space isn't correctly allocated.
>>>>>>>>>
>>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>>> important machine locks up.
>>>>>>>>>
>>>>>>>>> Thank you and much appreciated.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>> Neil Wilson.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <d...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> the storage space is configured in percentages and not physical
>>>>>>>>>> size, so if 20G is less than 10% (default config) of your storage
>>>>>>>>>> it will pause the vms regardless of how many GB you still have.
>>>>>>>>>> this is configurable though, so you can change it to less than 10%
>>>>>>>>>> if you like.
>>>>>>>>>>
>>>>>>>>>> to answer the second question, vm's will not pause on an ENOSPC
>>>>>>>>>> error if they run out of space internally, but only if the external
>>>>>>>>>> storage cannot be consumed.
>>>>>>>>>> so only if you run out of space in the storage, and not if the vm
>>>>>>>>>> runs out of space in its own fs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, attached is engine.log; I've taken out the two sections
>>>>>>>>>>> where each of the VMs were paused.
>>>>>>>>>>>
>>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space
>>>>>>>>>>> error" mean the main storage domain has run out of storage, or
>>>>>>>>>>> that the VM has run out?
>>>>>>>>>>>
>>>>>>>>>>> Both VMs appear to have been running on node01 when they were
>>>>>>>>>>> paused. My vdsm versions are all...
>>>>>>>>>>>
>>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>>
>>>>>>>>>>> I currently have a 61% over-allocation ratio on my primary storage
>>>>>>>>>>> domain, with 1948GB available.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>>> The VMs are thin provisioned. The Server 2003 VM hasn't run out
>>>>>>>>>>>> of disk space; there is about 20Gigs free, and the usage barely
>>>>>>>>>>>> grows as the VM only shares printers. The other VM that paused is
>>>>>>>>>>>> also on thin provisioned disks and also has plenty of space; this
>>>>>>>>>>>> guest is running CentOS 6.3 64-bit and only runs basic reporting.
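[Editor's note: Dafna's point above, that the pause threshold is a percentage of the storage domain rather than an absolute size, can be sketched numerically. This is a toy illustration using the domain size from this thread; low_on_space is our name, not an oVirt API, and the 10% figure mirrors the default she mentions.]

```python
# Toy model of the free-space rule described above: oVirt compares the
# *percentage* of free space on the storage domain against a threshold
# (10% by default, per the thread), not the absolute number of GB left.
def low_on_space(free_gb, total_gb, threshold_pct=10.0):
    return (free_gb / float(total_gb)) * 100.0 < threshold_pct

# On the ~6887 GB domain from this thread, 20 GB free is well under 1%,
# so VMs would be paused even though 20 GB sounds like plenty:
print(low_on_space(20, 6887))    # True
print(low_on_space(1948, 6887))  # False (about 28% free)
```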
>>>>>>>>>>>>
>>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>>>>> unplugged in oVirt, and we had to remove it and re-add it in
>>>>>>>>>>>> order to correct the issue. The CentOS VM did not have the same
>>>>>>>>>>>> issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite
>>>>>>>>>>>> critical; any thoughts or ideas?
>>>>>>>>>>>>
>>>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to the
>>>>>>>>>>>> official CentOS repo and updating to 3.3.1-2. Prior to updating I
>>>>>>>>>>>> hadn't had this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dya...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse disks?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space
>>>>>>>>>>>>> on the storage domain, and it is done intentionally, so that the
>>>>>>>>>>>>> VM will not experience a disk corruption.
>>>>>>>>>>>>> If you have thin provisioned disks, and the VM writes to its
>>>>>>>>>>>>> disks faster than the disks can grow, this is exactly what you
>>>>>>>>>>>>> will see.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've had two different VMs randomly pause this past week, and
>>>>>>>>>>>>>> inside oVirt the error received is something like 'vm ran out
>>>>>>>>>>>>>> of storage and was paused'.
>>>>>>>>>>>>>> Resuming the VMs didn't work and I had to force them off and
>>>>>>>>>>>>>> then on, which resolved the issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I realise this is very vague, so if you could please let me
>>>>>>>>>>>>>> know which logs to send in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Dafna Ron
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dafna Ron
>>>>
>>>> --
>>>> Dafna Ron
>>
>> --
>> Dafna Ron

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users