Arik Hadas <aha...@redhat.com> writes:

> On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkap...@redhat.com> wrote:
>
>> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenh...@redhat.com> wrote:
>>
>>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>>
>>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>
>> Seems like another bug. The migration failed since, for some reason, the VM is already defined on the destination host.
>>
>> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>
> Milan, Francesco, could it be that because of [1], which appears on the destination host right after shutting down the VM, the VM remained defined on that host?
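A small, self-contained sketch of the failure mode behind the traceback above. It is illustrative only; the class and attribute names are invented and this is not the actual vdsm code, although vdsm does keep its running VMs in an in-memory container keyed by vmId:

    # Sketch: an incoming migration starts with a 'create' call on the
    # destination host; if an entry for that vmId is still present there
    # (for example because no destroy was ever issued after a previous run),
    # the call fails immediately with VMExists.

    class VMExists(Exception):
        pass

    class FakeHost(object):
        def __init__(self):
            self.vm_container = {}            # vmId -> VM object

        def create(self, vm_params):
            """Destination-side entry point of an incoming migration."""
            vm_id = vm_params['vmId']
            if vm_id in self.vm_container:    # leftover VM was never destroyed
                raise VMExists('Virtual machine already exists: %s' % vm_id)
            self.vm_container[vm_id] = object()   # placeholder for the new VM

        def destroy(self, vm_id):
            """Without this call after the VM leaves the host, the entry leaks."""
            self.vm_container.pop(vm_id, None)

    host = FakeHost()
    host.create({'vmId': 'vm-uuid-1'})        # first incoming migration succeeds
    # host.destroy('vm-uuid-1')               # <- the destroy call that never happened
    try:
        host.create({'vmId': 'vm-uuid-1'})    # next incoming migration
    except VMExists as e:
        print('FINISH create error=%s' % e)   # mirrors the log line above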
I can't see any destroy call in the logs after the successful preceding migration from the given host. That would explain the "VMExists" error.

> [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
>     v.onLibvirtLifecycleEvent(event, detail, None)
> AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

That means that a life cycle event for an unknown VM has arrived, in this case apparently a destroy event, following the destroy call after the failed incoming migration. The reported AttributeError is a minor bug, already fixed in master. So it's most likely unrelated to the discussed problem.

>>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkap...@redhat.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
>>>>
>>>> - 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
>>>>
>>>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
>>>>
>>>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl).
>>>>
>>>>   * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls 'onFailure' of the callback and re-throws the exception.
>>>>
>>>>       catch (Throwable t) {
>>>>           getParameters().getCallback().onFailure(t);
>>>>           throw t;
>>>>       }
>>>>
>>>>   * The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
>>>>
>>>>   * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock. The following warning is printed to the log -
>>>>
>>>>       WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
>>>>
>>>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>>>
>>>>   * SetupNetworks takes the monitoring lock.
>>>>
>>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.
>>>>
>>>>   * When the first request is removed from the queue ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).
>>>>
>>>>   * The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
>>>>   * The following warning log is printed -
>>>>
>>>>       WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the root cause is the late processing of the 'getCapabilitiesAsync' requests, which causes the loss of the monitoring lock, together with the late and multiple processing of the failure.
>>>>
>>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice, and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
>>>>
>>>> Thanks,
>>>> Alona.
>>>>
>>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <dan...@redhat.com> wrote:
>>>>
>>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <eh...@redhat.com> wrote:
>>>>>
>>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <ee...@redhat.com> wrote:
>>>>>>
>>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>>>> Is it still failing?
>>>>>>>
>>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkor...@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <dan...@redhat.com> wrote:
>>>>>>>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>>>>>>>> >
>>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>>>>>>
>>>>>> Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
>>>>>>
>>>>>> Has the RPC connection between Engine and VDSM failed?
>>>>>
>>>>> Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage, and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
>>>
>>> --
>>> Gal Ben Haim
>>> RHV DevOps
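To make the lock problem Alona describes above concrete: the engine code in question is Java, but the failure mode, a failure callback that can fire both from the immediate connection error and, minutes later, from the ResponseTracker timeout, each time releasing the monitoring lock, can be sketched briefly. The sketch below (Python, to stay consistent with the vdsm snippets earlier in the thread) only illustrates an at-most-once release guard on the callback; all class and method names are invented and do not correspond to the actual engine classes. An owner-aware releaseLock in the lock manager would be a complementary option.

    import threading

    class InMemoryLockManager(object):
        """Toy stand-in for the engine's lock manager."""
        def __init__(self):
            self._locks = set()

        def acquire(self, key):
            self._locks.add(key)

        def release(self, key):
            if key not in self._locks:
                print("WARN: Trying to release exclusive lock which does not exist, lock key: %r" % key)
                return
            self._locks.remove(key)

    class MonitoringCallback(object):
        """Failure callback that releases the monitoring lock at most once."""
        def __init__(self, lock_manager, lock_key):
            self._lock_manager = lock_manager
            self._lock_key = lock_key
            self._released = False
            self._mutex = threading.Lock()

        def on_failure(self, error):
            # Can be reached twice: once from the immediate ConnectException
            # path, and again when ResponseTracker times the request out
            # minutes later.
            with self._mutex:
                if self._released:
                    return                  # late or duplicate failure: ignore
                self._released = True
            self._lock_manager.release(self._lock_key)

    locks = InMemoryLockManager()
    locks.acquire('VDS_INIT')               # taken by the monitoring refresh
    cb = MonitoringCallback(locks, 'VDS_INIT')
    cb.on_failure(RuntimeError('Connection refused'))  # releases VDS_INIT once
    locks.acquire('VDS_INIT')               # later taken by SetupNetworks
    cb.on_failure(RuntimeError('Vds timeout'))  # late timeout no longer steals the lock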