Working on a patch, will post a fix.
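The failure handling of the callback is not idempotent today: 'onFailure'
runs once on the immediate ConnectException and again when ResponseTracker
expires the request, and each run releases the monitoring lock. One possible
direction is to guard the callback so the cleanup runs at most once. A
minimal sketch only (class and method names are illustrative, not the actual
engine code):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative one-shot callback: performs the failure cleanup at most
    // once, even if onFailure is invoked both on the immediate
    // ConnectException and again later when ResponseTracker expires the
    // request from its queue.
    public class OneShotFailureCallback {

        private final AtomicBoolean handled = new AtomicBoolean(false);
        private final Runnable releaseMonitoringLock;

        public OneShotFailureCallback(Runnable releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        public void onFailure(Throwable t) {
            // compareAndSet lets exactly one invocation run the cleanup;
            // later calls become no-ops instead of releasing a lock that
            // may meanwhile belong to another flow (e.g. SetupNetworks).
            if (handled.compareAndSet(false, true)) {
                releaseMonitoringLock.run();
            }
        }
    }

With a guard like this, the late ResponseTracker cleanup could still invoke
'onFailure', but the monitoring lock would be released at most once, and
never out from under a later flow.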
Thanks,
Ravi

On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkap...@redhat.com> wrote:
> Hi all,
>
> Looking at the log, it seems that the new GetCapabilitiesAsync is
> responsible for the mess.
>
> - *08:29:47 - engine loses connectivity to host
>   'lago-basic-suite-4-2-host-0'.*
>
> - *Every 3 seconds a getCapabilitiesAsync request is sent to the host
>   (unsuccessfully).*
>
>   * Before each "getCapabilitiesAsync" the monitoring lock is taken
>     (VdsManager.refreshImpl).
>
>   * "getCapabilitiesAsync" immediately fails and throws
>     'VDSNetworkException: java.net.ConnectException: Connection refused'.
>     The exception is caught by
>     'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
>     'onFailure' of the callback and re-throws the exception.
>
>         catch (Throwable t) {
>             getParameters().getCallback().onFailure(t);
>             throw t;
>         }
>
>   * The 'onFailure' of the callback releases the "monitoringLock"
>     ('postProcessRefresh() -> afterRefreshTreatment() ->
>     if (!succeeded) lockManager.releaseLock(monitoringLock);').
>
>   * 'VdsManager.refreshImpl' catches the network exception, marks
>     'releaseLock = true' and *tries to release the already released
>     lock*. The following warning is printed to the log -
>
>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>     (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
>     release exclusive lock which does not exist, lock key:
>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
> - *08:30:51 - a successful getCapabilitiesAsync is sent.*
>
> - *08:32:55 - the failing test starts (Setup Networks for setting
>   ipv6).*
>
>   * SetupNetworks takes the monitoring lock.
>
> - *08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>   from 4 minutes ago from its queue and prints a VDSNetworkException:
>   Vds timeout occurred.*
>
>   * When the first request is removed from the queue
>     ('ResponseTracker.remove()'), *'Callback.onFailure' is invoked (for
>     the second time) -> the monitoring lock is released (the lock taken
>     by SetupNetworks!)*.
>
>   * *The other requests removed from the queue also try to release the
>     monitoring lock*, but there is nothing to release.
>
>   * The following warning is printed -
>
>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>     (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
>     release exclusive lock which does not exist, lock key:
>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
> - *08:33:00 - SetupNetworks fails on timeout, ~4 seconds after it
>   started.* Why? I'm not 100% sure, but I guess the root cause is the
>   late processing of the 'getCapabilitiesAsync' failures, which loses
>   the monitoring lock, together with the late and multiple handling of
>   each failure.
>
> Ravi, the 'getCapabilitiesAsync' failure is handled twice, and the lock
> release is attempted three times. Please share your opinion regarding
> how it should be fixed.
>
> Thanks,
> Alona.
>
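To restate the hazard Alona describes in a self-contained form: the
exclusive lock is released by its lock key alone, not by who acquired it,
so a stale 'onFailure' can free a lock that a later flow has since taken.
A toy model of the sequence above (illustrative only, not the real
InMemoryLockManager):

    import java.util.concurrent.ConcurrentHashMap;

    public class DoubleReleaseDemo {

        // Toy lock table keyed by the lock key alone: release(key) frees
        // the lock regardless of which flow acquired it.
        static final ConcurrentHashMap<String, String> LOCKS =
                new ConcurrentHashMap<>();

        static boolean acquire(String key, String owner) {
            return LOCKS.putIfAbsent(key, owner) == null;
        }

        static void release(String key) {
            if (LOCKS.remove(key) == null) {
                System.out.println("WARN Trying to release exclusive lock"
                        + " which does not exist, lock key: " + key);
            }
        }

        public static void main(String[] args) {
            String key = "ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT";

            acquire(key, "monitoring");    // 08:29:47 refreshImpl takes the lock
            release(key);                  // callback onFailure releases it
            release(key);                  // refreshImpl releases again -> WARN

            acquire(key, "SetupNetworks"); // 08:32:55 SetupNetworks takes the lock
            release(key);                  // 08:33:00 stale onFailure frees SetupNetworks' lock!
            release(key);                  // the other stale requests -> WARN
        }
    }

The release at 08:33:00 is the damaging one: it silently drops the
protection SetupNetworks relies on, while the other attempts only produce
the harmless WARN lines.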
> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <dan...@redhat.com> wrote:
>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <eh...@redhat.com> wrote:
>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <ee...@redhat.com> wrote:
>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>> Is it still failing?
>>>>
>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkor...@redhat.com>
>>>> wrote:
>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <dan...@redhat.com> wrote:
>>>>>> No, I am afraid that we have not managed to understand why setting
>>>>>> an ipv6 address took the host off the grid. We shall continue
>>>>>> researching this next week.
>>>>>>
>>>>>> Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>>>>> but could it possibly be related (I really doubt that)?
>>>
>>> Sorry, but I do not see how this problem is related to VDSM.
>>> There is nothing that indicates that there is a VDSM problem.
>>>
>>> Has the RPC connection between Engine and VDSM failed?
>>
>> Further up the thread, Piotr noticed that (at least on one failure of
>> this test) the Vdsm host lost connectivity to its storage, and the
>> Vdsm process was restarted. However, this does not seem to happen in
>> all cases where this test fails.
>>
>> _______________________________________________
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel