Working on a patch, will post a fix.

Thanks

Ravi

On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkap...@redhat.com> wrote:

> Hi all,
>
> Looking at the log, it seems that the new GetCapabilitiesAsync is
> responsible for the mess.
>
> *- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>
>
>
> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host
> (unsuccessfully).*
>
>      * before each "getCapabilitiesAsync" the monitoring lock is taken
> ('VdsManager.refreshImpl').
>
>      * "getCapabilitiesAsync" immediately fails and throws
> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The
> exception is caught by
> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls the
> callback's 'onFailure' and re-throws the exception:
>
>          catch (Throwable t) {
>             getParameters().getCallback().onFailure(t);
>             throw t;
>          }
>
>     * The 'onFailure' of the callback releases the "monitoringLock" 
> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
> lockManager.releaseLock(monitoringLock);')
>
>     * 'VdsManager.refreshImpl' catches the network exception, marks
> 'releaseLock = true' and *tries to release the already-released lock* (see
> the sketch below).
>
>       The following warning is printed to the log -
>
>       WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
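>
> For illustration only, below is a minimal, self-contained sketch of this
> double-release path, with made-up class names (the real VdsManager /
> GetCapabilitiesAsyncVDSCommand / InMemoryLockManager code is of course more
> involved):
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // Simplified stand-in for InMemoryLockManager: releasing a key that is
>     // not held only logs a warning - exactly what we see in the engine log.
>     class LockManagerSketch {
>         private final Set<String> locks = new HashSet<>();
>         synchronized void acquire(String key) { locks.add(key); }
>         synchronized void release(String key) {
>             if (!locks.remove(key)) {
>                 System.out.println("WARN Trying to release exclusive lock "
>                         + "which does not exist, lock key: " + key);
>             }
>         }
>     }
>
>     class MonitoringFlowSketch {
>         private final LockManagerSketch lockManager = new LockManagerSketch();
>         private final String monitoringLock = "VDS_INIT";
>
>         // 'VdsManager.refreshImpl' (simplified): takes the lock, calls the
>         // async command and releases the lock itself when the command throws.
>         void refreshImpl() {
>             lockManager.acquire(monitoringLock);
>             try {
>                 getCapabilitiesAsync();
>             } catch (RuntimeException e) {
>                 // second release attempt - the callback already released it
>                 lockManager.release(monitoringLock);
>             }
>         }
>
>         // 'executeVdsBrokerCommand' (simplified): the connection is refused,
>         // 'onFailure' runs and the exception is re-thrown.
>         void getCapabilitiesAsync() {
>             try {
>                 throw new RuntimeException(
>                         "java.net.ConnectException: Connection refused");
>             } catch (RuntimeException t) {
>                 onFailure(t);   // first release happens here
>                 throw t;
>             }
>         }
>
>         // Callback.onFailure -> postProcessRefresh() -> afterRefreshTreatment():
>         // releases the monitoring lock because the refresh did not succeed.
>         void onFailure(Throwable t) {
>             lockManager.release(monitoringLock);
>         }
>
>         public static void main(String[] args) {
>             // prints the "does not exist" warning exactly once
>             new MonitoringFlowSketch().refreshImpl();
>         }
>     }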
>
>
>
>
> *- 08:30:51 - a successful getCapabilitiesAsync is sent.*
>
>
> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>
>     * SetupNetworks takes the monitoring lock.
>
> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4
> minutes ago from its queue and prints 'VDSNetworkException: Vds timeout
> occured'.*
>
>       * When the first request is removed from the queue
> ('ResponseTracker.remove()'), *'Callback.onFailure' is invoked (for the
> second time) -> the monitoring lock is released (the lock taken by
> SetupNetworks!)* - see the sketch below.
>
>       * *The other requests removed from the queue also try to release the 
> monitoring lock*, but there is nothing to release.
>
>       * The following warning log is printed -
>         WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
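>
> Again a sketch only, with hypothetical names, showing how the delayed cleanup
> re-fires the same 'onFailure' callbacks and thereby releases whatever
> currently holds the lock key - here, the lock SetupNetworks just took:
>
>     import java.util.ArrayDeque;
>     import java.util.Queue;
>
>     // Simplified stand-in for the ResponseTracker cleanup: every stale
>     // request that never got a response fires its onFailure callback again,
>     // even though the same callback already ran when the connection was
>     // refused at 08:29-08:30.
>     class ResponseTrackerSketch {
>         private final Queue<Runnable> staleOnFailureCallbacks = new ArrayDeque<>();
>
>         void track(Runnable onFailure) {
>             staleOnFailureCallbacks.add(onFailure);
>         }
>
>         // Runs minutes later ("Vds timeout occured"): by now SetupNetworks
>         // holds the monitoring lock, so the first callback releases *its*
>         // lock and the remaining ones just log the "does not exist" warning.
>         void expireStaleRequests() {
>             Runnable callback;
>             while ((callback = staleOnFailureCallbacks.poll()) != null) {
>                 callback.run();
>             }
>         }
>
>         public static void main(String[] args) {
>             ResponseTrackerSketch tracker = new ResponseTrackerSketch();
>             for (int i = 0; i < 3; i++) {   // the requests that failed earlier
>                 tracker.track(() -> System.out.println(
>                         "onFailure again -> releaseLock(monitoringLock)"));
>             }
>             tracker.expireStaleRequests();  // 08:33:00
>         }
>     }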
>
> *- 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started.*
> Why? I'm not 100% sure, but I guess the root cause is the late processing of
> the 'getCapabilitiesAsync' failure, which causes the loss of the monitoring
> lock, together with the late and multiple processing of that failure.
>
>
> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three
> attempts to release the lock. Please share your opinion regarding how it
> should be fixed.
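>
> Just to illustrate one possible direction (a sketch with hypothetical names
> only - the actual fix may look completely different): a single-shot guard
> would let 'onFailure' run at most once, no matter how many code paths invoke
> it.
>
>     import java.util.concurrent.atomic.AtomicBoolean;
>
>     // Hypothetical wrapper: the failure handling (e.g. releasing the
>     // monitoring lock) runs only for the first caller; later invocations
>     // from the broker command, the ResponseTracker, etc. become no-ops.
>     class OnceOnlyFailureCallback {
>         private final AtomicBoolean alreadyHandled = new AtomicBoolean(false);
>         private final Runnable delegate;
>
>         OnceOnlyFailureCallback(Runnable delegate) {
>             this.delegate = delegate;
>         }
>
>         void onFailure(Throwable t) {
>             if (alreadyHandled.compareAndSet(false, true)) {
>                 delegate.run();
>             }
>         }
>     }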
>
>
> Thanks,
>
> Alona.
>
>
>
>
>
>
> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <dan...@redhat.com> wrote:
>
>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <eh...@redhat.com> wrote:
>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <ee...@redhat.com> wrote:
>>>
>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>> Is it still failing?
>>>>
>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkor...@redhat.com>
>>>> wrote:
>>>>
>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <dan...@redhat.com> wrote:
>>>>> > No, I am afraid that we have not managed to understand why setting an
>>>>> > ipv6 address took the host off the grid. We shall continue researching
>>>>> > this next week.
>>>>> >
>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>>>> > could it possibly be related (I really doubt that)?
>>>>> >
>>>>>
>>>>
>>> Sorry, but I do not see how this problem is related to VDSM.
>>> There is nothing that indicates that there is a VDSM problem.
>>>
>>> Has the RPC connection between Engine and VDSM failed?
>>>
>>>
>> Further up the thread, Piotr noticed that (at least on one failure of
>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
>> process was restarted. However, this does not seem to happen in all cases
>> where this test fails.
>>
>
>
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel
