[ 
https://issues.apache.org/jira/browse/FELIX-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792465#comment-15792465
 ] 

Pierre De Rop commented on FELIX-5471:
--------------------------------------

Hello Jeroen;

The timeout I introduced last friday was because I wanted to avoid a corner 
case where lost dependencies could be stopped before calling "removed" 
callbacks (I will explain this below).
But to be honest, I should not have added this timeout because I realize that 
it may generate some deadlocks. I have added the 
FELIX5471_CyclicDependencyTest.java which demonstrates the deadlock:
So, running this test with the timeout guard in the ComponentImpl.schedule 
method (ComponentImpl.java, in revision 1776641) produces a deadlock, and after 
30 seconds, we see this message log:

<code>
WARN - pool-3-thread-2 : [pool-3-thread-2] task could not be scheduled timely 
in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$A 
(exception: java.util.concurrent.TimeoutException)
WARN - pool-3-thread-1 : [pool-3-thread-1] task could not be scheduled timely 
in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$B 
(exception: java.util.concurrent.TimeoutException)
</code>

So, for now, I have committed a new patch (in revision 1776903.) which includes 
the FELIX5471_CyclicDependencyTest.java test. The patch does not use the 
timeout in the ComponentImpl.schedule method anymore.

Now, actually, the real patch which (partially) resolves the issue you found is 
that now, when we are
handling a REMOVED service event, then if a threadpool is used we try to not 
reschedule the REMOVED event it it and we try to call the handleRemoved method 
synchronously. Initially, the problem was that if a threadpool was used, then 
the REMOVED event was always scheduled in it, and the event was always handled 
asynchronously.

So, let me try to explain deeper what was going on using the scenario you gave 
in the users mailing list (that is: M depends on X, and M.remove(X) is called 
but at this point, X is already stopped):

1) So, X is being unregistered from the OSGi service registry. That is: the 
ComponentImpl for X is currently invoking m_registration.unregister() in the 
ComponentImpl.unregisterService() method.

2) Then, the framework callbacks the tracker of the M component, which ends up 
calling the ComponentImpl.handleEvent() method with the event for X being 
unregistered.
At this point, we are still in the call stack of the unregisterService for X 
(see step 1).
But the bug is that if we are using a threadpool, then the 
Component.handleRemoved() method was always executed asynchronously in the 
threadpool.

3) So, while the Component.handleRemoved method (for M) is being scheduled in 
the threadpool, then we return from the handleEvent method and then
the ComponentImpl.unregisterService() method returns (from step 1). So, then, 
at this point the thread from (1) ends up calling the X.stop() method, but
at a point where M.unbind(X) method has not been yet been scheduled and 
executed in the threadpool. So, in this case, X.stop() is called , and then, 
M.remove(X) is
called.

So, now, the Component.handleRemoved() method is not rescheduled in the 
threadpool (if possible). see the schedule method which invokes the 
DispatchExecutor.execute method:

{code}
    // Try to execute the task from the current thread if the threadpool is not 
currently running our queue.
    ((DispatchExecutor) exec).execute(task, false);
{code}

So, the patch will try to not schedule the Component.handleRemoved()  in the 
threadpool, but instead will invoke the method synchronously.

Unfortunately, the patch is not perfect, because the DispatchExecutor won't 
invoke the task synchronously if the queue for the Component is currently 
executing another task (like a concurrent service ADD event). This may happen 
in a highly concurrent situations, and for now, I don't see an easy solution in 
order to make sure M.remove(X) is called before X.stop().

For example, if during the unregistration of the X service, then at the same 
time another thread is registering a Z service on which M depends on, then it 
is possible that
at the time the ComponentImpl for M is called in handleEvent for the REMOVED X 
event , then the threadpool may possibly be currently handling the ADDED Z 
event.
In this case, since the state machine is single threaded, it won't be possible 
to handle the REMOVED X event synchronously, and the Component.handleRemoved() 
method call will be
scheduled in the threadpool, which will be executed after the handling of the Z 
ADDED event, but in this case, X.stop() will possibly be called before 
M.remove(X) is invoked.

Please notice that this can also happen even if no threadpool is used (antoher 
master thread me be running the queue while we are handling the REMOVED X 
event).

So, this is for this reason that I also introduced last friday the timeout 
guard in the schedule method (if the synchronous flag is true).
However, it does not work because it may introduce a nasty deadlock which is 
reproduced in the FELIX5471_CyclicDependencyTest.java test.

So, for now, I have simply removed the usage of the timeout guard.

it means that for most situations (whether or not you are using a threadpool), 
the ordering will be guaranteed.
But if there is a highly concurrent situation where some services are 
unregistered while others are added from multiple distinct threads, then the 
ordering can't be guaranteed in all cases.
Always ensuring the ordering guarantee seems that we would have to rework the 
state machine thread model with complex internal locking policy, and I'm not 
sure it's worth doing this.
Anyway, I will try to contact Marcel and see what he is thinking and will get 
back to this issue in case some more rework has to really be done.

Now,  I would also be interested to have your opinion ? do you think what is 
committed is reasonable ?

thanks a lot Jeroen;
/Pierre


> Ensure that unbound services are always handled synchronously
> -------------------------------------------------------------
>
>                 Key: FELIX-5471
>                 URL: https://issues.apache.org/jira/browse/FELIX-5471
>             Project: Felix
>          Issue Type: Bug
>          Components: Dependency Manager
>    Affects Versions: org.apache.felix.dependencymanager-r1
>            Reporter: Pierre De Rop
>            Assignee: Pierre De Rop
>             Fix For: org.apache.felix.dependencymanager-r9
>
>
> When a component loses a service dependency, it should handle the lost 
> service synchronously. For example, if service A loses a dependency on B 
> (because B is being unregistered),  then A.remove(B) should be called 
> synchronously (when B is being unregistered from the service registry), else 
> the A.remove(B) callback could possibly be invoked while B is already 
> unregistered and stopped.
> Currently, unbound services may be handled asynchronously if DM is used in a 
> concurrent mode (using a threadpool). And even if no threadpool is used, the 
> issue may happen if there is a highly concurrent situation where services are 
> registered/removed concurrently from multiple threads.
> So, a patch should be done in order to ensure that a service dependency 
> remove event is always handled synchronously (especially if DM is used with a 
> threadpool).
> I will provide a testcase soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to