[
https://issues.apache.org/jira/browse/FELIX-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792465#comment-15792465
]
Pierre De Rop commented on FELIX-5471:
--------------------------------------
Hello Jeroen;
The timeout I introduced last friday was because I wanted to avoid a corner
case where lost dependencies could be stopped before calling "removed"
callbacks (I will explain this below).
But to be honest, I should not have added this timeout because I realize that
it may generate some deadlocks. I have added the
FELIX5471_CyclicDependencyTest.java which demonstrates the deadlock:
So, running this test with the timeout guard in the ComponentImpl.schedule
method (ComponentImpl.java, in revision 1776641) produces a deadlock, and after
30 seconds, we see this message log:
<code>
WARN - pool-3-thread-2 : [pool-3-thread-2] task could not be scheduled timely
in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$A
(exception: java.util.concurrent.TimeoutException)
WARN - pool-3-thread-1 : [pool-3-thread-1] task could not be scheduled timely
in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$B
(exception: java.util.concurrent.TimeoutException)
</code>
So, for now, I have committed a new patch (in revision 1776903.) which includes
the FELIX5471_CyclicDependencyTest.java test. The patch does not use the
timeout in the ComponentImpl.schedule method anymore.
Now, actually, the real patch which (partially) resolves the issue you found is
that now, when we are
handling a REMOVED service event, then if a threadpool is used we try to not
reschedule the REMOVED event it it and we try to call the handleRemoved method
synchronously. Initially, the problem was that if a threadpool was used, then
the REMOVED event was always scheduled in it, and the event was always handled
asynchronously.
So, let me try to explain deeper what was going on using the scenario you gave
in the users mailing list (that is: M depends on X, and M.remove(X) is called
but at this point, X is already stopped):
1) So, X is being unregistered from the OSGi service registry. That is: the
ComponentImpl for X is currently invoking m_registration.unregister() in the
ComponentImpl.unregisterService() method.
2) Then, the framework callbacks the tracker of the M component, which ends up
calling the ComponentImpl.handleEvent() method with the event for X being
unregistered.
At this point, we are still in the call stack of the unregisterService for X
(see step 1).
But the bug is that if we are using a threadpool, then the
Component.handleRemoved() method was always executed asynchronously in the
threadpool.
3) So, while the Component.handleRemoved method (for M) is being scheduled in
the threadpool, then we return from the handleEvent method and then
the ComponentImpl.unregisterService() method returns (from step 1). So, then,
at this point the thread from (1) ends up calling the X.stop() method, but
at a point where M.unbind(X) method has not been yet been scheduled and
executed in the threadpool. So, in this case, X.stop() is called , and then,
M.remove(X) is
called.
So, now, the Component.handleRemoved() method is not rescheduled in the
threadpool (if possible). see the schedule method which invokes the
DispatchExecutor.execute method:
{code}
// Try to execute the task from the current thread if the threadpool is not
currently running our queue.
((DispatchExecutor) exec).execute(task, false);
{code}
So, the patch will try to not schedule the Component.handleRemoved() in the
threadpool, but instead will invoke the method synchronously.
Unfortunately, the patch is not perfect, because the DispatchExecutor won't
invoke the task synchronously if the queue for the Component is currently
executing another task (like a concurrent service ADD event). This may happen
in a highly concurrent situations, and for now, I don't see an easy solution in
order to make sure M.remove(X) is called before X.stop().
For example, if during the unregistration of the X service, then at the same
time another thread is registering a Z service on which M depends on, then it
is possible that
at the time the ComponentImpl for M is called in handleEvent for the REMOVED X
event , then the threadpool may possibly be currently handling the ADDED Z
event.
In this case, since the state machine is single threaded, it won't be possible
to handle the REMOVED X event synchronously, and the Component.handleRemoved()
method call will be
scheduled in the threadpool, which will be executed after the handling of the Z
ADDED event, but in this case, X.stop() will possibly be called before
M.remove(X) is invoked.
Please notice that this can also happen even if no threadpool is used (antoher
master thread me be running the queue while we are handling the REMOVED X
event).
So, this is for this reason that I also introduced last friday the timeout
guard in the schedule method (if the synchronous flag is true).
However, it does not work because it may introduce a nasty deadlock which is
reproduced in the FELIX5471_CyclicDependencyTest.java test.
So, for now, I have simply removed the usage of the timeout guard.
it means that for most situations (whether or not you are using a threadpool),
the ordering will be guaranteed.
But if there is a highly concurrent situation where some services are
unregistered while others are added from multiple distinct threads, then the
ordering can't be guaranteed in all cases.
Always ensuring the ordering guarantee seems that we would have to rework the
state machine thread model with complex internal locking policy, and I'm not
sure it's worth doing this.
Anyway, I will try to contact Marcel and see what he is thinking and will get
back to this issue in case some more rework has to really be done.
Now, I would also be interested to have your opinion ? do you think what is
committed is reasonable ?
thanks a lot Jeroen;
/Pierre
> Ensure that unbound services are always handled synchronously
> -------------------------------------------------------------
>
> Key: FELIX-5471
> URL: https://issues.apache.org/jira/browse/FELIX-5471
> Project: Felix
> Issue Type: Bug
> Components: Dependency Manager
> Affects Versions: org.apache.felix.dependencymanager-r1
> Reporter: Pierre De Rop
> Assignee: Pierre De Rop
> Fix For: org.apache.felix.dependencymanager-r9
>
>
> When a component loses a service dependency, it should handle the lost
> service synchronously. For example, if service A loses a dependency on B
> (because B is being unregistered), then A.remove(B) should be called
> synchronously (when B is being unregistered from the service registry), else
> the A.remove(B) callback could possibly be invoked while B is already
> unregistered and stopped.
> Currently, unbound services may be handled asynchronously if DM is used in a
> concurrent mode (using a threadpool). And even if no threadpool is used, the
> issue may happen if there is a highly concurrent situation where services are
> registered/removed concurrently from multiple threads.
> So, a patch should be done in order to ensure that a service dependency
> remove event is always handled synchronously (especially if DM is used with a
> threadpool).
> I will provide a testcase soon.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)