I agree. I probably need to understand both SDM and JoinManager, which
has its own concurrency issues especially in the area of retries, before
I can nail the requirements for TaskManager. I'll carry on studying SDM
for now.
Patricia
On 8/6/2010 7:47 AM, Gregg Wonderly wrote:
Okay, it looks like there are several things that we need to look at
with the SDM code then(*). I think we need to look carefully at the
mutations of the data that are occurring and that River-324 tries to
help manage to see if we can get our heads around it and hopefully make
sure that we believe that River-324's problem of stale or wrong cache
contents can be managed more readily.
Gregg Wonderly
* I had a hard time getting SDM to work reliably for me, and in
particular, it initially did not perform unicast lookups, but waited for
multi-cast announcements only and this made it non-functional on
non-local network segments.
Patricia Shanahan wrote:
addProxyReg is called from the LookupCacheImpl constructor and from
the discover method of a DiscoveryListener attached to a
LookupDiscoveryManager. I don't see any specific rules about thread
choice for the discover call, but as far as I can tell the lookup
discovery manager does not know anything about the service discovery
manager's cache, and so is unlikely to use its TaskManager.
A LookupTask is created in the run method of a RegisterListenerTask,
which is the class of task that addProxyReg adds to the cacheTaskMgr
TaskManager. RegisterListenerTask inherits the CacheTask runAfter
method that just returns false, so there is nothing to stop it from
running in parallel with any older task.
Looks like two or more different threads to me. I think a simpler
explanation for the lack of problem reports lies in two effects:
1. Almost all the time, addProxyReg will win the race. It has a
significant running start. Immediately after the end of its
synchronized block, it drops straight into calling the TaskManager add
method. Meanwhile, another thread that was at the start of its
synchronized block would have to go through creating a new Task object
before attempting the add call. For addProxyReg to lose the race for
the add method's TaskManager synchronization would require a cache
miss or an interrupt in a window a few instructions long.
2. The symptom, if any, would be cache confusion that would be very
hard to distinguish from a variation on
https://issues.apache.org/jira/browse/RIVER-324. If there were any
observations of this problem before RIVER-324 was fixed, they would
have been conflated with it and presumed fixed.
I am a strong believer in closing even the tiniest timing windows,
because they can collectively lead to general flakiness even if there
are no reproducible bug reports.
Patricia
On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
I haven't looked yet, but a quick thought I had was, are the other
dependent instances (higher sequence numbers) actually created on a
separate thread, so that ordering could be compromised, or are they just
created in later in the same threads execution path?
Gregg Wonderly
Peter Firmstone wrote:
Thanks Patricia, sharp eyes!
Cheers,
Peter.
Patricia Shanahan wrote:
...
ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
initialize sequence number fields in some CacheTask subclasses that
are then used in runAfter methods.
ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
with incremented taskSeqN inside a serviceIdMap synchronized block,
but adds it to the cacheTaskMgr outside the block.
public void addProxyReg(ProxyReg reg) {
RegisterListenerTask treg;
synchronized(serviceIdMap) {
treg = new RegisterListenerTask(reg, taskSeqN++);
}//end sync(serviceIdMap)
cacheTaskMgr.add(treg);
}//end LookupCacheImpl.addProxyReg
In the remaining sequence numbered CacheTask subclass cases, the
cacheTaskMgr.add call is done inside the same
synchronized(serviceIdMap) block as the taskSeqN increment.
There is a window in addProxyReg during which a LookupTask or
NotifyEventTask with a higher sequence number could be added before
the RegisterListenerTask task addProxyReg is adding. The LookupTask
and NotifyEventTask runAfter methods both wait for any
RegisterEventTask with a lower sequence number.
I believe this is a bug, but it will be hard to construct a test case
because it involves such a narrow timing window. I do recommend
moving the cacheTaskMgr.add(treg) statement inside the synchronized
block.