Okay, it looks like there are several things that we need to look at with the SDM code then(*). I think we need to look carefully at the mutations of the data that are occurring and that River-324 tries to help manage to see if we can get our heads around it and hopefully make sure that we believe that River-324's problem of stale or wrong cache contents can be managed more readily.

Gregg Wonderly


* I had a hard time getting SDM to work reliably for me, and in particular, it initially did not perform unicast lookups, but waited for multi-cast announcements only and this made it non-functional on non-local network segments.

Patricia Shanahan wrote:
addProxyReg is called from the LookupCacheImpl constructor and from the discover method of a DiscoveryListener attached to a LookupDiscoveryManager. I don't see any specific rules about thread choice for the discover call, but as far as I can tell the lookup discovery manager does not know anything about the service discovery manager's cache, and so is unlikely to use its TaskManager.

A LookupTask is created in the run method of a RegisterListenerTask, which is the class of task that addProxyReg adds to the cacheTaskMgr TaskManager. RegisterListenerTask inherits the CacheTask runAfter method that just returns false, so there is nothing to stop it from running in parallel with any older task.

Looks like two or more different threads to me. I think a simpler explanation for the lack of problem reports lies in two effects:

1. Almost all the time, addProxyReg will win the race. It has a significant running start. Immediately after the end of its synchronized block, it drops straight into calling the TaskManager add method. Meanwhile, another thread that was at the start of its synchronized block would have to go through creating a new Task object before attempting the add call. For addProxyReg to lose the race for the add method's TaskManager synchronization would require a cache miss or an interrupt in a window a few instructions long.

2. The symptom, if any, would be cache confusion that would be very hard to distinguish from a variation on https://issues.apache.org/jira/browse/RIVER-324. If there were any observations of this problem before RIVER-324 was fixed, they would have been conflated with it and presumed fixed.

I am a strong believer in closing even the tiniest timing windows, because they can collectively lead to general flakiness even if there are no reproducible bug reports.

Patricia





On 8/5/2010 6:13 AM, Gregg Wonderly wrote:
I haven't looked yet, but a quick thought I had was, are the other
dependent instances (higher sequence numbers) actually created on a
separate thread, so that ordering could be compromised, or are they just
created in later in the same threads execution path?

Gregg Wonderly

Peter Firmstone wrote:
Thanks Patricia, sharp eyes!

Cheers,

Peter.

Patricia Shanahan wrote:
...
ServiceDiscoveryManager.LookupCacheImpl.taskSeqN is used to
initialize sequence number fields in some CacheTask subclasses that
are then used in runAfter methods.

ServiceDiscoveryManager.LookupCacheImpl.addProxyReg creates a task
with incremented taskSeqN inside a serviceIdMap synchronized block,
but adds it to the cacheTaskMgr outside the block.

public void addProxyReg(ProxyReg reg) {
RegisterListenerTask treg;
synchronized(serviceIdMap) {
treg = new RegisterListenerTask(reg, taskSeqN++);
}//end sync(serviceIdMap)
cacheTaskMgr.add(treg);
}//end LookupCacheImpl.addProxyReg

In the remaining sequence numbered CacheTask subclass cases, the
cacheTaskMgr.add call is done inside the same
synchronized(serviceIdMap) block as the taskSeqN increment.

There is a window in addProxyReg during which a LookupTask or
NotifyEventTask with a higher sequence number could be added before
the RegisterListenerTask task addProxyReg is adding. The LookupTask
and NotifyEventTask runAfter methods both wait for any
RegisterEventTask with a lower sequence number.

I believe this is a bug, but it will be hard to construct a test case
because it involves such a narrow timing window. I do recommend
moving the cacheTaskMgr.add(treg) statement inside the synchronized
block.


Reply via email to