With TaskManager.Task.runAfter, throughput was never high enough for
this race to occur.
If I make the ExecutorService single-threaded, the error doesn't occur,
because the tasks execute in the correct dependency order. When the
ExecutorService has many threads ready, however, the tasks can't be
arranged in a queue: each task is handed off immediately to a waiting
thread. TaskManager added enough latency to prevent this race from
occurring.
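
To illustrate the difference (a minimal, self-contained sketch; t1 and
t2 stand in for a LookupTask and a task that depends on its result):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ExecDemo {
        public static void main(String[] args) {
            Runnable t1 = () -> System.out.println("retrieve snapshot");
            Runnable t2 = () -> System.out.println("process event");

            // Single thread: tasks wait in a queue, so t1 always
            // completes before t2 starts.
            ExecutorService serial = Executors.newSingleThreadExecutor();
            serial.submit(t1);
            serial.submit(t2);
            serial.shutdown();

            // Many idle threads: nothing queues; t1 and t2 are handed
            // off immediately and can interleave arbitrarily.
            ExecutorService pool = Executors.newCachedThreadPool();
            pool.submit(t1);
            pool.submit(t2);
            pool.shutdown();
        }
    }
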
I think the solution to this race condition (note that all accesses are
already synchronized) is to make the operations atomic and to separate
out the event notifications.
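
Concretely, something along these lines (a sketch only; serviceIdMap
and sendServiceAdded are hypothetical stand-ins for the cache's map
and its event dispatch):

    import net.jini.core.lookup.ServiceID;
    import net.jini.core.lookup.ServiceItem;
    import java.util.HashMap;
    import java.util.Map;

    class CacheSketch {
        private final Map<ServiceID, ServiceItem> serviceIdMap =
            new HashMap<ServiceID, ServiceItem>();

        // The check-and-put on serviceIdMap is a single atomic step;
        // the listener notification is deferred until the lock is
        // released, so no event is sent while the map is mid-update.
        void addService(ServiceID id, ServiceItem item) {
            boolean added = false;
            synchronized (serviceIdMap) {
                if (!serviceIdMap.containsKey(id)) {
                    serviceIdMap.put(id, item);
                    added = true;
                }
            }
            if (added) {
                sendServiceAdded(item); // outside the atomic section
            }
        }

        private void sendServiceAdded(ServiceItem item) {
            /* notify cache listeners */
        }
    }
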
Thoughts?
Peter.
com/sun/jini/test/impl/servicediscovery/event/LookupTaskServiceIdMapRace.td
/**
* This test attempts to simulate the following race condition that
* can occur between an instance of UnmapProxyTask (created and queued
* in LookupTask) and instances of NewOldServiceTask that are created
* and queued by NotifyEventTask:
*
* - 1 LUS {L0}
* - N (~250) services {s0, s1, ..., sN-1}, to be registered in L0
* - M (~24) SDMs, each with 1 cache with template matching all si's
* {SDM_0/C0, SDM_1/C1, ... SDM_M-1/CM-1}
*
 * Through the sheer number of service registrations, caches, and events,
* this test attempts to produce the conditions that cause the regular
* occurrence of the race between an instance of UnmapProxyTask and
* instances of NewOldServiceTask produced by NotifyEventTask when a
* service event is received from L0.
*
* This test starts lookup L0 during construct. Then, when the test begins
* running, half the services are registered with L0, followed by the
 * creation of half the SDMs and corresponding caches, which causes the
 * tasks being tested to be queued and event generation to ultimately
* begin. After registering the first half of the services and creating
* the first half of the SDMs, the remaining services are registered and
* the remaining SDMs and caches are created. As events are generated,
* the number of serviceAdded and serviceRemoved events are tallied.
*
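 * Schematically, each SDM_i/cache_i pair is created with the standard
 * ServiceDiscoveryManager API, along these lines (a sketch, not this
 * test's exact code):
 *
 *   ServiceTemplate tmpl = new ServiceTemplate(null, null, null);
 *   ServiceDiscoveryManager sdm =
 *       new ServiceDiscoveryManager(discoveryMgr, leaseRenewalMgr);
 *   LookupCache cache = sdm.createLookupCache(tmpl, null, listener);
 *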
 * When an SDM_i/cache_i pair is created, an instance of RegisterListenerTask
* is queued and executed. RegisterListenerTask registers a remote event
 * listener with L0's event mechanism. When the services are registered with
 * L0, that listener receives service events, which causes NotifyEventTask
 * to be queued and executed. After RegisterListenerTask registers for events
 * with L0, but before RegisterListenerTask exits, an instance of LookupTask
* is queued and executed. LookupTask retrieves from L0 a "snapshot" of its
* state. Thus, while events begin to arrive informing each cache of the
* services that are registering with L0, LookupTask is querying L0 for
* its current state.
*
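 * Schematically, in terms of the public ServiceRegistrar API (a sketch,
 * not the SDM's actual internals; queue() is a stand-in for the task
 * queue):
 *
 *   // RegisterListenerTask, roughly: register for events, then queue
 *   // the snapshot lookup before exiting
 *   EventRegistration reg =
 *       registrar.notify(tmpl, transitions, remoteListener, handback, duration);
 *   queue(new LookupTask(...));
 *
 *   // LookupTask, roughly: retrieve a snapshot of L0's current state
 *   ServiceMatches snapshot = registrar.lookup(tmpl, Integer.MAX_VALUE);
 *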
 * Upon receipt of a service event, NotifyEventTask queues a NewOldServiceTask
 * to determine whether the service corresponding to the event represents a new
 * service that has been added, a change to a previously-registered service,
* or the removal of a service from L0. If the event corresponds to a newly
* registered service, the service is added to the cache's serviceIdMap and
* a serviceAdded event is sent to any listeners registered with the cache.
* That is,
*
* Service event received
*
* NotifyEventTask {
* if (service removed) {
* remove service from serviceIdMap
* send serviceRemoved
* } else {
* NewOldServiceTask
* if (service changed) {
* send serviceChanged
* } else if (service is new) {
* add service to serviceIdMap
* send serviceAdded
* }
* }
* }
*
* While events are being received and processed by NotifyEventTask and
* NewOldServiceTask, LookupTask is asynchronously requesting a snapshot
* of L0's state and attempting to process that snapshot to populate
* the same serviceIdMap that is being populated by instances of
* NewOldServiceTask that are initiated by NotifyEventTask. LookupTask
* first examines serviceIdMap, looking for services that are NOT in the
* snapshot; that is, services that are not currently registered with L0.
 * Such a service is referred to as an "orphan". For each orphan service
 * that LookupTask finds, an instance of UnmapProxyTask is queued. That task
* removes the service from the serviceIdMap and sends a serviceRemoved
* event to any listeners registered with the cache. After processing
* any orphans that it finds, LookupTask then queues an instance of
* NewOldServiceTask for each service in the snapshot previously retrieved.
* That is,
*
* LookupTask - retrieve snapshot {
*
* for each service in serviceIdMap {
* if (service is not in snapshot) { //orphan
* UnmapProxyTask {
* remove service from serviceIdMap
* send serviceRemoved
* }
* }
* }
* for each service in snapshot {
* NewOldServiceTask
* if (service changed) {
* send serviceChanged
* } else if (service is new) {
* add service to serviceIdMap
* send serviceAdded
* }
* }
* }
*
* The race can occur because the NewOldServiceTasks that are queued by the
* NotifyEventTasks can add services to the serviceIdMap between the time
 * LookupTask retrieves the snapshot and the time it analyzes the serviceIdMap
* for orphans. That is,
*
* o SDM_i/cache_i created
* RegisterListenerTask
* --------------------
* register for events
* LookupTask
* ----------
* retrieve snapshot {s0,s1,s2}
* o s3 registered with L0
* o L0 sends NO_MATCH_MATCH
* NotifyEventTask
* ---------------
* NewOldServiceTask
* -----------------
 *                                                 add s3 to serviceIdMap
 *                                                 send serviceAdded event
* ORPHAN: s3 in serviceIdMap, not snapshot
* UnmapProxyTask
* --------------
* remove s3 from serviceIdMap
* send serviceRemoved event
*
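 * In code terms, the race is a classic check-then-act: there is a
 * window between retrieving the snapshot and scanning serviceIdMap for
 * orphans, and per-access synchronization does not close it (a sketch,
 * not the actual implementation; contains() and queue() are stand-ins):
 *
 *   ServiceMatches snapshot = registrar.lookup(tmpl, Integer.MAX_VALUE);
 *   // <-- window: a NewOldServiceTask may add s3 to serviceIdMap here
 *   synchronized (serviceIdMap) {
 *       for (ServiceID id : serviceIdMap.keySet()) {
 *           if (!contains(snapshot, id)) {     // s3 now looks like an orphan
 *               queue(new UnmapProxyTask(id)); // will wrongly remove s3
 *           }
 *       }
 *   }
 *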
* This test returns a pass when no race is detected between UnmapProxyTask
* and any NewOldServiceTask initiated by a NotifyEventTask. This is
* determined by examining the serviceAdded and serviceRemoved event
* tallies collected during test execution. If, for each SDM/cache
* combination, the number of serviceAdded events received equals the
* number of services registered with L0, and no serviceRemoved events
* are received, then there is no race, and the test passes; otherwise,
* the test fails (in particular, if at least one serviceRemoved event
* is sent by at least one SDM/cache).
*
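 * A counting ServiceDiscoveryListener on each cache is enough to gather
 * these tallies, e.g. (a sketch; the field names are hypothetical):
 *
 *   class TallyListener implements ServiceDiscoveryListener {
 *       final AtomicInteger added   = new AtomicInteger();
 *       final AtomicInteger removed = new AtomicInteger();
 *       public void serviceAdded(ServiceDiscoveryEvent e)   { added.incrementAndGet(); }
 *       public void serviceRemoved(ServiceDiscoveryEvent e) { removed.incrementAndGet(); }
 *       public void serviceChanged(ServiceDiscoveryEvent e) { }
 *   }
 *
 *   // pass iff, for every cache, added == N and removed == 0
 *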
* No special modifications to the SDM are required to cause the race
* condition to occur consistently. When running this test individually
* on Solaris, out of "the vob", under a JERI or JRMP configuration, and
 * with 24 SDMs/caches and 250 services, the race condition was consistently
* observed (until a fix was integrated). Thus, it appears that the greater
* the number of SDMs/caches/services, the greater the probability the
* conditions for the race will be encountered.
*
* Related bug ids: 6291851
*/