With TaskManager.Task.runAfter, throughput wasn't high enough for this race to occur.

If I make the ExecutorService single threaded, the error doesn't occur, because the tasks are executed in correct dependency order. However, when the ExecutorService has a lot of threads ready, the tasks can't be arranged in a queue, as they're immediately handed off to a waiting thread.
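
To illustrate the ordering difference, here is a minimal sketch using only java.util.concurrent. The class and method names (SubmissionOrderDemo, runOrder) are made up for this example and are not part of River. It also assumes that dependency order coincides with submission order, which is only a stand-in for what runAfter expresses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of River.
class SubmissionOrderDemo {

    /** Submits n trivial tasks and returns the order in which they actually ran. */
    static List<Integer> runOrder(ExecutorService executor, int n) throws InterruptedException {
        List<Integer> order = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            final int id = i;
            executor.execute(() -> order.add(id));
        }
        executor.shutdown();
        if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return new ArrayList<>(order);
    }

    public static void main(String[] args) throws InterruptedException {
        // One worker thread: tasks wait in a FIFO queue and run in submission order.
        System.out.println("single: " + runOrder(Executors.newSingleThreadExecutor(), 8));
        // Many idle workers: each task is handed straight to a thread,
        // so the completion order is not guaranteed.
        System.out.println("pooled: " + runOrder(Executors.newCachedThreadPool(), 8));
    }
}
```

The single-threaded run always reports 0..7 in order. The pooled run often does too for tasks this small, but nothing guarantees it.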

TaskManager added sufficient latency to prevent this race from occurring.

I think the solution to this race condition (note that all accesses are synchronized) is to make the operations atomic and separate out the event notifications.
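
One possible reading of that idea, as a rough sketch rather than the actual SDM code: each check-then-act on the map becomes a single atomic call, and the listener callback runs only after the map operation has completed, driven by its result. All names here (AtomicServiceIdMapSketch, Listener, addIfNew, removeIfUnchanged) are hypothetical, and String ids stand in for ServiceID. The sketch doesn't address the stale snapshot itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch, not the ServiceDiscoveryManager implementation.
class AtomicServiceIdMapSketch {

    /** Hypothetical listener; stands in for the cache's real listener type. */
    interface Listener {
        void serviceAdded(String serviceId);
        void serviceRemoved(String serviceId);
    }

    private final ConcurrentMap<String, Object> serviceIdMap = new ConcurrentHashMap<>();
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    void addListener(Listener l) {
        listeners.add(l);
    }

    /** Atomic check-and-insert; notifies only if this call added the service. */
    boolean addIfNew(String serviceId, Object proxy) {
        boolean added = serviceIdMap.putIfAbsent(serviceId, proxy) == null;
        if (added) {
            // Notification happens after the map update, outside any lock.
            for (Listener l : listeners) l.serviceAdded(serviceId);
        }
        return added;
    }

    /** Atomic conditional remove; succeeds only if the mapping is still the one the caller saw. */
    boolean removeIfUnchanged(String serviceId, Object expectedProxy) {
        boolean removed = serviceIdMap.remove(serviceId, expectedProxy);
        if (removed) {
            for (Listener l : listeners) l.serviceRemoved(serviceId);
        }
        return removed;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        AtomicServiceIdMapSketch cache = new AtomicServiceIdMapSketch();
        cache.addListener(new Listener() {
            public void serviceAdded(String id) { events.add("added:" + id); }
            public void serviceRemoved(String id) { events.add("removed:" + id); }
        });
        Object proxy = new Object();
        cache.addIfNew("s3", proxy);                 // new service: one serviceAdded
        cache.addIfNew("s3", proxy);                 // duplicate: no event
        cache.removeIfUnchanged("s3", new Object()); // stale view: nothing removed, no event
        cache.removeIfUnchanged("s3", proxy);        // current view: one serviceRemoved
        System.out.println(events);                  // prints [added:s3, removed:s3]
    }
}
```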

Thoughts?

Peter.

com/sun/jini/test/impl/servicediscovery/event/LookupTaskServiceIdMapRace.td

/**
 * This test attempts to simulate the following race condition that
 * can occur between an instance of UnmapProxyTask (created and queued
 * in LookupTask) and instances of NewOldServiceTask that are created
 * and queued by NotifyEventTask:
 *
 * - 1 LUS {L0}
 * - N (~250) services {s0, s1, ..., sN-1}, to be registered in L0
 * - M (~24) SDMs, each with 1 cache with template matching all si's
 *   {SDM_0/C0, SDM_1/C1, ... SDM_M-1/CM-1}
 *
 * Through the sheer number of service registrations, caches, and events,
 * this test attempts to produce the conditions that cause the regular
 * occurrence of the race between an instance of UnmapProxyTask and
 * instances of NewOldServiceTask produced by NotifyEventTask when a
 * service event is received from L0.
 *
 * This test starts lookup L0 during construct. Then, when the test begins
 * running, half the services are registered with L0, followed by the
 * creation of half the SDMs and corresponding caches; which causes the
 * tasks being tested to be queued, and event generation to ultimately
 * begin. After registering the first half of the services and creating
 * the first half of the SDMs, the remaining services are registered and
 * the remaining SDMs and caches are created. As events are generated,
 * the number of serviceAdded and serviceRemoved events are tallied.
 *
 * When an SDM_i/cache_i pair is created, an instance of RegisterListenerTask
 * is queued and executed. RegisterListenerTask registers a remote event
 * listener with L0's event mechanism. When the services are registered with
 * L0, that listener receives service events; which causes NotifyEventTask
 * to be queued and executed. After RegisterListenerTask registers for events
 * with L0, but before RegisterListenerTask exits, an instance of LookupTask
 * is queued and executed. LookupTask retrieves from L0 a "snapshot" of its
 * state. Thus, while events begin to arrive informing each cache of the
 * services that are registering with L0, LookupTask is querying L0 for
 * its current state.
 *
 * Upon receipt of a service event, NotifyEventTask queues a NewOldServiceTask
 * to determine if the service corresponding to the event represents a new
 * service that has been added, a change to a previously-registered service,
 * or the removal of a service from L0. If the event corresponds to a newly
 * registered service, the service is added to the cache's serviceIdMap and
 * a serviceAdded event is sent to any listeners registered with the cache.
 * That is,
 *
 * Service event received
 *
 *   NotifyEventTask {
 *     if (service removed) {
 *       remove service from serviceIdMap
 *       send serviceRemoved
 *     } else {
 *       NewOldServiceTask
 *         if (service changed) {
 *           send serviceChanged
 *         } else if (service is new) {
 *           add service to serviceIdMap
 *           send serviceAdded
 *         }
 *     }
 *   }
 *
 * While events are being received and processed by NotifyEventTask and
 * NewOldServiceTask, LookupTask is asynchronously requesting a snapshot
 * of L0's state and attempting to process that snapshot to populate
 * the same serviceIdMap that is being populated by instances of
 * NewOldServiceTask that are initiated by NotifyEventTask. LookupTask
 * first examines serviceIdMap, looking for services that are NOT in the
 * snapshot; that is, services that are not currently registered with L0.
 * Such a service is referred to as an "orphan". For each orphan service
 * that LookupTask finds, an instance of UnmapProxyTask is queued. That task
 * removes the service from the serviceIdMap and sends a serviceRemoved
 * event to any listeners registered with the cache. After processing
 * any orphans that it finds, LookupTask then queues an instance of
 * NewOldServiceTask for each service in the snapshot previously retrieved.
 * That is,
 *
 * LookupTask - retrieve snapshot {
 *
 *   for each service in serviceIdMap {
 *     if (service is not in snapshot) { //orphan
 *       UnmapProxyTask {
 *         remove service from serviceIdMap
 *         send serviceRemoved
 *       }
 *     }
 *   }
 *   for each service in snapshot {
 *       NewOldServiceTask
 *         if (service changed) {
 *           send serviceChanged
 *         } else if (service is new) {
 *           add service to serviceIdMap
 *           send serviceAdded
 *         }
 *   }
 * }
 *
 * The race can occur because the NewOldServiceTasks that are queued by the
 * NotifyEventTasks can add services to the serviceIdMap between the time
 * LookupTask retrieves the snapshot and the time it analyzes the serviceIdMap
 * for orphans. That is,
 *
 *                        o SDM_i/cache_i created
 * RegisterListenerTask
 * --------------------
 *  register for events
 *  LookupTask
 *  ----------
 *   retrieve snapshot {s0,s1,s2}
 *                        o s3 registered with L0
 *                        o L0 sends NO_MATCH_MATCH
 *                                                   NotifyEventTask
 *                                                   ---------------
 *                                                     NewOldServiceTask
 *                                                     -----------------
 *                                                       add s3 to serviceIdMap
 *                                                       send serviceAdded event
 *   ORPHAN: s3 in serviceIdMap, not snapshot
 *   UnmapProxyTask
 *   --------------
 *     remove s3 from serviceIdMap
 *     send serviceRemoved event
 *
 * This test returns a pass when no race is detected between UnmapProxyTask
 * and any NewOldServiceTask initiated by a NotifyEventTask. This is
 * determined by examining the serviceAdded and serviceRemoved event
 * tallies collected during test execution. If, for each SDM/cache
 * combination, the number of serviceAdded events received equals the
 * number of services registered with L0, and no serviceRemoved events
 * are received, then there is no race, and the test passes; otherwise,
 * the test fails (in particular, if at least one serviceRemoved event
 * is sent by at least one SDM/cache).
 *
 * No special modifications to the SDM are required to cause the race
 * condition to occur consistently. When running this test individually
 * on Solaris, out of "the vob", under a JERI or JRMP configuration, and
 * with 24 SDMs/caches and 250 services, the race condition was consistently
 * observed (until a fix was integrated). Thus, it appears that the greater
 * the number of SDMs/caches/services, the greater the probability the
 * conditions for the race will be encountered.
 *
 * Related bug ids: 6291851
 */
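
The interleaving and the pass criterion described in the comment above can be replayed deterministically on a single thread. This is only an illustrative sketch: OrphanRaceReplay and its methods are hypothetical, plain sets of String ids stand in for the serviceIdMap and the snapshot, and no real tasks, SDMs, or lookup services are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical single-threaded replay of the interleaving, not the real test.
class OrphanRaceReplay {

    /** Replays the interleaving; returns {serviceAdded count, serviceRemoved count}. */
    static int[] replay() {
        Set<String> serviceIdMap = new HashSet<>();
        int added = 0;
        int removed = 0;

        // LookupTask: retrieve snapshot {s0, s1, s2} from L0.
        Set<String> snapshot = new HashSet<>(Arrays.asList("s0", "s1", "s2"));

        // s3 registers with L0; NotifyEventTask -> NewOldServiceTask adds it.
        if (serviceIdMap.add("s3")) added++;

        // LookupTask: anything in serviceIdMap but not in the (stale) snapshot
        // looks like an orphan, so UnmapProxyTask removes it.
        for (String id : new ArrayList<>(serviceIdMap)) {
            if (!snapshot.contains(id)) {
                serviceIdMap.remove(id);
                removed++;
            }
        }

        // LookupTask: NewOldServiceTask for each service in the snapshot.
        for (String id : snapshot) {
            if (serviceIdMap.add(id)) added++;
        }
        return new int[] { added, removed };
    }

    /** Pass criterion from the comment: one add per registered service, no removes. */
    static boolean passes(int added, int removed, int registered) {
        return added == registered && removed == 0;
    }

    public static void main(String[] args) {
        int[] tally = replay();
        // Four services (s0..s3) are registered with L0 in this replay.
        System.out.println("added=" + tally[0] + " removed=" + tally[1]
                + " pass=" + passes(tally[0], tally[1], 4));
        // prints added=4 removed=1 pass=false
    }
}
```

The add tally matches the number of registered services, so it is the spurious serviceRemoved for s3 that exposes the race.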
