----- Original message -----
> I don't think it is just a matter of latency. When I last looked closely 
> at TaskManager, it appeared to me that it kept each task in the runAfter 
> checking list until the task returned from its run method. That way, a 
> dependent task was prevented from running until completion of the task 
> on which it depends.

Yes, you're right, it does do that; it was only in the case of RetryTask that
a task was removed from the list after each unsuccessful attempt.
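
To make that behaviour concrete, here's a rough sketch in plain
java.util.concurrent terms (this is not River's TaskManager source, and the
Task interface is simplified): the task stays in the pending list until its
run() method returns, so a dependant's runAfter check holds it back until then.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;

// Sketch only -- not River's TaskManager code.
class DependencyTrackingExecutor {

    interface Task extends Runnable {
        // true if this task must wait for any of the earlier tasks
        boolean runAfter(List<Task> earlierTasks);
    }

    private final List<Task> pending = new CopyOnWriteArrayList<Task>();
    private final ExecutorService exec;

    DependencyTrackingExecutor(ExecutorService exec) {
        this.exec = exec;
    }

    void submit(final Task task) {
        pending.add(task);
        exec.submit(new Runnable() {
            public void run() {
                try {
                    // crude spin; TaskManager does this with its own worker
                    // threads and a synchronized task list
                    while (true) {
                        List<Task> snapshot = new ArrayList<Task>(pending);
                        List<Task> earlier =
                            snapshot.subList(0, snapshot.indexOf(task));
                        if (!task.runAfter(earlier)) {
                            break;
                        }
                        Thread.yield();
                    }
                    task.run();
                } finally {
                    // removed only after run() completes, which is what
                    // prevents dependants from starting too early
                    pending.remove(task);
                }
            }
        });
    }
}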


> 
> I can see two approaches to this. One is to make the TaskManager 
> replacement do something similar, wrapping the client tasks in its own 
> code to keep track of them for dependency purposes until they 
> finish. The other is to dump the whole dependency issue on the client 
> code, which can use anything in the entire range of concurrency control 
> mechanisms, including atomicity, synchronization, semaphores, etc.

Comments made on the list so far indicate that people want configurable 
Executors, which makes sense: it allows deployment-time optimisation, and we 
don't get much feedback or information about deployment environments anyway.
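
As a sketch of what a configurable Executor could look like, something along
these lines could be pulled from a Configuration at deployment time (the
component and entry names below are only illustrative, not an existing SDM
configuration entry):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import net.jini.config.Configuration;
import net.jini.config.ConfigurationException;

class ConfigurableExecutor {
    static ExecutorService fromConfig(Configuration config)
            throws ConfigurationException {
        return (ExecutorService) config.getEntry(
            "net.jini.lookup.ServiceDiscoveryManager", // hypothetical component
            "executorService",                         // hypothetical entry name
            ExecutorService.class,
            Executors.newSingleThreadExecutor());      // conservative default
    }
}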

Having said that though, there's a lot of mutability in SDM's implementation 
and that's difficult to manage.
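
To illustrate what I meant below about making the operations atomic and
separating out the event notifications, here's a rough sketch (the
CacheListener interface is hypothetical and this isn't the actual SDM code; it
doesn't fix the orphan race on its own, it just shows the serviceIdMap update
made atomic, with the listener notification handed off to a separate
dispatcher):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import net.jini.core.lookup.ServiceID;
import net.jini.core.lookup.ServiceItem;

// Sketch only -- not SDM's implementation.
class ServiceIdMapSketch {

    interface CacheListener {                      // hypothetical listener
        void serviceAdded(ServiceItem item);
        void serviceRemoved(ServiceItem item);
    }

    private final ConcurrentMap<ServiceID, ServiceItem> serviceIdMap =
            new ConcurrentHashMap<ServiceID, ServiceItem>();
    private final ExecutorService eventDispatcher =
            Executors.newSingleThreadExecutor();
    private final CacheListener listener;

    ServiceIdMapSketch(CacheListener listener) {
        this.listener = listener;
    }

    // NewOldServiceTask path: add only if genuinely new, then notify.
    void serviceDiscovered(ServiceID id, final ServiceItem item) {
        if (serviceIdMap.putIfAbsent(id, item) == null) {   // atomic add
            eventDispatcher.submit(new Runnable() {
                public void run() { listener.serviceAdded(item); }
            });
        }
    }

    // UnmapProxyTask path: remove only if the map still holds the exact
    // item this task observed, then notify.
    void orphanFound(ServiceID id, final ServiceItem observedItem) {
        if (serviceIdMap.remove(id, observedItem)) {        // atomic remove
            eventDispatcher.submit(new Runnable() {
                public void run() { listener.serviceRemoved(observedItem); }
            });
        }
    }
}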

Cheers,

Peter.

> 
> Patricia
> 
> On 1/14/2014 3:32 AM, Peter Firmstone wrote:
> > With TaskManager.Task.runAfter, throughput wasn't significant enough
> > for this race to occur.
> > 
> > If I make the ExecutorService single threaded, the error doesn't occur
> > as the tasks are executed in correct dependency order, however, when
> > the ExecutorService has a lot of threads ready, the tasks aren't able
> > to be arranged in a queue as they're immediately handed off to a
> > waiting thread.
> > 
> > TaskManager added sufficient latency to prevent this race from
> > occurring.
> > 
> > I think the solution to this race condition (note that all accesses are
> > synchronized) is to make the operations atomic and separate out the
> > event notifications.
> > 
> > Thoughts?
> > 
> > Peter.
> > 
> > com/sun/jini/test/impl/servicediscovery/event/LookupTaskServiceIdMapRace.td
> > 
> > /**
> > * This test attempts to simulate the following race condition that
> > * can occur between an instance of UnmapProxyTask (created and queued
> > * in LookupTask) and instances of NewOldServiceTask that are created
> > * and queued by NotifyEventTask:
> > *
> > * - 1 LUS {L0}
> > * - N (~250) services {s0, s1, ..., sN-1}, to be registered in L0
> > * - M (~24) SDMs, each with 1 cache with template matching all si's
> > *     {SDM_0/C0, SDM_1/C1, ... SDM_M-1/CM-1}
> > *
> > * Through the sheer number of service registrations, caches, and events,
> > * this test attempts to produce the conditions that cause the regular
> > * occurrence of the race between an instance of UnmapProxyTask and
> > * instances of NewOldServiceTask produced by NotifyEventTask when a
> > * service event is received from L0.
> > *
> > * This test starts lookup L0 during construct. Then, when the test begins
> > * running, half the services are registered with L0, followed by the
> > * creation of half the SDMs and corresponding caches; which causes the
> > * tasks being tested to be queued, and event generation to ultimately
> > * begin. After registering the first half of the services and creating
> > * the first half of the SDMs, the remaining services are registered and
> > * the remaining SDMs and caches are created. As events are generated,
> > * the number of serviceAdded and serviceRemoved events are tallied.
> > *
> > * When an SDM_i/cache_i pair is created, an instance of RegisterListenerTask
> > * is queued and executed. RegisterListenerTask registers a remote event
> > * listener with L0's event mechanism. When the services are registered with
> > * L0, that listener receives service events; which causes NotifyEventTask
> > * to be queued and executed. After RegisterListenerTask registers for events
> > * with L0, but before RegisterListenerTask exits, an instance of LookupTask
> > * is queued and executed. LookupTask retrieves from L0 a "snapshot" of its
> > * state. Thus, while events begin to arrive informing each cache of the
> > * services that are registering with L0, LookupTask is querying L0 for
> > * its current state.
> > *
> > * Upon receipt of a service event, NotifyEventTask queues a NewOldServiceTask
> > * to determine if the service corresponding to the event represents a new
> > * service that has been added, a change to a previously-registered service,
> > * or the removal of a service from L0. If the event corresponds to a newly
> > * registered service, the service is added to the cache's serviceIdMap and
> > * a serviceAdded event is sent to any listeners registered with the cache.
> > * That is,
> > *
> > * Service event received
> > *
> > *     NotifyEventTask {
> > *         if (service removed) {
> > *             remove service from serviceIdMap
> > *             send serviceRemoved
> > *         } else {
> > *             NewOldServiceTask
> > *                 if (service changed) {
> > *                     send serviceChanged
> > *                 } else if (service is new) {
> > *                     add service to serviceIdMap
> > *                     send serviceAdded
> > *                 }
> > *         }
> > *     }
> > *
> > * While events are being received and processed by NotifyEventTask and
> > * NewOldServiceTask, LookupTask is asynchronously requesting a snapshot
> > * of L0's state and attempting to process that snapshot to populate
> > * the same serviceIdMap that is being populated by instances of
> > * NewOldServiceTask that are initiated by NotifyEventTask. LookupTask
> > * first examines serviceIdMap, looking for services that are NOT in the
> > * snapshot; that is, services that are not currently registered with L0.
> > * Such a service is referred to as an "orphan". For each orphan service
> > * that LookupTask finds, an instance of UnmapProxyTask is queued. That task
> > * removes the service from the serviceIdMap and sends a serviceRemoved
> > * event to any listeners registered with the cache. After processing
> > * any orphans that it finds, LookupTask then queues an instance of
> > * NewOldServiceTask for each service in the snapshot previously retrieved.
> > * That is,
> > *
> > * LookupTask - retrieve snapshot {
> > *
> > *     for each service in serviceIdMap {
> > *         if (service is not in snapshot) { //orphan
> > *             UnmapProxyTask {
> > *                 remove service from serviceIdMap
> > *                 send serviceRemoved
> > *             }
> > *         }
> > *     }
> > *     for each service in snapshot {
> > *             NewOldServiceTask
> > *                 if (service changed) {
> > *                     send serviceChanged
> > *                 } else if (service is new) {
> > *                     add service to serviceIdMap
> > *                     send serviceAdded
> > *                 }
> > *     }
> > * }
> > *
> > * The race can occur because the NewOldServiceTasks that are queued by the
> > * NotifyEventTasks can add services to the serviceIdMap between the time
> > * LookupTask retrieves the snapshot and the time it analyzes the
> > * serviceIdMap for orphans. That is,
> > *
> > *                                               o SDM_i/cache_i created
> > * RegisterListenerTask
> > * --------------------
> > *   register for events
> > *   LookupTask
> > *   ----------
> > *     retrieve snapshot {s0,s1,s2}
> > *                                               o s3 registered with L0
> > *                                               o L0 sends NO_MATCH_MATCH
> > *                                               NotifyEventTask
> > *                                               ---------------
> > *                                                 NewOldServiceTask
> > *                                                 -----------------
> > *                                                   add s3 to serviceIdMap
> > *                                                   send serviceAdded event
> > *     ORPHAN: s3 in serviceIdMap, not snapshot
> > *     UnmapProxyTask
> > *     --------------
> > *         remove s3 from serviceIdMap
> > *         send serviceRemoved event
> > *
> > * This test returns a pass when no race is detected between UnmapProxyTask
> > * and any NewOldServiceTask initiated by a NotifyEventTask. This is
> > * determined by examining the serviceAdded and serviceRemoved event
> > * tallies collected during test execution. If, for each SDM/cache
> > * combination, the number of serviceAdded events received equals the
> > * number of services registered with L0, and no serviceRemoved events
> > * are received, then there is no race, and the test passes; otherwise,
> > * the test fails (in particular, if at least one serviceRemoved event
> > * is sent by at least one SDM/cache).
> > *
> > * No special modifications to the SDM are required to cause the race
> > * condition to occur consistently. When running this test individually
> > * on Solaris, out of "the vob", under a JERI or JRMP configuration, and
> > * with 24 SDMs/caches and 250 services, the race condition was consistently
> > * observed (until a fix was integrated). Thus, it appears that the greater
> > * the number of SDMs/caches/services, the greater the probability the
> > * conditions for the race will be encountered.
> > *
> > * Related bug ids: 6291851
> > */
> 
