> On Dec 26, 2018, at 4:11 PM, Taylor R Campbell 
> <campbell+netbsd-source-change...@mumble.net> wrote:
> 
> Job reference counting is currently slightly busted -- my draft for
> threadpool_schedule_job didn't increment it correctly, and your
> threadpool_schedule_job doesn't handle failure in threadpool_job_hold.


Ok, I looked at this some last night, and was up early this morning and took a 
peek at it again, and I think getting rid of the refcnt is going to be tough.

The rub is the special handling of "assigned to overseer" in cancel_async.  
Because of that, you can't really use job_thread as a proxy for "held", and 
there's a small window in the overseer thread where neither the pool nor the 
job are locked because of locking order (which is job -> pool), yet the job is 
referenced; this happens when there is a thread to give the job to, we pull the 
job off the overseer queue, but we can't lock the job until we drop the tp_lock 
-- it's during that window where we have phantom reference that needs to be 
accounted for.

So, here's how I think it can be fixed:

-- Must continue to use atomics for job_hold / job_rele.

-- job_hold remains lockless, but job_rele can only be called **without** the 
job_lock held, because it needs to acquire the lock in the unlikely case it 
performs a cv_broadcast (see below).  Also need a job_rele_locked.

-- In schedule_job, always job_hold for either case (overseer or direct 
hand-off to idle thread).  This is the job's lifecycle reference, and will be 
dropped either in cancel_async or in job_done.

-- In cancel_async, only job_rele_locked after successfully removing it from 
the overseer queue (protected by tp_lock).  Note the job_lock is held by the 
caller of cancel_async.

-- In overseer thread, after grabbing a local reference to the job, do a 
job_hold before dropping the tp_lock, to prevent racing with cancel_async.  
This is a temporary hold while we do our work, and will always be released, 
regardless of job being cancelled beneath us or given to a thread.  Note the 
most likely scenario is that the job was NOT cancelled beneath us, and thus we 
will not be dropping the last reference, and therefore we don't want to take 
the job_lock again in this common case just to drop our temporary reference.

-- job_done should assert there is always at least 1 reference, and should 
directly decrement it rather than doing job_rele, because it already has the 
job_lock held and always does the cv_broadcast anyway.

-- Don't need to handle any overflow in job_hold / job_rele ... there reference 
count should ever one be 0 (totally idle), 1 (scheduled/running OR phantom 
reference), or 2 (scheduled AND phantom reference).  We can assert no overflow 
in job_hold.

-- thorpej

Reply via email to