Hello,

On Fri, Feb 06, 2015 at 06:11:56PM +0100, Rabin Vincent wrote:
> task1                           task2                           worker
> 
>                                                                 add to busy 
> hash
>                                                                 clear pending
>                                                                 exec work
> 
> __cancel_work_timer()
>  try_to_grab_pending()
>    get pending, return 0
>  set work cancelling
>  flush_work()
>    wait_for_completion()
> 
>                                                                 remove from 
> busy_hash
> 
>                                 __cancel_work_timer()
>                                 while (forever) {
>                                   try_to_grab_pending()
>                                     return -ENOENT due to cancelling
>                                   flush_work
>                                     return false due to already gone
>                                 }
> 
> 
> On kernels with CONFIG_PREEMPT disabled, this causes a permanent soft
> lockup of the system.

Ah, drat.

> Adding a cond_resched() to the loop in cancel_delayed_work_sync(), like
> below, helps the simple !PREEMPT case, but does not completely fix the
> problem, because the problem can still be seen if the thread which sees
> the ENOENT has for example SCHED_FIFO scheduling class, both on systems
> with CONFIG_PREEMPT enabled and on !PREEMPT with the cond_resched().
> 
> We've seen this problem with the cancel_delayed_work() call in
> jffs2_sync_fs(), but I've attached a testing patch which forces the
> problem on current mainline without the need for jffs2.
...
> @@ -2741,6 +2741,9 @@ static bool __cancel_work_timer(struct work_struct 
> *work, bool is_dwork)
>                */
>               if (unlikely(ret == -ENOENT))
>                       flush_work(work);
> +
> +             if (ret < 0)
> +                     cond_resched();
>       } while (unlikely(ret < 0));

Well, an obvious thing to do would be

                if (unlikely(ret == -ENOENT)) {
                        if (!flush_work(work))
                                schedule_timeout(1);
                }

It was gonna block for the work item anyway so I don't think
schedule_timeout() there is a problem.  That said, given that we're
guaranteed to be able to dereference @work there, there probably is a
better way.  I'll think more about it.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to