Dear Ingo and Peter,

I would like to report a possible bug in the CFS scheduler that can cause a deadlock.

We suspect this bug is the cause of intermittent but highly persistent system
freezes on our quad-core SMP systems.

We observed the problem on 4.1.17 with preempt-rt, but we suspect the problematic
code is not tied to the preempt-rt patch and is also present in the latest 4.20
kernel.

The problem concerns the use of spin_lock to take the cfs_b lock in a situation
where the same lock can also be taken from an interrupt handler:
- __run_hrtimer (in kernel/time/hrtimer.c) calls fn(timer) with IRQs enabled.
This can call sched_cfs_period_timer() (in kernel/sched/fair.c), which locks
cfs_b.
- the hard IRQ smp_apic_timer_interrupt can then occur on the same CPU. It can
call ttwu_queue(), which takes the run-queue lock of its CPU and can then try
to enqueue a task via the CFS scheduler.
- this can call check_enqueue_throttle(), which can call assign_cfs_rq_runtime(),
which tries to take the cfs_b lock. It now spins forever: the lock is held by
the code this interrupt preempted, which cannot resume until the handler
returns.

The cfs_b lock is taken with spin_lock and so was not intended for use inside a
hard IRQ, but the CFS scheduler does just that when it uses hrtimer_interrupt to
wake up and enqueue work. Our initial impression is that the cfs_b lock needs
to be taken with spin_lock_irqsave.

My colleague Mike Pearce submitted a bug report to Bugzilla three weeks ago:
https://bugzilla.kernel.org/show_bug.cgi?id=201993

We would appreciate any feedback.

Kind regards,

Tom
