On 02/04/2016 02:17 PM, Mike Galbraith wrote:
> On Thu, 2016-02-04 at 13:51 +0200, Nikolay Borisov wrote:
>>
>> On 02/04/2016 01:32 PM, Mike Galbraith wrote:
>>> On Wed, 2016-02-03 at 12:58 +0200, Nikolay Borisov wrote:
>>>>
>>>> So in this case the prev/next entries do not look like corrupted,
>>>> whereas when manipulating the list inside detach_timer they do.
>>>> This is really odd, any ideas how to further debug this?
>>>
>>> Suspiciously similar to https://lkml.org/lkml/2016/2/4/247
>>
>> Right, I've been cursory following this thread but I was left with the
>> impression this only occurs on machines where the CPU can go offline,
>> currently the server on which this happened should never offline any of
>> its CPUs since the power management is disabled (though I will have to
>> double check this).
>
> AFAIU, hotplug isn't required, only mod_delayed_work() being called
> from a different CPU than where the timer was born, migrating it at a
> bad time.

Right, in this case the ib_addr was indeed using mod_delayed_work so
things line up so far.

>
>> On a different note - is there a way to safely reproduce this so I can
>> test the suggested fix by Thomas?
>
> Hm, write a module to beat mod_delayed_work() to pulp with a NR_CPUS
> horde, and run it in a vm where you don't care about shrapnel?

In other words, have multiple threads (NR_CPUS) that spin on
mod_delayed_work?

>
> -Mike
>
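
To make the question concrete, something along these lines is what I have
in mind. This is a rough, untested sketch, not a known reproducer: one
kthread per online CPU, each re-arming the same delayed work on system_wq
so the timer keeps getting modified from a CPU other than the one that
armed it. All names here (mdw_stress, stress_dwork, stress_thread) are
made up for illustration, and the 1-jiffy delay is only a guess at what
makes the race likely. It should only be loaded in a throwaway VM.

```c
/* mdw_stress: hypothetical stress module, illustrative only. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

/* The work itself does nothing; only the timer handling matters. */
static void stress_fn(struct work_struct *work)
{
}

/* One delayed work shared by every thread, so all CPUs fight over it. */
static DECLARE_DELAYED_WORK(stress_dwork, stress_fn);

static struct task_struct *stress_threads[NR_CPUS];

static int stress_thread(void *unused)
{
	while (!kthread_should_stop()) {
		/* Re-arm the shared timer from whatever CPU we are bound to. */
		mod_delayed_work(system_wq, &stress_dwork, 1);
		cond_resched();
	}
	return 0;
}

static void stress_stop(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (stress_threads[cpu]) {
			kthread_stop(stress_threads[cpu]);
			stress_threads[cpu] = NULL;
		}
	}
	/* No thread can re-arm the work any more; wait for it to finish. */
	cancel_delayed_work_sync(&stress_dwork);
}

static int __init stress_init(void)
{
	struct task_struct *t;
	int cpu;

	for_each_online_cpu(cpu) {
		t = kthread_create(stress_thread, NULL, "mdw_stress/%d", cpu);
		if (IS_ERR(t)) {
			stress_stop();
			return PTR_ERR(t);
		}
		/* Pin each thread so the callers really are on different CPUs. */
		kthread_bind(t, cpu);
		stress_threads[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

static void __exit stress_exit(void)
{
	stress_stop();
}

module_init(stress_init);
module_exit(stress_exit);
MODULE_LICENSE("GPL");
```

Using a single shared delayed_work rather than one per thread is deliberate
in this sketch: per-thread work items would never be modified from a
foreign CPU, which is the condition Mike describes.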

