Hi Tejun,

Thanks, I see I missed the RCU part. I'll try forcing atomic mode,
though so far I haven't been able to reproduce the problem.

Thanks,
David

2018-03-14 8:43 GMT-07:00 Tejun Heo <t...@kernel.org>:
> Hello, David.
>
> On Tue, Mar 13, 2018 at 03:50:47PM -0700, David Chen wrote:
>> ====
>> CPU A                              CPU B
>> -----                              -----
>> percpu_ref_kill()                  percpu_ref_tryget_live()
>>                                    {
>>                                      if (__ref_is_percpu())
>> set __PERCPU_REF_DEAD;
>> __percpu_ref_switch_mode();
>> ^ sum up current percpu_count
>>                                        this_cpu_inc(*percpu_count); <- this
>> increment got leaked.
>>
>> ====
>>
>> So if later CPU B later does percpu_ref_put, it will cause ref->count
>> to drop to -1.
>> And thus causing the above hung task issue.
>>
>> Do you think this theory is correct, or am I missing something?
>> Please tell me what do you think.
>
> The switching to atomic mode does something like the following.
>
> 1. Mark the refcnt so that __ref_is_percpu() is false.
>
> 2. Wait for RCU grace period so that everyone including
>    percpu_ref_tryget_live() which has seen true __ref_is_percpu() is
>    done with its operation.
>
> 3. Now that it knows nobody is operating on the assumption that the
>    counter is in percpu mode, it adds up all the percpu counters.
>
> So, provided there aren't some silly bugs, what you described
> shouldn't happen.  Can you force the refcnt into atomic mode w/
> PERCPU_REF_INIT_ATOMIC and see whether the problem persists?
>
> Thanks.
>
> --
> tejun
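
For reference, the three steps quoted above can be sketched as a small
single-threaded userspace model. This is not the kernel implementation:
every name here (model_ref, model_get, model_put, model_switch_to_atomic,
MODEL_NR_CPUS) is hypothetical, and the RCU grace period is only marked by
a comment, since a single-threaded model has no concurrent readers to wait
for. It shows only the ordering: clear the percpu flag first, wait, and sum
the per-CPU counters last.

```c
/* Userspace model only; not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 2

struct model_ref {
	bool percpu_mode;                 /* stands in for __ref_is_percpu() */
	long percpu_count[MODEL_NR_CPUS]; /* stands in for the per-CPU counters */
	long atomic_count;                /* stands in for the shared atomic count */
};

static void model_init(struct model_ref *ref)
{
	ref->percpu_mode = true;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		ref->percpu_count[cpu] = 0;
	ref->atomic_count = 1; /* the initial reference held by the owner */
}

static void model_get(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (ref->percpu_mode)
		ref->percpu_count[cpu]++;
	else
		ref->atomic_count++;
}

/* Returns true when the count reaches zero (only possible in atomic mode). */
static bool model_put(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (ref->percpu_mode) {
		ref->percpu_count[cpu]--;
		return false;
	}
	return --ref->atomic_count == 0;
}

static void model_switch_to_atomic(struct model_ref *ref)
{
	/* Step 1: new getters and putters stop using the per-CPU counters. */
	ref->percpu_mode = false;

	/*
	 * Step 2: the real code waits for an RCU grace period here, so that
	 * anyone who saw percpu_mode == true has finished its increment or
	 * decrement. Nothing to wait for in this single-threaded model.
	 */

	/* Step 3: only now fold the per-CPU counters into the atomic count. */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
}
```

Because the sum in step 3 happens after the wait in step 2, an increment
made by a getter that still saw percpu mode is already in its per-CPU
counter by the time the counters are added up, so it is not lost.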