> On Sep 7, 2017, at 12:55 PM, Jiri Kosina <ji...@kernel.org> wrote:
>
> On Thu, 7 Sep 2017, Ingo Molnar wrote:
>
>>>> When Linux brings a CPU down and back up, it switches to init_mm and then
>>>> loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
>>>> of masking off the ASID bits in CR3.
>>>>
>>>> This can result in some confusion in the TLB handling code. If we
>>>> bring a CPU down and back up with any ASID other than 0, we end up
>>>> with the wrong ASID active on the CPU after resume. This could
>>>> cause our internal state to become corrupt, although major
>>>> corruption is unlikely because init_mm doesn't have any user pages.
>>>> More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
>>>> in the next context switch. The result of *that* is a failure to
>>>> resume from suspend with probability 1 - 1/6^(cpus-1).
>>>>
>>>> Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
>>>>
>>>> Reported-by: Linus Torvalds <torva...@linux-foundation.org>
>>>> Reported-by: Jiri Kosina <ji...@kernel.org>
>>>> Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to
>>>> preserve old TLB entries using PCID")
>>>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>>>
>>> Tested-by: Jiri Kosina <jkos...@suse.cz>
>>
>> The fix should be upstream already, as of 1c9fe4409ce3 and later.
>
> Hm, so I've just experienced two instances in a row of reboot just after
> reading hibernation image (i.e. exactly the same symptom as before) even
> with 3b9f8ed kernel (which contains the fix). Seems like the fix is either
> incomplete (just the probability of it happening is lower), or I'm seeing
> something different with the same symptom.
>
> I'll try to figure out whether it's the same VM_BUG_ON() triggering, but
> probably will be able to do so only tomorrow.
>
Nah, don't waste your time. I think I see the bug, and it's a different bug. It's an easy one-line fix, but I have to figure out how to test it.

> --
> Jiri Kosina
> SUSE Labs
>