On Tue, May 16, 2017 at 7:38 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, May 04, 2017 at 10:31:43AM +0800, Zefan Li wrote:
>> It is assumed that the head of cache_groups always has valid RMID,
>> which isn't true.
>>
>> When we deallocate RMID from conflicting events currently we don't
>> move them to the tail, and one of those events can happen to be in
>> the head. Another case is we allocate RMIDs for all the events except
>> the head event in intel_cqm_sched_in_event().
>>
>> Besides there's another bug that we retry rotating without resetting
>> nr_needed and start in __intel_cqm_rmid_rotate().
>>
>> Those bugs combined together led to the following oops.
>>
>> WARNING: at arch/x86/kernel/cpu/perf_event_intel_cqm.c:186
>> __put_rmid+0x28/0x80()
>> ...
>> [<ffffffff8103a578>] __put_rmid+0x28/0x80
>> [<ffffffff8103a74a>] intel_cqm_rmid_rotate+0xba/0x440
>> [<ffffffff8109d8cb>] process_one_work+0x17b/0x470
>> [<ffffffff8109e69b>] worker_thread+0x11b/0x400
>> ...
>> BUG: unable to handle kernel NULL pointer dereference at (null)

I ran into this bug a long time ago but never found an easy way to
reproduce it. Do you have one?

>> ...
>> [<ffffffff8103a74a>] intel_cqm_rmid_rotate+0xba/0x440
>> [<ffffffff8109d8cb>] process_one_work+0x17b/0x470
>> [<ffffffff8109e69b>] worker_thread+0x11b/0x400
>
> I've managed to forgot most if not all of that horror show. Vikas and
> David seem to be working on a replacement, but until such a time it
> would be good if this thing would not crash the kernel.
>
> Guys, could you have a look? To me it appears to mostly have the right
> shape, but like I said, I forgot most details...

The patch LGTM. I ran into these issues before and fixed them in a
similar but messier way, then the re-write started ...

>
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Zefan Li <lize...@huawei.com>

Acked-by: David Carrillo-Cisneros <davi...@google.com>