Re: perf: fuzzer crashes immediately on AMD system
On Wed, 24 Aug 2016, Ingo Molnar wrote:

> If there's no progress finding the root cause I'd be happy to exchange a
> crash for a leak ...

It's actually a crash of the program doing the perf_event_open() call, not a crash of the system (at least in my experience).

However, it's possible that if you have bad luck and if the kfree'd space is reused with just the right combination of values you could potentially end up crashing the system.

Vince
Re: perf: fuzzer crashes immediately on AMD system
* Vince Weaver wrote:

> On Tue, 23 Aug 2016, Peter Zijlstra wrote:
>
> > On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > > > amd_uncore_find_online_sibling()
> > > > function is broken.
> > >
> > > and that's the problem. uncore_find_online_sibling() does all kinds of
> > > wrong things including sticking active uncore structures in
> > > uncore->free_when_cpu_online
> > >
> > > Then uncore_online() comes along and frees those structures.
> > >
> > > Then some other part of the kernel comes and re-uses the free'd data.
> > >
> > > Then when we try to start an event, all of the fields are invalid because
> > > the uncore pointer is pointing to re-used data.
> > >
> > > I don't have a patch because I am not 100% clear on what
> > > uncore_find_online_sibling() is doing in the first place.
> >
> > Thanks for doing all that, I'll see if I can make sense of it.
>
> I should have provided more detail, was just tired after chasing the bug
> for so long. I mostly found things by sprinkling printks everywhere.
> Commenting out the call to kfree() in uncore_online() makes the code stop
> crashing (but perhaps causes a memory leak?)

If there's no progress finding the root cause I'd be happy to exchange a crash for a leak ...

> In any case it's odd the problem didn't show up earlier, but maybe the
> recent changes to CPU hotplugging in that file exposed the issue.

Yeah, we had lots of changes to CPU hotplugging recently.

Thanks,

Ingo
Re: perf: fuzzer crashes immediately on AMD system
On Tue, 23 Aug 2016, Peter Zijlstra wrote:

> On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > > amd_uncore_find_online_sibling()
> > > function is broken.
> >
> > and that's the problem. uncore_find_online_sibling() does all kinds of
> > wrong things including sticking active uncore structures in
> > uncore->free_when_cpu_online
> >
> > Then uncore_online() comes along and frees those structures.
> >
> > Then some other part of the kernel comes and re-uses the free'd data.
> >
> > Then when we try to start an event, all of the fields are invalid because
> > the uncore pointer is pointing to re-used data.
> >
> > I don't have a patch because I am not 100% clear on what
> > uncore_find_online_sibling() is doing in the first place.
>
> Thanks for doing all that, I'll see if I can make sense of it.

I should have provided more detail, was just tired after chasing the bug for so long. I mostly found things by sprinkling printks everywhere.

Commenting out the call to kfree() in uncore_online() makes the code stop crashing (but perhaps causes a memory leak?)

In any case it's odd the problem didn't show up earlier, but maybe the recent changes to CPU hotplugging in that file exposed the issue.

Vince
Re: perf: fuzzer crashes immediately on AMD system
On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > amd_uncore_find_online_sibling()
> > function is broken.
>
> and that's the problem. uncore_find_online_sibling() does all kinds of
> wrong things including sticking active uncore structures in
> uncore->free_when_cpu_online
>
> Then uncore_online() comes along and frees those structures.
>
> Then some other part of the kernel comes and re-uses the free'd data.
>
> Then when we try to start an event, all of the fields are invalid because
> the uncore pointer is pointing to re-used data.
>
> I don't have a patch because I am not 100% clear on what
> uncore_find_online_sibling() is doing in the first place.

Thanks for doing all that, I'll see if I can make sense of it.
Re: perf: fuzzer crashes immediately on AMD system
> > > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> amd_uncore_find_online_sibling()
> function is broken.

and that's the problem. uncore_find_online_sibling() does all kinds of wrong things including sticking active uncore structures in uncore->free_when_cpu_online

Then uncore_online() comes along and frees those structures.

Then some other part of the kernel comes and re-uses the free'd data.

Then when we try to start an event, all of the fields are invalid because the uncore pointer is pointing to re-used data.

I don't have a patch because I am not 100% clear on what uncore_find_online_sibling() is doing in the first place.

Vince
Re: perf: fuzzer crashes immediately on AMD system
On Mon, 22 Aug 2016, Huang Rui wrote:

> Hi Peter, Vince
>
> On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > >
> > > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and
> > > > it falls over more or less immediately.
> > > >
> > > > This maps to variable_test_bit()
> > > > called by ctx = find_get_context(pmu, task, event);
> > > > in kernel/events/core.c:9467
> > > >
> > > > It happens quickly enough I can probably track down the exact event
> > > > that causes this, if needed.
> > >
> > > I have a one line reproducer:
> > >
> > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> >
> > OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> > various manuals to see if I can spot the fail.
> >
> > Huang could you either prod someone at AMD or do yourself, audit the AMD
> > perf code for all the various new models?
>
> Actually, there might be some NBPMC event changes between model 0h-fh and
> model 10h-1fh. Below are the documents of these two processors:
>
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf
>
> In section 3.16, it describes usage of NB Performance Counter Events.

I don't think it's the hardware that's causing the problem.

I've wasted a lot more time on it, and finally figured out how the "bt" instruction works, so the assembly more or less makes sense.

The problem is the per-cpu amd_uncore struct is being over-written with kernel memory addresses. This makes uncore[0]->cpu a large number (it's often, but not always, the per-cpu address of uncore[1]->cpu) which leads to the GPF.

I can't figure out what piece of code is overwriting things though.

And to make things complicated, I think the amd_uncore_find_online_sibling() function is broken.

The code could really use more commenting, but I think it is designed so all siblings share one single amd_uncore structure, but in practice it looks like this doesn't work due to the way the list iterator works.

Vince
Re: perf: fuzzer crashes immediately on AMD system
Hi Peter, Vince

On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> >
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > >
> > > This maps to variable_test_bit()
> > > called by ctx = find_get_context(pmu, task, event);
> > > in kernel/events/core.c:9467
> > >
> > > It happens quickly enough I can probably track down the exact event that
> > > causes this, if needed.
> >
> > I have a one line reproducer:
> >
> > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
>
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
>
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

Actually, there might be some NBPMC event changes between model 0h-fh and model 10h-1fh. Below are the documents of these two processors:

http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf

In section 3.16, it describes usage of NB Performance Counter Events.

Hope it helps. :-)

Thanks,
Rui
Re: perf: fuzzer crashes immediately on AMD system
On Fri, 19 Aug 2016, Peter Zijlstra wrote:

> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> >
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > >
> > > This maps to variable_test_bit()
> > > called by ctx = find_get_context(pmu, task, event);
> > > in kernel/events/core.c:9467
> > >
> > > It happens quickly enough I can probably track down the exact event that
> > > causes this, if needed.
> >
> > I have a one line reproducer:
> >
> > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
>
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
>
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

This is bizarre, I can't make any sense of the crash.

To recap, the crash looks like this:

BUG: unable to handle kernel paging request at 85e67600
IP: [] find_get_context.isra.75+0x28/0x20f

The code in question is this code:

	if (!cpu_online(cpu))

which maps to

	test_bit(cpumask_check(cpu), cpumask_bits((cpumask)));

which assembles to

  810e4ca9:  41 89 cc              mov    %ecx,%r12d
  810e4cac:  7f 1e                 jg     810e4ccc
  810e4cae:  44 89 e0              mov    %r12d,%eax
* 810e4cb1:  48 0f a3 05 87 0f 7f  bt     %rax,0x7f0f87(%rip)   # 818d5c40 <__cpu_online_mask>
  810e4cb8:  00
  810e4cb9:  0f 92 c0              setb   %al
  810e4cbc:  84 c0                 test   %al,%al

There is no way that 0x7f0f87(%rip) should ever possibly be the 85e67600 value that causes the fault.

Though oddly rax when the call happens (according to the oops message) is

RAX: 22c8ce30

which seems nonsensical for a CPU number, but shouldn't cause an invalid memory address.

Also oddly RDI matches RAX but RCX doesn't which I think should be true with that assembly.

So very weird.

I even wrote a kernel module and dumped the raw kernel memory to make sure the instruction stream didn't get overwritten somehow, but as far as I can tell the code in memory matches the disassembly.

Anyway, I am out of time to look at this for now.

Vince
Re: perf: fuzzer crashes immediately on AMD system
On Fri, 19 Aug 2016, Vince Weaver wrote:

> OK, this is weird. I rebooted (didn't patch the kernel, just rebooted)
> and I can't reproduce the original problem at all.

I rebooted three more times (after perf_fuzzer turned up a more boring probably known dump, shown at end) and now I am hitting the original bug again. Weird. Let me see if I can figure out what is going on.

and for the record, the bug the fuzzer kicks out when it doesn't hit the weird one:

note this is sprinkled among thousands of
[ 3782.364287] BAD LUCK: lost 7650 message(s) from NMI context!

[ 3780.821837] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:12074]
[ 3781.493831] CPU: 2 PID: 12074 Comm: perf_fuzzer Tainted: G L 4.8.0-rc2+ #27
[ 3781.508478] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
[ 3781.524054] task: 8802232cf280 task.stack: 8802252c
[ 3781.542904] RIP: 0010:[] [] smp_call_function_single+0xbb/0xca
[ 3781.558618] RSP: 0018:8802252c3d78 EFLAGS: 0202
[ 3781.570752] RAX: RBX: 0001 RCX:
[ 3781.584757] RDX: 0001 RSI: 08fb RDI: 0300
[ 3781.598819] RBP: 0001 R08: 0003 R09: 7f0c0ea07700
[ 3781.612930] R10: 7f0c0ea079d0 R11: 0206 R12: 810e226b
[ 3781.627107] R13: 8802252c3dc8 R14: 8802252c3d78 R15:
[ 3781.641335] FS: 7f0c0ea07700() GS:88022ed0() knlGS:
[ 3781.656573] CS: 0010 DS: ES: CR0: 80050033
[ 3781.669534] CR2: 7f0c0e7d72c8 CR3: 0002251d1000 CR4: 000407e0
[ 3781.683929] DR0: DR1: DR2:
[ 3781.698410] DR3: DR6: 0ff0 DR7: 00010602
[ 3781.712845] Stack:
[ 3781.747577] 810e226b 8802252c3dc8 0003
[ 3781.787434] e8c87190 880223fb7800 810e5676
[ 3781.827415] 810e18df 810e16cd 810e13d2
[ 3781.841792] Call Trace:
[ 3781.851292] [] ? perf_cgroup_attach+0x34/0x34
[ 3781.864355] [] ? group_sched_out+0x70/0x70
[ 3781.877219] [] ? event_function_call+0xa8/0xa8
[ 3781.890345] [] ? cpu_function_call+0x32/0x3b
[ 3781.903284] [] ? perf_ctx_lock+0x1e/0x1e
[ 3781.915864] [] ? event_function_call+0x49/0xa8
[ 3781.928952] [] ? group_sched_out+0x70/0x70
[ 3781.941675] [] ? event_function_call+0xa8/0xa8
[ 3781.954734] [] ? perf_event_for_each_child+0x53/0x8a
[ 3781.968295] [] ? perf_ioctl+0x41d/0x495
[ 3781.980725] [] ? vfs_ioctl+0x16/0x23
[ 3781.992893] [] ? do_vfs_ioctl+0x46e/0x519
[ 3782.005532] [] ? do_sigaltstack+0xe1/0x1b0
[ 3782.018184] [] ? SyS_ioctl+0x4e/0x71
[ 3782.030319] [] ? entry_SYSCALL_64_fastpath+0x17/0x93
[ 3782.433996] Code: e2 01 74 04 f3 90 eb f4 83 48 18 01 4c 89 e9 4c 89 e2 4c 89 f6 89 ef e8 94 fe ff ff 85 db 74 0d 41 8b 56 18 80 e2 01 74 04 f3 90 f3 48 83 c4 20 5b 5d 41 5c 41 5d 41 5e c3 41 56 41 55 41 89
Re: perf: fuzzer crashes immediately on AMD system
On Fri, 19 Aug 2016, Peter Zijlstra wrote:

> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> >
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > >
> > > This maps to variable_test_bit()
> > > called by ctx = find_get_context(pmu, task, event);
> > > in kernel/events/core.c:9467
> > >
> > > It happens quickly enough I can probably track down the exact event that
> > > causes this, if needed.
> >
> > I have a one line reproducer:
> >
> > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
>
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
>
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

OK, this is weird. I rebooted (didn't patch the kernel, just rebooted) and I can't reproduce the original problem at all.

It was perfectly repeatable before I rebooted, dumped an OOPS message every time. Sadly I don't have the fuzzer logs that originally triggered the bug (need more serial/USB cables. Actually no, I need more null-modem adapters).

Let me look into this a bit more.

Vince
Re: perf: fuzzer crashes immediately on AMD system
On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> >
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > >
> > > This maps to variable_test_bit()
> > > called by ctx = find_get_context(pmu, task, event);
> > > in kernel/events/core.c:9467
> > >
> > > It happens quickly enough I can probably track down the exact event that
> > > causes this, if needed.
> >
> > I have a one line reproducer:
> >
> > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
>
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
>
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

So this should obviously help a little in that it will limit the events you can program into the hardware. Not at all sure that is what you're hitting though, because I cannot for the life of me figure how that would end up exploding in generic code.
---
 arch/x86/events/amd/uncore.c | 47 +---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index e6131d4..8c314d7 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -174,8 +174,8 @@ static void amd_uncore_del(struct perf_event *event, int flags)
 
 static int amd_uncore_event_init(struct perf_event *event)
 {
-	struct amd_uncore *uncore;
 	struct hw_perf_event *hwc = &event->hw;
+	struct amd_uncore *uncore;
 
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
@@ -215,6 +215,47 @@ static int amd_uncore_event_init(struct perf_event *event)
 	return 0;
 }
 
+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+	return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
+}
+
+static int amd_uncore_l2_event_init(struct perf_event *event)
+{
+	int ret = amd_uncore_event_init(event);
+	unsigned int event_code;
+
+	if (ret)
+		return ret;
+
+	/*
+	 * Fam16h L2I performance counter events are in the range: 0x060 - 0x07F
+	 */
+	event_code = amd_get_event_code(&event->hw);
+	if (event_code < 0x060 || event_code > 0x07F)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int amd_uncore_nb_event_init(struct perf_event *event)
+{
+	int ret = amd_uncore_event_init(event);
+	unsigned int event_code;
+
+	if (ret)
+		return ret;
+
+	/*
+	 * AMD NB events will have bits 0x0E0 set.
+	 */
+	event_code = amd_get_event_code(&event->hw);
+	if ((event_code & 0x0E0) != 0x0E0)
+		return -EINVAL;
+
+	return 0;
+}
+
 static ssize_t amd_uncore_attr_show_cpumask(struct device *dev,
 					    struct device_attribute *attr,
 					    char *buf)
@@ -266,7 +307,7 @@ static struct pmu amd_nb_pmu = {
 	.task_ctx_nr	= perf_invalid_context,
 	.attr_groups	= amd_uncore_attr_groups,
 	.name		= "amd_nb",
-	.event_init	= amd_uncore_event_init,
+	.event_init	= amd_uncore_nb_event_init,
 	.add		= amd_uncore_add,
 	.del		= amd_uncore_del,
 	.start		= amd_uncore_start,
@@ -278,7 +319,7 @@ static struct pmu amd_l2_pmu = {
 	.task_ctx_nr	= perf_invalid_context,
 	.attr_groups	= amd_uncore_attr_groups,
 	.name		= "amd_l2",
-	.event_init	= amd_uncore_event_init,
+	.event_init	= amd_uncore_l2_event_init,
 	.add		= amd_uncore_add,
 	.del		= amd_uncore_del,
 	.start		= amd_uncore_start,
Re: perf: fuzzer crashes immediately on AMD system
On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> On Thu, 18 Aug 2016, Vince Weaver wrote:
>
> > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > falls over more or less immediately.
> >
> > This maps to variable_test_bit()
> > 	called by ctx = find_get_context(pmu, task, event);
> > 	in kernel/events/core.c:9467
> >
> > It happens quickly enough I can probably track down the exact event that
> > causes this, if needed.
>
> I have a one line reproducer:
>
> 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls

OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
various manuals to see if I can spot the fail.

Huang could you either prod someone at AMD or do yourself, audit the AMD
perf code for all the various new models?
Re: perf: fuzzer crashes immediately on AMD system
On Thu, 18 Aug 2016, Vince Weaver wrote:

> Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> falls over more or less immediately.
>
> This maps to variable_test_bit()
> 	called by ctx = find_get_context(pmu, task, event);
> 	in kernel/events/core.c:9467
>
> It happens quickly enough I can probably track down the exact event that
> causes this, if needed.

I have a one line reproducer:

	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
perf: fuzzer crashes immediately on AMD system
Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
falls over more or less immediately.

This maps to variable_test_bit()
	called by ctx = find_get_context(pmu, task, event);
	in kernel/events/core.c:9467

It happens quickly enough I can probably track down the exact event that
causes this, if needed.

[ 101.970659] BUG: unable to handle kernel paging request at 8653d8a0
[ 101.977676] IP: [] find_get_context.isra.75+0x28/0x20f
[ 101.984405] PGD 2807067 PUD 2808063 PMD 0
[ 101.988563] Oops: [#1] SMP
[ 102.069521] CPU: 0 PID: 2205 Comm: perf_fuzzer Not tainted 4.8.0-rc2+ #27
[ 102.076313] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
[ 102.085268] task: 880223ae5000 task.stack: 880224ea8000
[ 102.091188] RIP: 0010:[] [] find_get_context.isra.75+0x28/0x20f
[ 102.100339] RSP: 0018:880224eabe20 EFLAGS: 00010246
[ 102.105657] RAX: 2633e300 RBX: RCX: 2633e300
[ 102.112795] RDX: RSI: RDI: 8180ea00
[ 102.119929] RBP: 8180ea00 R08: 0004 R09:
[ 102.127063] R10: 0003 R11: 0246 R12: 2633e300
[ 102.134196] R13: R14: R15: 8180ea00
[ 102.141327] FS: 7f743b391700() GS:88022ec0() knlGS:
[ 102.149416] CS: 0010 DS: ES: CR0: 80050033
[ 102.155167] CR2: 8653d8a0 CR3: 0002255b9000 CR4: 000407f0
[ 102.162309] Stack:
[ 102.164323] 880223b9d800 880224fdd000
[ 102.171804] 880223b9d800
[ 102.179284] 8180ea00 810e72be 0002 88022e0006c0
[ 102.186765] Call Trace:
[ 102.189216] [] ? SYSC_perf_event_open+0x525/0xa34
[ 102.195579] [] ? entry_SYSCALL_64_fastpath+0x17/0x93
[ 102.202203] Code: 41 5c c3 41 57 41 56 41 55 41 54 55 53 48 89 fd 48 89 f3 48 83 ec 18 48 85 f6 75 6c 83 3d 2f 2a 7f 00 00 41 89 cc 7f 1e 44 89 e0 <48> 0f a3 05 87 0f 7f 00 0f 92 c0 84 c0 75 26 48 c7 c0 ed ff ff
[ 102.56] RIP [] find_get_context.isra.75+0x28/0x20f
[ 102.229065] RSP
[ 102.232556] CR2: 8653d8a0
[ 102.235879] ---[ end trace fa649074c022bab1 ]---