On Jan 25, 2008 11:24 PM, Greg KH <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 25, 2008 at 11:08:53PM -0800, Yinghai Lu wrote:
> > On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > >
> > > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote:
> > > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > > > > > >
> > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > > > > > current linus tree + x86.git
> > > > > > > > >
> > > > > > > > > got
> > > > > > > > >
> > > > > > > > > Calling initcall 0xffffffff80b93d98:
> > > > > > > > > threshold_init_device+0x0/0x3f()
> > > > > > > > > BUG: unable to handle kernel NULL pointer dereference at
> > > > > > > > > 0000000000000040
> > > > > > > > > IP: [<ffffffff80458e20>] kobject_uevent_env+0x2a/0x3d9
> > > > > > > >
> > > > > > > > Does this happen on just Linus's tree?
> > > > > > > >
> > > > > > > > Can you send me a .config file for this?
> > > > > > > >
> > > > > > > > What is threshold_init()? Is it something new in the x86.git
> > > > > > > > tree?
> > > > > > >
> > > > > > > no. A quick grep shows that it is in a file that _your_ changes in
> > > > > > > Linus' latest have touched:
> > > > > > >
> > > > > > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> > > > > >
> > > > > > Ok, those are pretty much just search/and/replace type changes, but I
> > > > > > have been running x86-64 boxes with these changes in place.
> > > > >
> > > > > Oh wait, I do see a change. We are now (finally) emitting a kobject
> > > > > uevent for these devices, which somehow the code can't handle
> > > > > properly.
> > > > >
> > > > > Let me go poke this some more, unfortunately I don't have any AMD 64
> > > > > boxes here anymore, only Intel based processors, so I can't run this
> > > > > module...
> > > >
> > > > it only happens with AMD Quad Core CPU or Fam 10h.
> > > >
> > > > works well with AMD opteron Rev E, and Rev F.
> > >
> > > So this only dies on a multi-core system? Or does 2 processor boxes
> > > work, but not 4?
> >
> > 2 sockets x quad core will fail (fam 10h)
> > 2 sockets x dual core works... (rev E, and rev F opteron)
> >
> > there are some changes between opteron and fam10h. fam10h may have
> > more local vectors for MCE...
> > or more banks and blocks...
> >
> > will look at AMD64 Bios and kernel porting guide for Fam 10h again..
> >
> > wonder if your code uncover some bugs ...
>
> No, the logic in this function is just crazy. It's recursive, but we
> can circumvent the creation for the kobject and whole creation of the
> threshold_block if some conditions are met. That's why we see the
> allocate_threshold_blocks so many times in the callstack, yet only a few
> kobjects created.
I produced a patch that removes the recursion. I will test it and your patch on Monday.

YH
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
index 65621fd..5c4cb21 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
@@ -553,8 +552,9 @@ static __cpuinit int threshold_create_device(unsigned int cpu)
 	unsigned int bank;
 	int err = 0;
 
+	printk(KERN_DEBUG "threshold_create_device: cpu %d, bank_map=%02x\n", cpu, per_cpu(bank_map,cpu));
 	for (bank = 0; bank < NR_BANKS; ++bank) {
-		if (!(per_cpu(bank_map, cpu) & 1 << bank))
+		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		err = threshold_create_bank(cpu, bank);
 		if (err)
@@ -637,7 +637,7 @@ static void threshold_remove_device(unsigned int cpu)
 	unsigned int bank;
 
 	for (bank = 0; bank < NR_BANKS; ++bank) {
-		if (!(per_cpu(bank_map, cpu) & 1 << bank))
+		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		threshold_remove_bank(cpu, bank);
 	}