Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
Hi there,

I have read some postings concerning the following kernel messages:

Aug 26 18:04:01 montdsnsu3 kernel: grep[11619] general protection rip:2aaaed43 rsp:7f9c0740 error:0
Aug 26 18:08:02 montdsnsu3 kernel: ping[14867] general protection rip:2aaaed43 rsp:7fdbf300 error:0
Aug 26 18:08:03 montdsnsu3 kernel: grep[14987] general protection rip:2aaaed43 rsp:7fdbfce0 error:0
Aug 26 18:08:03 montdsnsu3 kernel: grep[15041] general protection rip:2aaaed43 rsp:7f9bf550 error:0

And the Bad page state messages:

Bad page state at prep_new_page (in process 'sh', page 8100011a69c8)
flags:0x0114 mapping: mapcount:-3 count:0
Backtrace:

Call Trace:{bad_page+99} {prep_new_page+65}
       {buffered_rmqueue+306} {__alloc_pages+261}
       {get_zeroed_page+37} {__pmd_alloc+55}
       {copy_page_range+462} {copy_mm+820}
       {copy_process+2282} {do_fork+215}
       {system_call+126} {ptregscall_common+103}

Trying to fix it up, but a reboot is needed

I have the same issues on a Sun V20z with dual Opteron 248 CPUs.

montdsnsu3:~# lspci
:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
:00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
:00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
:00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
:00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
:00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
:00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
:01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
:01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
:01:05.0 VGA compatible controller: Trident Microsystems Blade 3D PCI/AGP (rev 3a)
:02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
:02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
:02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)

With the running kernel I have had 2 kernel panics within the last week, and the machine crashed completely.

I would like to offer my help if I can do anything to debug this or provide more information to fix this issue.

HTH,
weiti

--
Punctuation and spelling in this email are entirely made up. Any resemblance to current or former rules would be purely coincidental and is not intended.

Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
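
When collecting details for a report like this one, it can help to check whether the "general protection" lines share a faulting address. The following is only a minimal sketch, assuming the line format shown in the report above; the function name `group_faults_by_rip` and the `SAMPLE` variable are made up for illustration, and the rip values are used exactly as they were printed.

```python
import re
from collections import defaultdict

# Pattern for "general protection" lines in the format quoted in the report.
GP_LINE = re.compile(
    r"kernel: (?P<comm>\S+)\[(?P<pid>\d+)\] general protection "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<error>\d+)"
)


def group_faults_by_rip(log_text):
    """Map each faulting rip to the list of (comm, pid) pairs that hit it."""
    by_rip = defaultdict(list)
    for line in log_text.splitlines():
        match = GP_LINE.search(line)
        if match:
            by_rip[match["rip"]].append((match["comm"], int(match["pid"])))
    return dict(by_rip)


# The four log lines from the report, copied verbatim.
SAMPLE = """\
Aug 26 18:04:01 montdsnsu3 kernel: grep[11619] general protection rip:2aaaed43 rsp:7f9c0740 error:0
Aug 26 18:08:02 montdsnsu3 kernel: ping[14867] general protection rip:2aaaed43 rsp:7fdbf300 error:0
Aug 26 18:08:03 montdsnsu3 kernel: grep[14987] general protection rip:2aaaed43 rsp:7fdbfce0 error:0
Aug 26 18:08:03 montdsnsu3 kernel: grep[15041] general protection rip:2aaaed43 rsp:7f9bf550 error:0
"""

faults = group_faults_by_rip(SAMPLE)
for rip, procs in faults.items():
    print(rip, procs)
```

For the lines above, all four faults end up under the single rip value 2aaaed43, even though they come from different programs (grep and ping).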
Re: Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
On Sun, Aug 28, 2005 at 01:20:51AM +0100, Daniel Drake wrote:
> Hi,
>
> Tim Weippert wrote:
> >I have read some postings concerning the following kernel messages:
> >
> >Aug 26 18:04:01 montdsnsu3 kernel: grep[11619] general protection
> >rip:2aaaed43 rsp:7f9c0740 error:0
> >Aug 26 18:08:02 montdsnsu3 kernel: ping[14867] general protection
> >rip:2aaaed43 rsp:7fdbf300 error:0
> >Aug 26 18:08:03 montdsnsu3 kernel: grep[14987] general protection
> >rip:2aaaed43 rsp:7fdbfce0 error:0
> >Aug 26 18:08:03 montdsnsu3 kernel: grep[15041] general protection
> >rip:2aaaed43 rsp:7f9bf550 error:0
> >
> >And the Bad page state messages:
> >
> >Bad page state at prep_new_page (in process 'sh', page 8100011a69c8)
> >flags:0x0114 mapping: mapcount:-3 count:0
> >Backtrace:
> >
> >Call Trace:{bad_page+99} {prep_new_page+65}
> >       {buffered_rmqueue+306} {__alloc_pages+261}
> >       {get_zeroed_page+37} {__pmd_alloc+55}
> >       {copy_page_range+462} {copy_mm+820}
> >       {copy_process+2282} {do_fork+215}
> >       {system_call+126} {ptregscall_common+103}
> >
> >Trying to fix it up, but a reboot is needed
>
> Seems to be an identical problem as was filed here:
>
> http://bugs.gentoo.org/show_bug.cgi?id=103497
>
> This bug report seems to suggest that the ondemand scaling governor may be
> at fault. Does your setup use this too?
>
> (CC'ing some extra people to make sure problem is known)
>

As this is a server, I don't even use cpufreq on this machine. So I think this isn't the same problem ...

Kind regards,
weiti

P.S.: Please CC me, because I am not subscribed to lkml.

--
Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/
Re: Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
Hi,

On Mon, Aug 29, 2005 at 07:24:54AM +0200, Tim Weippert wrote:
> On Sun, Aug 28, 2005 at 01:20:51AM +0100, Daniel Drake wrote:
> >
> > Seems to be an identical problem as was filed here:
> >
> > http://bugs.gentoo.org/show_bug.cgi?id=103497
> >
> > This bug report seems to suggest that the ondemand scaling governor may be
> > at fault. Does your setup use this too?
> >
> > (CC'ing some extra people to make sure problem is known)
> >
>
> As this is a server, I don't even use cpufreq on this machine. So I
> think this isn't the same problem ...

Update: with stable 2.6.13 I get nearly the same behavior, plus one new oops:

swap_free: Bad swap file entry c07f802f
swap_free: Bad swap file entry c87f802f
swap_free: Bad swap file entry d07f802f
swap_free: Bad swap file entry d87f802f
swap_free: Bad swap file entry e07f802f
swap_free: Bad swap file entry 4000
--- [cut here ] - [please bite here ] -
Kernel BUG at "mm/rmap.c":493
invalid operand: [1] SMP
CPU 1
Modules linked in: autofs4 floppy i2c_amd756 i2c_core hw_random ohci_hcd tg3 tsdev evdev evbug psmouse genrtc unix
Pid: 9014, comm: sh Not tainted 2.6.13
RIP: 0010:[] {page_remove_rmap+43}
RSP: 0018:8100481c3da0 EFLAGS: 00010286
RAX: RBX: 81004a5fc420 RCX: 8100d000
RDX: RSI: RDI: 8100011a69c8
RBP: 00484000 R08: 0001 R09: 000f
R10: 0001 R11: R12: 078bfbff
R13: 810040e133e0 R14: 8100011a69c8 R15:
FS: 457ff970() GS:8056f880() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 2aabd000 CR3: 48205000 CR4: 06e0
Process sh (pid: 9014, threadinfo 8100481c2000, task 810048e7e270)
Stack: 801663f4 00497000 81004937f010 00497000 00497000 00496fff 8100497dd000 00497000 801666ab
Call Trace:{zap_pte_range+436} {unmap_page_range+507}
       {unmap_vmas+293} {exit_mmap+162}
       {mmput+49} {do_exit+442}
       {sys_exit_group+0} {system_call+126}

Code: 0f 0b a3 b4 5b 3f 80 ff ff ff ff c2 ed 01 66 66 66 90 66 66
RIP {page_remove_rmap+43} RSP
<1>Fixing recursive fault but reboot is needed!

With this I get a hanging [sh] process which can't be killed and can only be cleaned up with a reboot:

www-data  7701  0.0  0.3 74448  6452 ?  S  11:56  0:00 /usr/sbin/cactid 0 93
www-data  7721  0.0  0.5 56296 10504 ?  S  11:56  0:00  \_ /usr/bin/php /usr/share/cacti/site/script_server.php cactid 0
www-data  9014  0.0  0.0     0     0 ?  D  11:56  0:00      \_ [sh]

The machine is a Cacti system with generally high load ... it seems the kernel only has problems under higher load.

HTH,
weiti

--
Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/
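
The stuck [sh] above shows up in state D (uninterruptible sleep) in the process listing. As a rough sketch only, such processes could also be found by reading /proc/<pid>/stat directly. The function names and the `proc_root` parameter are hypothetical, and the sample line contains just the first four fields of a stat file (pid, comm, state, parent pid), taken from the listing above.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


# Shortened sample line for the hanging shell (pid 9014, parent 7721).
print(parse_stat("9014 (sh) D 7721"))
```

Calling `uninterruptible_processes()` with no argument scans the live /proc on a Linux machine; a process that stays in the result across repeated calls is a candidate for the kind of hang described here.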
Re: Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
On Mon, Aug 29, 2005 at 10:04:05PM +0200, Bongani Hlope wrote:
> On Monday 29 August 2005 12:28 pm, you wrote:
> > Hi,
> >
> > 8<
> >
> > > Update, with stable 2.6.13. I get nearly the same behavior.
> >
>
> I haven't tried 2.6.13 yet (still downloading), could you first try this
> (with your last working kernel, since you seem to have a problem with 2.6.13)

It's not only with 2.6.13; I think it happens with all kernels newer than 2.6.9 or 2.6.10.

> echo 0 > /proc/sys/kernel/randomize_va_space

OK, I have set this on my 2.6.13 kernel and will see what happens.

> If this "hides" the problems for you, please take a look at this bug report,
> and add your details:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=4851

OK, I will add my details within the next few days, after checking whether this helps.

bye,
tim

--
Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/
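
For anyone scripting the workaround suggested above, a minimal sketch could look like the following. It only wraps the quoted `echo 0 > /proc/sys/kernel/randomize_va_space` command; the function names are made up for illustration, and the `path` parameter exists so the code can be tried against an ordinary file first.

```python
from pathlib import Path

# Path taken from the command quoted in the message above.
RANDOMIZE_VA_SPACE = Path("/proc/sys/kernel/randomize_va_space")


def read_setting(path=RANDOMIZE_VA_SPACE):
    """Return the current value as an int (0 means randomization is off)."""
    return int(Path(path).read_text().strip())


def disable_randomization(path=RANDOMIZE_VA_SPACE):
    """Write 0 to the setting and return the previous value.

    Needs root when pointed at the real /proc file; this has the same
    effect as `echo 0 > /proc/sys/kernel/randomize_va_space`.
    """
    previous = read_setting(path)
    Path(path).write_text("0\n")
    return previous
```

Recording the previous value makes it easy to restore it later. A value written through /proc this way does not survive a reboot, so it has to be set again after one.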
Re: Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
On Tue, Aug 30, 2005 at 10:51:18AM +0100, Hugh Dickins wrote:
> On Tue, 30 Aug 2005, Tim Weippert wrote:
> >
> > It seems the error is still there:
> >
> > Aug 29 19:42:09 montdsnsu3 kernel: swap_free: Bad swap file entry c87fc02f
> > Aug 29 19:42:09 montdsnsu3 kernel: Kernel BUG at "mm/rmap.c":493
> > Aug 29 19:42:09 montdsnsu3 kernel: invalid operand: [3] SMP
> > Aug 29 19:42:09 montdsnsu3 kernel: Pid: 26550, comm: sh Tainted: GB
>
> Thanks a lot for trying and reporting back. That's bad news: I was
> willing to bet that the MSR would fix it. Sorry for wasting your time.
>
> Not quite conclusive though: I think (from the "[3]" above) that you've
> not rebooted since the earlier failures? (Nor did I suggest you should.)
>
> It's conceivable (though not very likely) that here you have the error
> reported on exit from a long-running "sh", running since before you made
> the MSR fix (the error I'm thinking of occurs when originally exec'ed,
> but may pass unnoticed while running).

Yes, it is possible that the sh was running before the changes were made. But the later problems suggest to me that this does not entirely fix the problem.

> > Bongani Hlope suggested that I try this:
> >
> > echo 0 > /proc/sys/kernel/randomize_va_space and look for
> > http://bugzilla.kernel.org/show_bug.cgi?id=4851
>
> Please do try that. And if no luck with that, next time it's convenient
> for you to reboot, please write the MSR as early as you can to see if
> that makes any difference (probably not, but there's a chance).

Well, I haven't had any entries for the last 5 hours. The last ones were:

Aug 30 08:52:05 montdsnsu3 kernel: ping[6241] general protection rip:2aaaed43 rsp:7f9bff00 error:0
Aug 30 08:54:04 montdsnsu3 kernel: grep[7422] general protection rip:2aaaed43 rsp:7fdbf9b0 error:0
Aug 30 09:14:01 montdsnsu3 kernel: ping[22938] general protection rip:2aaaed43 rsp:7f9c04b0 error:0

Maybe the randomize_va_space setting will fix or suppress it ... I will wait until tomorrow, and then I think I will reboot the machine and do both: the MSR write and the randomize_va_space setting.

Thanks for your help,
tim

--
Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/
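
The "last 5 hours" observation can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, the year has to be supplied because syslog stamps omit it, and the "now" value is an assumption taken from the send time of this message as it appears in the quote header of the follow-up (Aug 30, 2005, 02:35:29PM +0200), on the further assumption that the log uses the same local time. Parsing the month name with `%b` also assumes an English or C locale.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def quiet_period(log_lines, now, year):
    """Time elapsed between the newest 'general protection' line and now."""
    times = [
        parse_syslog_time(line, year)
        for line in log_lines
        if "general protection" in line
    ]
    return now - max(times)


# The three most recent fault lines quoted in the message above.
LINES = [
    "Aug 30 08:52:05 montdsnsu3 kernel: ping[6241] general protection rip:2aaaed43 rsp:7f9bff00 error:0",
    "Aug 30 08:54:04 montdsnsu3 kernel: grep[7422] general protection rip:2aaaed43 rsp:7fdbf9b0 error:0",
    "Aug 30 09:14:01 montdsnsu3 kernel: ping[22938] general protection rip:2aaaed43 rsp:7f9c04b0 error:0",
]

# Assumed "now": the send time of this message, in the log's local time.
gap = quiet_period(LINES, datetime(2005, 8, 30, 14, 35, 29), 2005)
print(gap)  # 5:21:28
```

Under those assumptions the gap comes out at a little over five hours, which matches the statement in the message.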
Re: Bad page state on AMD Opteron Dual System with kernel 2.6.13-rc6-git13
Hello all,

For now I can confirm that the problem has disappeared!

I did the following:

First I tried the MSR fix. This didn't solve the problem entirely, but there were no more kernel panics.

With the randomize_va_space setting, the general protection messages disappeared too ...

I'm happy now and am going on holiday (4 weeks *g*). After that I will check whether the machine's problems come back, but I don't think they will ...

Thank you all for your help!

On Tue, Aug 30, 2005 at 02:35:29PM +0200, Tim Weippert wrote:
> On Tue, Aug 30, 2005 at 10:51:18AM +0100, Hugh Dickins wrote:
> > It's conceivable (though not very likely) that here you have the error
> > reported on exit from a long-running "sh", running since before you made
> > the MSR fix (the error I'm thinking of occurs when originally exec'ed,
> > but may pass unnoticed while running).
>
> Yes, it is possible that the sh was running before the changes were made.
> But the later problems suggest to me that this does not entirely fix the
> problem.
>
> > > Bongani Hlope suggested that I try this:
> > >
> > > echo 0 > /proc/sys/kernel/randomize_va_space and look for
> > > http://bugzilla.kernel.org/show_bug.cgi?id=4851
> >
> > Please do try that. And if no luck with that, next time it's convenient
> > for you to reboot, please write the MSR as early as you can to see if
> > that makes any difference (probably not, but there's a chance).

--
Every time I think I know where it's at, they move it.

Tim Weippert <[EMAIL PROTECTED]>
http://www.topf-sicret.org/