Re: Problems with bluesmoke.c in 2.2.17
Hi!

> > The other day I got the patch for 2.2.17 and after just over a day of normal
> > operation, while my sister was playing kpat (KDE solitaire) yesterday
> > afternoon, X died and dropped her out to the console.
> > After she told me about it later on I found this at the bottom of my dmesg:
> >
> > CPU 0: Machine Check Exception: 0004<0>Bank 3: b2080a01general protection fault:
>
> Ok I print low,high which was wrong - it should read 0004
> which is 'machine check in progress'
>
> So it's a real machine check
>
> > CPU:0
> > EIP:0010:[c010e59b]
>
> Oh beautiful. This is wonderful. I've been hoping for a chance to test the MCE
> code. Ok you might not agree 8)
>
> Right there is missing \n I'll go fix but the rest of it says
>
> b2   - register valid, uncorrected error, error enabled
> 0008 - model specific data
> 0a01 - memory access,
>        generic error
>        l1 cache
>        processor responding to request

Does that mean that the processor itself detected that its L1 cache is failing and
was honest enough to tell the OS?

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: Problems with bluesmoke.c in 2.2.17
> > b2   - register valid, uncorrected error, error enabled
> > 0008 - model specific data
> > 0a01 - memory access,
> >        generic error
> >        l1 cache
> >        processor responding to request
>
> Does that mean that processor itself detected that its L1 cache is failing and
> was honest enough to tell the OS?

That is what the MCE feature is meant to do. On current CPUs it's not too hot on
reporting errors and recovering, but the first stage to that kind of mainframe
level stuff is in current x86 silicon, yes.
Re: Problems with bluesmoke.c in 2.2.17
> And here's another sample output for you:
>
> CPU 1: Machine Check Exception: 0004<0>Bank 0: f20001000800general protection fault:
> CPU:1
> EIP:0010:[mcheck_fault+263/368]
> EFLAGS: 00010246
> ...
>
> I seldom get a log entry, most of the time I get the first line on all my
> xterms and then a hard lock.
>
> And Alan, in case you don't know about it, there's a kernel module
> that reports RAM ECC errors at http://www.anime.net/~goemon/linux-ecc/
> It seems related, maybe you will find some of the code useful.

It handles the bus/chipset layer above what the MCE code monitors. It's the
other half of the code. And yes, both are important. It's difficult to claim a
highly available and reliable system when it doesn't detect and log errors for
analysis.

I've fixed a couple of bugs; when 2.2.18pre4 is out can you try it and get
another trap out of it? It should this time log it nicely and panic cleanly,
not oops.
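For illustration only, the two output problems raised in this thread (the words printed low,high instead of high,low, and a missing \n between the two messages) might be avoided along the lines of the sketch below. It is Python rather than kernel C, the function name format_mce_report and the exact line layout are hypothetical, and the 32-bit word values are assumed reconstructions, since the archived log lines have lost digits. It is not the actual bluesmoke.c code.

```python
# Hypothetical sketch: print each 64-bit value high word first, one report
# per line, so that the next kernel message cannot run into the same line.

MCIP = 1 << 2  # "machine check in progress" bit in the low word of the global status


def format_mce_report(cpu, mcg_high, mcg_low, bank, status_high, status_low):
    """Return the log lines for one machine check, high word before low word."""
    return [
        f"CPU {cpu}: Machine Check Exception: {mcg_high:08x}{mcg_low:08x}",
        f"Bank {bank}: {status_high:08x}{status_low:08x}",
    ]


# Assumed values: global status 0x4 (machine check in progress) and a bank 3
# status whose low word matches the eax value shown in the register dump.
lines = format_mce_report(0, 0x00000000, 0x00000004, 3, 0xB2000000, 0x00080A01)
in_progress = bool(0x00000004 & MCIP)
print("\n".join(lines))
```

Joining the lines with "\n" is the point of the sketch: each report ends before the next message (here the general protection fault text) begins.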
Re: Problems with bluesmoke.c in 2.2.17
> Based on what bluesmoke.c said about my 2nd PII-333 CPU I just got
> Intel to give me an RMA number for its replacement. Thank you Alan Cox ;-)

I'd like to finish verifying the code first but umm ok. Do send me traces if
you get any of these exceptions. I've still had no answer to my request for
Intel to testbed verify this stuff.
Re: Problems with bluesmoke.c in 2.2.17
Based on what bluesmoke.c said about my 2nd PII-333 CPU I just got Intel to
give me an RMA number for its replacement. Thank you Alan Cox ;-)

: ~v
Re: Problems with bluesmoke.c in 2.2.17
> The other day I got the patch for 2.2.17 and after just over a day of normal
> operation, while my sister was playing kpat (KDE solitaire) yesterday
> afternoon, X died and dropped her out to the console.
> After she told me about it later on I found this at the bottom of my dmesg:
>
> CPU 0: Machine Check Exception: 0004<0>Bank 3: b2080a01general protection fault:

Ok I print low,high which was wrong - it should read 0004
which is 'machine check in progress'

So it's a real machine check

> CPU:0
> EIP:0010:[c010e59b]

Oh beautiful. This is wonderful. I've been hoping for a chance to test the MCE
code. Ok you might not agree 8)

Right there is missing \n I'll go fix but the rest of it says

b2   - register valid, uncorrected error, error enabled
0008 - model specific data
0a01 - memory access,
       generic error
       l1 cache
       processor responding to request

So there we are - a real live CPU error

> Sep 7 17:51:57 fury kernel: CPU 0: Machine Check Exception: 0004<0>Bank 0: b200 00800800general protection fault:

b2   - register valid, uncorrected error, error enabled
0008 - model specific data
0800 - memory access,
       local processor originated request,
       l0
       generic error
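The field-by-field reading above can be sketched as a small decoder. This is a hedged illustration in Python, not the bluesmoke.c code: the function name decode_bank_status and the table names are made up for the example, the bit layout is the Intel machine check architecture layout as I understand it (flag bits at the top of the 64-bit bank status, a 16-bit model-specific code, then a 16-bit error code whose bus/interconnect form is 0000 1PPT RRRR IILL), and the full word values b2000000 / 00080a01 and b2000000 / 00080800 are assumed reconstructions, because the archived log lines have lost digits. Under this layout the b2 byte also sets a "processor context corrupt" bit that the decode above does not list.

```python
# Flag bits within the HIGH 32-bit word of a 64-bit bank status value
# (bit 31 of the high word is bit 63 of the full register).
HIGH_WORD_FLAGS = [
    (31, "register valid"),
    (30, "error overflow"),
    (29, "uncorrected error"),
    (28, "error enabled"),
    (27, "misc register valid"),
    (26, "address register valid"),
    (25, "processor context corrupt"),
]

PARTICIPATION = {
    0: "local processor originated request",
    1: "local processor responded to request",
    2: "local processor observed error as third party",
    3: "generic",
}
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}
TRANSACTION = {0: "memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}
LEVEL = {0: "l0", 1: "l1", 2: "l2", 3: "generic level"}


def decode_bank_status(high, low):
    """Split a bank status (given as two 32-bit words) into readable fields."""
    result = {
        "flags": [name for bit, name in HIGH_WORD_FLAGS if high & (1 << bit)],
        "model_specific": (low >> 16) & 0xFFFF,
        "mca_code": low & 0xFFFF,
    }
    code = result["mca_code"]
    # Bus/interconnect errors have the form 0000 1PPT RRRR IILL.
    if (code & 0xF800) == 0x0800:
        result["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
            "transaction": TRANSACTION[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


first = decode_bank_status(0xB2000000, 0x00080A01)   # the Bank 3 report
second = decode_bank_status(0xB2000000, 0x00080800)  # the Bank 0 report
print(first)
print(second)
```

With those assumed inputs the first report decodes to a memory access, generic error, l1, local processor responded to request, and the second to a memory access, generic error, l0, local processor originated request, which agrees with the decode given in the message.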
Problems with bluesmoke.c in 2.2.17
The other day I got the patch for 2.2.17 and after just over a day of normal
operation, while my sister was playing kpat (KDE solitaire) yesterday
afternoon, X died and dropped her out to the console.
After she told me about it later on I found this at the bottom of my dmesg:

CPU 0: Machine Check Exception: 0004<0>Bank 3: b2080a01general protection fault:
CPU:0
EIP:0010:[c010e59b]
EFLAGS: 00010246
eax: 00080a01 ebx: 3200 ecx: 040d edx: 3200
esi: 000c edi: 0003 ebp: 0003 esp: c8931f98
ds: 0018 es: 0018 ss: 0018
Process kwm (pid: 998, process nr: 33, stackpage=c8931000)
Stack: bfffe458 bfffe438 0005 c893 040d 0004 00080a01 c010a035
       c8931fc4 40276c60 4055b548 0028 080dd4f0 bfffe458 bfffe438 080dd4f0
       002b 002b 40189e68 0023 00010216
Call Trace: [c010a035]
Code: 0f 30 a1 04 e4 1b c0 89 44 24 10 45 3b 6c 24 10 0f 8c 3b ff

I was a bit surprised to get something like this in a stable kernel (I have
been running 2.3.99 and up with no significant problems, until today anyway).

I rebooted, mostly to change to a smaller fb mode. And then a few hours later,
while in X again and pushing netscape a bit hard maybe, it rebooted itself.
Found this in my syslog afterwards:

Sep 7 17:51:57 fury kernel: CPU 0: Machine Check Exception: 0004<0>Bank 0: b200 00800800general protection fault:
Sep 7 17:51:57 fury kernel: CPU 0: Machine Check Exception: 0004<0>Bank 0: b200 00800800general protection fault:
Sep 7 17:51:57 fury kernel: CPU:0

I tracked the "Machine Check Exception:" bit down to arch/kernel/bluesmoke.c
(a foreboding name for a piece of code). It is a new file according to the
patch and something that wasn't in Alan's list of changes for the release.

Most of it is well over my head, but I was wondering how something like this
gets into a stable kernel when it has worked fine without it for so long and
buggers up when it's added in.

Entire dmesg from that day follows. Thought /proc/cpuinfo might be useful too..

Linux version 2.2.17 (root@fury) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #3 Tue Sep 5 19:20:19 EST 2000
Detected 300686 kHz processor.
Console: colour dummy device 80x25
Calibrating delay loop... 599.65 BogoMIPS
Memory: 193204k/196608k available (748k kernel code, 412k reserved, 2188k data, 56k init)
Dentry hash table entries: 32768 (order 6, 256k)
Buffer cache hash table entries: 262144 (order 8, 1024k)
Page cache hash table entries: 65536 (order 6, 256k)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium II (Klamath) stepping 04
Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.35a (19990819) Richard Gooch ([EMAIL PROTECTED])
PCI: PCI BIOS revision 2.10 entry at 0xfae60
PCI: Using configuration type 1
PCI: Probing PCI hardware
Linux NET4.0 for Linux 2.2
Based upon Swansea University Computer Society NET3.039
NET4: Unix domain sockets 1.0 for Linux NET4.0.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
TCP: Hash tables configured (ehash 262144 bhash 65536)
Starting kswapd v 1.5
vesafb: framebuffer at 0xe600, mapped to 0xcc80, size 4096k
vesafb: mode is 1024x768x8, linelength=1024, pages=4
vesafb: protected mode interface info at c000:0584
vesafb: scrolling: redraw
Console: switching to colour frame buffer device 128x48
fb0: VESA VGA frame buffer device
Detected PS/2 Mouse Port.
pty: 256 Unix98 ptys configured
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.13)
Real Time Clock Driver v1.09
loop: registered device at major 7
PIIX4: IDE controller on PCI bus 00 dev 39
PIIX4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
hda: WDC AC36400L, ATA DISK drive
hdb: CD-ROM 36X/AKU, ATAPI CDROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: WDC AC36400L, 6149MB w/256kB Cache, CHS=784/255/63, UDMA
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
Partition check:
 hda: hda1 hda2 hda3 hda4
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 56k freed
rtl8139.c:v1.07 5/6/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/rtl8139.html
eth0: RealTek RTL8139 Fast Ethernet at 0xe400, IRQ 9, 00:48:54:3f:62:a3.
es1370: version v0.31 time 19:25:42 Sep 5 2000
es1370: found adapter at io 0xe800 irq 10
es1370: features: joystick off, line in, mic impedance 0
es1370: unloading
CPU 0: Machine Check Exception: 0004<0>Bank 3: b2080a01general protection fault:
CPU:0
EIP:0010:[c010e59b]
EFLAGS: 00010246
eax: 00080a01 ebx: 3200 ecx: 040d edx: 3200
esi: 000c edi: 0003 ebp: 0003 esp: c8931f98
ds: 0018 es: 0018 ss:
Re: Problems with bluesmoke.c in 2.2.17
Sorry about that truncated email, this is the rest of what I meant to send.

And here's another sample output for you:

CPU 1: Machine Check Exception: 0004<0>Bank 0: f20001000800general protection fault:
CPU:1
EIP:0010:[mcheck_fault+263/368]
EFLAGS: 00010246
...

I seldom get a log entry, most of the time I get the first line on all my
xterms and then a hard lock.

And Alan, in case you don't know about it, there's a kernel module
that reports RAM ECC errors at http://www.anime.net/~goemon/linux-ecc/
It seems related, maybe you will find some of the code useful.

Thanks again
Andrew