I've been having lots of problems with SMP kernels:
Linux-2.0.3{5,6}, Linux-2.1.127 and Linux-2.2.0.
As you can see, I've been trying different kernels
over time, each time hoping that a newer kernel would
solve the problem, but no luck.
Problem:
Linux crashes/freezes while I'm running simulations which produce lots
of data on my HDDs. This happens with all SMP-enabled kernels mentioned above.
When the same kernels are compiled without SMP, there's no problem.
Sometimes there's no sign of the crash in /var/log/messages;
at other times there are messages like this:
kernel: Unable to handle kernel NULL pointer dereference at virtual address
00000008
kernel: current->tss.cr3 = 045c5000, %cr3 = 045c5000
kernel: *pde = 00000000
kernel: Oops: 0000
kernel: CPU: 1
kernel: EIP: 0010:[<c0160770>]
kernel: EFLAGS: 00013282
kernel: eax: 00000000 ebx: c545b960 ecx: c2a36000 edx: c545b960
kernel: esi: c468e000 edi: 00000020 ebp: 00000000 esp: c468ff44
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process X (pid: 358, process nr: 36, stackpage=c468f000)
kernel: Stack: c545b960 c2a36000 00000010 00000004 c0732240 00000000 00000001
00000145
kernel: c468e000 0000189a c2a36000 c2a36000 c012ea09 00000022 c468ffa8
c468ffa4
kernel: c468e000 00000000 bffff898 bffffa20 00003286 c468ffa8 00000050
c07321e0
kernel: Call Trace: [<c012ea09>] [<c0108c00>] [<c010002b>]
kernel: Code: 8b 40 08 8d 90 a8 00 00 00 8b 80 b0 00 00 00 51 52 53 8b 40
At other times there's a whole series of Oopses,
sometimes with a few minutes between them, followed by a crash:
kernel: Unable to handle kernel paging request at virtual address 756e696c
kernel: current->tss.cr3 = 04d10000, %cr3 = 04d10000
kernel: *pde = 00000000
kernel: Oops: 0002
kernel: CPU: 1
kernel: EIP: 0010:[<c012baac>]
kernel: EFLAGS: 00010206
kernel: eax: 756e696c ebx: c4473f84 ecx: c013a7c4 edx: 00000001
kernel: esi: c4473f85 edi: 00000000 ebp: 00000000 esp: c6f5fcb0
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process analyse_diffcap (pid: 16723, process nr: 112,
stackpage=c6f5f000)
kernel: Stack: 00000000 00000013 000000e7 00000000 00000001 00000001 c012bcd9
c4473f84
kernel: 00000000 00000001 c0091dc0 c4473f86 c6f5e000 00000013 00000000
c013236f
kernel: 00008000 c01323e4 c4473f84 00000000 00000000 c0204df8 fffffff8
c6f5e000
kernel: Call Trace: [<c012bcd9>] [<c013236f>] [<c01323e4>] [<c0131715>]
[<c011c926>] [<c012a4ad>] [<c012a527>]
kernel: [<c012ac23>] [<c012ae2c>] [<c0107a27>] [<c0108be0>]
kernel: Code: ff 00 89 c5 80 3e 00 0f 84 1d 01 00 00 8b 5d 08 83 64 24 2c
kernel: Unable to handle kernel paging request at virtual address 756e696c
kernel: current->tss.cr3 = 063cb000, %cr3 = 063cb000
kernel: *pde = 00000000
kernel: Oops: 0002
kernel: CPU: 1
kernel: EIP: 0010:[<c012baac>]
kernel: EFLAGS: 00010206
kernel: eax: 756e696c ebx: c4473f84 ecx: c013a7c4 edx: 00000001
kernel: esi: c4473f85 edi: 00000000 ebp: 00000000 esp: c663fcb0
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process analyse_diffcap (pid: 16725, process nr: 105,
stackpage=c663f000)
kernel: Stack: 00000000 00000013 000000e7 00000000 00000001 00000001 c012bcd9
c4473f84
kernel: 00000000 00000001 c6fc5020 c4473f86 c663e000 00000013 00000000
c013236f
kernel: 00008000 c01323e4 c4473f84 00000000 00000000 c0204df8 fffffff8
c663e000
kernel: Call Trace: [<c012bcd9>] [<c013236f>] [<c01323e4>] [<c0131715>]
[<c011c926>] [<c012a4ad>] [<c012a527>]
kernel: [<c012ac23>] [<c012ae2c>] [<c0107a27>] [<c0108be0>]
kernel: Code: ff 00 89 c5 80 3e 00 0f 84 1d 01 00 00 8b 5d 08 83 64 24 2c
kernel: Unable to handle kernel paging request at virtual address 756e696c
kernel: current->tss.cr3 = 06445000, %cr3 = 06445000
kernel: *pde = 00000000
kernel: Oops: 0002
kernel: CPU: 1
kernel: EIP: 0010:[<c012baac>]
kernel: EFLAGS: 00010206
kernel: eax: 756e696c ebx: c4473f84 ecx: c013a7c4 edx: 00000001
kernel: esi: c4473f85 edi: 00000000 ebp: 00000000 esp: c663fbc8
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process analyse_diffcap (pid: 16726, process nr: 105,
stackpage=c663f000)
kernel: Stack: 00000000 00000013 000000e7 00000000 00000001 00000001 c012bcd9
c4473f84
kernel: 00000000 00000001 c6fc5140 c4473f86 c663e000 00000013 00000000
c013236f
kernel: 00008000 c01323e4 c4473f84 00000000 00000000 c0204df8 fffffff8
c663e000
kernel: Call Trace: [<c012bcd9>] [<c013236f>] [<c01323e4>] [<c0131715>]
[<c011c926>] [<c012a4ad>] [<c012a527>]
kernel: [<c012ac23>] [<c0135449>] [<c0120068>] [<c012a282>] [<c012a282>]
[<c013547b>] [<c012ac23>] [<c012ae2c>]
kernel: [<c0107a27>] [<c0108be0>]
kernel: Code: ff 00 89 c5 80 3e 00 0f 84 1d 01 00 00 8b 5d 08 83 64 24 2c
(... a lot more Oops messages follow :-( )
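For anyone reading the traces above, here is a minimal sketch in Python (not part of the logs) that decodes two of the fields. It assumes the standard i386 page-fault error-code layout: bit 0 separates a protection violation from a not-present page, bit 1 a write from a read, and bit 2 user mode from kernel mode. The function names are hypothetical. It also prints the faulting address 756e696c as little-endian bytes, which come out as ASCII text; that may hint at a pointer overwritten by string data, but it is only one possible reading.

```python
def decode_oops_error(code_hex):
    """Decode the low three bits of an i386 page-fault error code.

    Assumed layout: bit 0 = protection violation (else page not present),
    bit 1 = write (else read), bit 2 = user mode (else kernel mode).
    """
    code = int(code_hex, 16)
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def address_as_ascii(addr_hex):
    """Interpret a 32-bit address as four little-endian bytes of text."""
    raw = int(addr_hex, 16).to_bytes(4, "little")
    # Non-ASCII bytes are replaced rather than raising an error.
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    # The two error codes and the faulting address quoted in this mail.
    for code in ("0000", "0002"):
        print(code, decode_oops_error(code))
    print(repr(address_as_ascii("756e696c")))
```

Under these assumptions, "Oops: 0000" is a kernel-mode read of a not-present page and "Oops: 0002" is a kernel-mode write to one, and the address 756e696c decodes to the four characters "linu".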
BTW1: "analyse_diffcap" is a shell script which calls lots of
Fortran programs.
BTW2: the Oopses do not always report CPU 1; sometimes they report CPU 0.
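Since the Oopses show up on both CPUs, it can help to tally them across a whole log. The sketch below is one way to do that in Python. The regular expressions are assumptions based only on the "kernel:" lines quoted in this mail, and the function name summarise_oopses is hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the "kernel: ..." lines quoted in this report (assumed format).
CPU_RE = re.compile(r"kernel:\s+CPU:\s+(\d+)")
EIP_RE = re.compile(r"kernel:\s+EIP:\s+[0-9a-fA-F]+:\[<([0-9a-fA-F]+)>\]")
PROC_RE = re.compile(r"kernel:\s+Process\s+(\S+)\s+\(pid:\s+(\d+)")


def summarise_oopses(lines):
    """Count how often each CPU, EIP and process name appears in Oops records."""
    cpus, eips, procs = Counter(), Counter(), Counter()
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            cpus[match.group(1)] += 1
        match = EIP_RE.search(line)
        if match:
            eips[match.group(1)] += 1
        match = PROC_RE.search(line)
        if match:
            procs[match.group(1)] += 1
    return cpus, eips, procs


if __name__ == "__main__":
    # A few lines copied from the traces in this mail.
    sample = [
        "kernel: CPU: 1",
        "kernel: EIP: 0010:[<c0160770>]",
        "kernel: Process X (pid: 358, process nr: 36, stackpage=c468f000)",
        "kernel: CPU: 1",
        "kernel: EIP: 0010:[<c012baac>]",
        "kernel: Process analyse_diffcap (pid: 16723, process nr: 112,",
    ]
    for counter in summarise_oopses(sample):
        print(dict(counter))
```

To run it over a real log, pass an open file instead of the sample list, for example summarise_oopses(open("/var/log/messages")). If one EIP dominates regardless of CPU, that points at a single code path rather than at one processor.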
I don't know if it's related, but at bootup the kernel always
disables DMA on hdd and hdc:
kernel: hdd: timeout waiting for DMA
kernel: hdd: irq timeout: status=0x58 { DriveReady SeekComplete DataRequest }
kernel: hdc: DMA disabled
kernel: hdd: DMA disabled
kernel: ide1: reset: success
Some system info:
Architecture: dual PII-333, same stepping
Motherboard: SuperMicro P6DLE
Bios: AmiBios R1.33
/sbin/hdparm -i /dev/hda:
Model=WDC AC32500H, FwRev=10.07H09, SerialNo=WD-WT360
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=4960/16/63, TrkSize=57600, SectSize=600, ECCbytes=22
BuffType=3(DualPortCache), BuffSize=128kB, MaxMultSect=16, MultSect=off
DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=0(slow)
CurCHS=620/128/63, CurSects=4999680, LBA=yes, LBAsects=4999680
tDMA={min:120,rec:120}, DMA modes: mword0 mword1 *mword2
IORDY=on/off, tPIO={min:160,w/IORDY:120}, PIO modes: mode3 mode4
/sbin/hdparm -i /dev/hdc:
Model=Maxtor 90840D6, FwRev=WAS82739, SerialNo=K606F9YA
Config={ Fixed }
RawCHS=16276/16/63, TrkSize=0, SectSize=0, ECCbytes=29
BuffType=3(DualPortCache), BuffSize=256kB, MaxMultSect=16, MultSect=off
DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=2(fast)
CurCHS=1021/255/63, CurSects=16406208, LBA=yes, LBAsects=16406208
tDMA={min:120,rec:120}, DMA modes: mword0 mword1 mword2
IORDY=on/off, tPIO={min:120,w/IORDY:120}, PIO modes: mode3 mode4
/sbin/hdparm -i /dev/hdd:
Model=Maxtor 90840D5, FwRev=PAS23B15, SerialNo=V5063RKA
Config={ Fixed }
RawCHS=16351/16/63, TrkSize=0, SectSize=0, ECCbytes=29
BuffType=3(DualPortCache), BuffSize=256kB, MaxMultSect=16, MultSect=off
DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=2(fast)
CurCHS=16351/16/63, CurSects=16481808, LBA=yes, LBAsects=16481808
tDMA={min:120,rec:120}, DMA modes: mword0 mword1 mword2
IORDY=on/off, tPIO={min:120,w/IORDY:120}, PIO modes: mode3 mode4
Could it be the motherboard, the HDDs, the CPUs, the RAM,
or the kernel?
Let me know if anyone needs more info.
Many thanks for any suggestions.
Cheers,
Matthijs
__________________________________________________
Ir Matthijs van Leeuwen ([EMAIL PROTECTED])
Room 304, c/o Ms. Angela Frederick
Department of Civil and Environmental Engineering
Imperial College of Science, Technology & Medicine
Exhibition Road, London SW7 2BU, United Kingdom
Tel: (44) 171 594 6120 Fax: (44) 171 594 6124
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]