Hi:

I'm having problems with a new SuperMicro PIIIDM3 motherboard. After
testing with a second motherboard, I have concluded that the hardware
itself is good. The problem I describe below is repeatable, and when
logging is available (often it is not, because of the crash), the log
message says to report the unknown APIC to you. I tried to send a more
detailed description, but the machine crashed while I was working with
the log files. The filesystem damage tends to destroy some of the
information needed for a report, so I am sending this briefer
description and can send more if you would find it useful.

The SuperMicro PIIIDM3 motherboard (see http://www.supermicro.com) is a
dual-CPU, i840-chipset board with 4x AGP, onboard Ultra160 SCSI, onboard
IDE, and onboard ethernet. Perhaps its most unusual feature for Intel
architecture is the 64-bit, 66 MHz bus slots (the Ultra160 controller is
tied to this bus). Also, I believe that instead of standard memory
translation hubs it uses memory repeater hubs, interleaving matched
DIMM modules much like RAID striping. The kernels tested, all of which
fail, are the SMP versions of 2.2.12 through 2.2.16. Uniprocessor
kernels also fail, but this is a dual Intel-brand PIII machine.

Attached are two files. One contains log excerpts and comments. The
other contains SCSI error messages, copied by hand: after a crash no
filesystem of any kind, /proc or physical, is available, although shell
builtins and magic sysrq still work.

Messages from /var/log/messages, when they manage to be written
(whether logging works depends on how quickly filesystem access is
destroyed), mention your email address and ask for the unknown APIC to
be reported. See the attachment, log.txt.

Another error sometimes makes its way to the logs, sometimes mentioning
cpu 0, sometimes cpu 1, but always irq vector 217:
kernel: unexpected IRQ vector 217 on CPU#0!
kernel: unexpected IRQ vector 217 on CPU#1!

The message appears to be generated from linux kernel file
arch/i386/kernel/irq.h.
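
One way to relate the vector number to the boot log in log.txt is sketched
below. It assumes the "unexpected IRQ vector" message prints the vector in
decimal and that the Vect column of the IRQ redirection table is hexadecimal;
neither is confirmed here. The rows are copied from the 2.2.16-3 boot log,
while the vector 217 messages came from earlier kernels, so the match is
only suggestive. The names ROWS and find_vector are made up for this
illustration.

```python
# Rows copied from the IRQ redirection tables in log.txt (2.2.16-3 boot).
# Only the rows with high vector numbers are listed here.
ROWS = {
    "IO APIC #2": [
        " 10 0FF 0F  1    1    0   1   0    1    1    C1",
        " 11 0FF 0F  1    1    0   1   0    1    1    C9",
        " 13 0FF 0F  1    1    0   1   0    1    1    D1",
    ],
    "IO APIC #3": [
        " 08 0FF 0F  1    1    0   1   0    1    1    D9",
    ],
}


def find_vector(vector, rows=ROWS):
    """Return (apic, pin, mask) for every row whose Vect column equals vector."""
    hits = []
    for apic, lines in rows.items():
        for line in lines:
            # Columns: NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect
            cols = line.split()
            pin = int(cols[0], 16)    # NR is printed in hex (0a, 0b, ...)
            mask = int(cols[3])
            vect = int(cols[10], 16)  # assumption: Vect is hex
            if vect == vector:
                hits.append((apic, pin, mask))
    return hits


if __name__ == "__main__":
    print(hex(217))          # decimal 217 written as hex
    print(find_vector(217))  # which table row, if any, carries that vector
```

Under those assumptions 217 is 0xD9, which appears only on pin 08 of
IO APIC #3, a row whose Mask column reads 1 at that point in the boot.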

After this message, all file systems are certain to die, though not
simultaneously. This unrecoverable failure occurs when reading many
files quickly on any file system type, on any controller, whether IDE
or SCSI, or even in RAM (/proc fails too). Shell builtins continue to
work, but any attempt to access a file causes the shell to fail. Magic
sysrq does work, but I can't log its output; I must copy it by hand to
another machine or to paper. Mounting or unmounting any device or
filesystem type is also highly risky, and mounting the CD-ROM on IDE is
one way to reproduce the problem. When the system is killed through IDE
usage, SCSI works for a while, but the system appears to destabilize,
and SCSI also fails within a few minutes. /proc becomes inaccessible as
well.
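
The "reading many files quickly" trigger could be scripted along these
lines. This is only a sketch of the kind of load described above, not the
procedure actually used for this report; the function name read_tree and
the example path are hypothetical.

```python
import os


def read_tree(root, chunk_size=65536):
    """Read every regular file under root once.

    Returns (files_read, bytes_read). Symlinks, sockets, and device nodes
    are skipped so the walk only generates plain file reads.
    """
    files_read = 0
    bytes_read = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
            except OSError:
                continue  # unreadable file: skip it and keep going
            files_read += 1
    return files_read, bytes_read


if __name__ == "__main__":
    # Hypothetical target directory; any tree with many small files will do.
    print(read_tree("/usr/share"))
```

Running sync first, as is done before startx, limits how much unwritten
data is lost if the machine locks up during the run.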

Since X reads a significant amount of data on startup, starting it has
also become risky (I sync before running startx). I do not believe the
aic7xxx SCSI driver is responsible; the new APIC technology seems the
more likely cause. I can easily reproduce the crash by reading too many
files, or by mounting or unmounting any device, on any controller, with
any file system type. The remaining controllers stay stable for a few
minutes after the first device dies (if the first device is the SCSI
controller for the root partition, logging dies immediately).

In the past I have spoken with SuperMicro engineers on other subjects
and found them quite interested in helping with any issue, despite
being busy. I believe they would help with any APIC questions. Is there
any information, logging, testing, or other help I could provide to
help solve this? I am very interested in seeing Linux succeed on
higher-performance motherboards and systems. Please don't hesitate to
ask if you want more information or testing; this is very important to
me.

Thanks,
Dan Stimits, [EMAIL PROTECTED]
////////////////////////////////////////////////////////////////////////////////////////////////
From several /var/log/messages locations before system lockup (hard lockup), after a
large number of small files are manipulated, or any mount point is mounted and
remounted too quickly (includes IDE or SCSI, any file system type), using any kernel
2.2.12 through 2.2.15 SMP:
Jun  4 20:40:19 thanteros kernel: unexpected IRQ vector 217 on CPU#0!
Jun  5 23:06:18 thanteros kernel: unexpected IRQ vector 217 on CPU#1!

It appears that logging before hard lockup depends on which CPU is handling logging at
the time one dies, but lockup and filesystem failure always occur within a second
regardless of CPU number.

Kernel 2.2.14 or 2.2.15 (not sure which), SMP:

Jun  1 22:30:34 thanteros kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000050
Jun  1 22:30:34 thanteros kernel: current->tss.cr3 = 09065000, %cr3 = 09065000
Jun  1 22:30:34 thanteros kernel: *pde = 00000000
Jun  1 22:30:34 thanteros kernel: Oops: 0000
Jun  1 22:30:34 thanteros kernel: CPU:    0
Jun  1 22:30:34 thanteros kernel: EIP:    0010:[isonum_731+8/41]
Jun  1 22:30:34 thanteros kernel: EFLAGS: 00010246
Jun  1 22:30:34 thanteros kernel: eax: 00000000   ebx: 00000050   ecx: 00000000   edx: 00000050
Jun  1 22:30:34 thanteros kernel: esi: 8ffdc406   edi: 0000009c   ebp: 8d420600   esp: 8b929e6c
Jun  1 22:30:34 thanteros kernel: ds: 0018   es: 0018   ss: 0018
Jun  1 22:30:34 thanteros kernel: Process mount (pid: 1600, process nr: 85, stackpage=8b929000)
Jun  1 22:30:34 thanteros kernel: Stack: 8015889f 00000050 8d420600 00000300 80259eac 00000300 00000300 00000064
Jun  1 22:30:34 thanteros kernel:        00000300 0000000a 00000000 00000011 00000000 00000003 8ffdc400 00000000
Jun  1 22:30:34 thanteros kernel:        00000000 00000000 8ffdd6e0 6e79796e 0000756e 00000400 0000016d 80120000
Jun  1 22:30:34 thanteros kernel: Call Trace: [isofs_read_super+814/1624] [do_generic_file_read+618/1507] [read_super+146/183] [do_mount+161/267] [cprt+23202/30048] [sys_mount+719/824] [cprt+23202/30048]
Jun  1 22:30:34 thanteros kernel:        [system_call+52/56]
Jun  1 22:30:34 thanteros kernel: Code: 8a 0a 8a 42 01 c1 e0 08 09 c1 31 c0 8a 42 02 c1 e0 10 09 c1
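
A possible reading of the oops above, sketched in Python rather than kernel
C: the faulting address 00000050 equals the byte offset of the volume space
size field in an ISO 9660 primary volume descriptor, which would fit
isofs_read_super calling isonum_731 on that field through a NULL descriptor
pointer. That is an inference from the register dump, not something the log
proves. The helper names below are for illustration only, and the byte-wise
little-endian read mirrors what the Code: bytes appear to do.

```python
import struct

# Fields that precede volume_space_size in an ISO 9660 primary volume
# descriptor, packed without padding ("<"):
#   type (1), standard id (5), version (1), unused (1),
#   system id (32), volume id (32), unused (8)
FIELDS_BEFORE_VOLUME_SPACE_SIZE = "<B5sBB32s32s8s"


def volume_space_size_offset():
    """Byte offset of the volume space size field within the descriptor."""
    return struct.calcsize(FIELDS_BEFORE_VOLUME_SPACE_SIZE)


def isonum_731(buf):
    """Read a 32-bit little-endian value one byte at a time."""
    return buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)


if __name__ == "__main__":
    # NULL base pointer (0) plus the field offset, compared with the oops.
    print(hex(0 + volume_space_size_offset()))
    print(hex(isonum_731(bytes([0x78, 0x56, 0x34, 0x12]))))
```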



////////////////////////////////////////////////////////////////////////////////////////////////
FROM /var/log/messages on 2.2.16-3, SMP (other APIC failure messages have made it to
the console with 2.2.12 through 2.2.15 SMP, but logging hasn't been as successful, SCSI
died before it could log):

Jun 24 21:52:43 thanteros kernel: enabling symmetric IO mode... ...done.
Jun 24 21:52:43 thanteros kernel: ENABLING IO-APIC IRQs
Jun 24 21:52:43 thanteros kernel: init IO_APIC IRQs
-9, 3-10, 3-11, 3-12, 3-13, 3-14, 3-15, 3-16, 3-17, 3-18, 3-19, 3-20, 3-21, 3-22, 3-23 not connected.
Jun 24 21:52:43 thanteros kernel: number of MP IRQ sources: 25.
Jun 24 21:52:43 thanteros kernel: number of IO-APIC #2 registers: 24.
Jun 24 21:52:43 thanteros kernel: number of IO-APIC #3 registers: 24.
Jun 24 21:52:43 thanteros kernel: testing the IO APIC.......................
Jun 24 21:52:43 thanteros kernel:
Jun 24 21:52:43 thanteros kernel: IO APIC #2......
Jun 24 21:52:43 thanteros kernel: .... register #00: 02000000
Jun 24 21:52:43 thanteros kernel: .......    : physical APIC id: 02
Jun 24 21:52:43 thanteros kernel: .... register #01: 00170020
Jun 24 21:52:43 thanteros kernel: .......     : max redirection entries: 0017
Jun 24 21:52:43 thanteros kernel: .......     : IO APIC version: 0020
Jun 24 21:52:43 thanteros kernel:  WARNING: unexpected IO-APIC, please mail
Jun 24 21:52:43 thanteros kernel:           to [EMAIL PROTECTED]
Jun 24 21:52:43 thanteros kernel: .... register #02: 00000000
Jun 24 21:52:43 thanteros kernel: .......     : arbitration: 00
Jun 24 21:52:43 thanteros kernel: .... IRQ redirection table:
Jun 24 21:52:43 thanteros kernel:  NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
Jun 24 21:52:43 thanteros kernel:  00 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  01 000 00  0    0    0   0   0    1    1    59
Jun 24 21:52:43 thanteros kernel:  02 0FF 0F  0    0    0   0   0    1    1    51
Jun 24 21:52:43 thanteros kernel:  03 000 00  0    0    0   0   0    1    1    61
Jun 24 21:52:43 thanteros kernel:  04 000 00  0    0    0   0   0    1    1    69
Jun 24 21:52:43 thanteros kernel:  05 000 00  0    0    0   0   0    1    1    71
Jun 24 21:52:43 thanteros kernel:  06 000 00  0    0    0   0   0    1    1    79
Jun 24 21:52:43 thanteros kernel:  07 000 00  0    0    0   0   0    1    1    81
Jun 24 21:52:43 thanteros kernel:  08 000 00  0    0    0   0   0    1    1    89
Jun 24 21:52:43 thanteros kernel:  09 000 00  0    0    0   0   0    1    1    91
Jun 24 21:52:43 thanteros kernel:  0a 000 00  0    0    0   0   0    1    1    99
Jun 24 21:52:43 thanteros kernel:  0b 000 00  0    0    0   0   0    1    1    A1
Jun 24 21:52:43 thanteros kernel:  0c 000 00  0    0    0   0   0    1    1    A9
Jun 24 21:52:43 thanteros kernel:  0d 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  0e 000 00  0    0    0   0   0    1    1    B1
Jun 24 21:52:43 thanteros kernel:  0f 000 00  0    0    0   0   0    1    1    B9
Jun 24 21:52:43 thanteros kernel:  10 0FF 0F  1    1    0   1   0    1    1    C1
Jun 24 21:52:43 thanteros kernel:  11 0FF 0F  1    1    0   1   0    1    1    C9
Jun 24 21:52:43 thanteros kernel:  12 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  13 0FF 0F  1    1    0   1   0    1    1    D1
Jun 24 21:52:43 thanteros kernel:  14 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  15 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  16 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:  17 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:43 thanteros kernel:
Jun 24 21:52:43 thanteros kernel: IO APIC #3......
Jun 24 21:52:43 thanteros kernel: .... register #00: 03000000
Jun 24 21:52:43 thanteros kernel: .......    : physical APIC id: 03
Jun 24 21:52:43 thanteros kernel: .... register #01: 00178020
Jun 24 21:52:43 thanteros kernel: .......     : max redirection entries: 0017
Jun 24 21:52:44 thanteros kernel: .......     : IO APIC version: 0020
Jun 24 21:52:44 thanteros kernel:  WARNING: unexpected IO-APIC, please mail
Jun 24 21:52:44 thanteros kernel:           to [EMAIL PROTECTED]
Jun 24 21:52:44 thanteros kernel:  WARNING: unexpected IO-APIC, please mail
Jun 24 21:52:44 thanteros kernel:           to [EMAIL PROTECTED]
Jun 24 21:52:44 thanteros kernel: .... register #02: 0F000000
Jun 24 21:52:44 thanteros kernel: .......     : arbitration: 0F
Jun 24 21:52:44 thanteros kernel: .... IRQ redirection table:
Jun 24 21:52:44 thanteros kernel:  NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
Jun 24 21:52:44 thanteros kernel:  00 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  01 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  02 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  03 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  04 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  05 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  06 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  07 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  08 0FF 0F  1    1    0   1   0    1    1    D9
Jun 24 21:52:44 thanteros kernel:  09 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0a 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0b 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0c 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0d 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0e 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  0f 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  10 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  11 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  12 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  13 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  14 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  15 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  16 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel:  17 000 00  1    0    0   0   0    0    0    00
Jun 24 21:52:44 thanteros kernel: .................................... done.
Jun 24 21:52:44 thanteros kernel: checking TSC synchronization across CPUs: passed.
Jun 24 21:52:44 thanteros kernel: PCI: PCI BIOS revision 2.10 entry at 0xfdb91
Jun 24 21:52:44 thanteros kernel: PCI: Using configuration type 1
Jun 24 21:52:44 thanteros kernel: PCI: Probing PCI hardware
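
The register dumps above can be split into fields as in the following sketch,
which assumes the classic IO-APIC register layout (version in bits 0-7 and
maximum redirection entry in bits 16-23 of register 01, APIC id in bits 24-27
of register 00, arbitration id in bits 24-27 of register 02). Whether this
board's IO-APICs follow that layout exactly is an assumption, and the function
names are invented for the example.

```python
def decode_reg00(value):
    """Register 00: physical APIC id in bits 24-27."""
    return (value >> 24) & 0x0F


def decode_reg01(value):
    """Register 01: version, maximum redirection entry, and the bits between."""
    return {
        "version": value & 0xFF,                        # bits 0-7
        "bits_8_15": (value >> 8) & 0xFF,               # reserved in the classic layout
        "max_redirection_entry": (value >> 16) & 0xFF,  # bits 16-23
        "bits_24_31": (value >> 24) & 0xFF,             # reserved in the classic layout
    }


def decode_reg02(value):
    """Register 02: arbitration id in bits 24-27."""
    return (value >> 24) & 0x0F


if __name__ == "__main__":
    # Values copied from the log above.
    for name, reg01 in (("IO APIC #2", 0x00170020), ("IO APIC #3", 0x00178020)):
        print(name, decode_reg01(reg01))
```

Both chips report version 0x20 and a maximum redirection entry of 0x17
(24 entries, matching "registers: 24"). IO APIC #3 additionally has bit 15
set in register 01, which might be why its block carries two warnings
instead of one; that is a guess, and the log does not confirm it.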


////////////////////////////////////////////////////////////////////////////////////////////////
strace logging just before cd rom mount/umount fails:

open("/usr/share/locale/en_US/LC_CTYPE", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=87756, ...}) = 0
old_mmap(NULL, 87756, PROT_READ, MAP_PRIVATE, 4, 0) = 0x2abc4000
close(4)                                = 0
open("/dev/null", O_RDWR)               = 4
close(4)                                = 0
getuid()                                = 0
geteuid()                               = 0
lstat("/etc/mtab", {st_mode=S_IFREG|0644, st_size=90, ...}) = 0
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV], NULL, 8) = 0
mount("/dev/cdrom", "/mnt/cdrom", "iso9660", MS_RDONLY|0xc0ed0000, 0


scsi : aborting command due to timeout : pid 18689, scsi0, channel 0, id 0, lun 0 Write (6) 00 00 47 10 00
scsi : aborting command due to timeout : pid 18690, scsi0, channel 0, id 0, lun 0 Write (10) 00 00 2c 02 3f 00 00 08 00
SCSI host 0 abort (pid 18688) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
SCSI host 0 channel 0 reset (pid 18688) timed out - trying harder
...
probably an unrecoverable SCSI bus or device hang.
...


NOTE: Seagate ST336704LC, rev. 0002, direct access, ANSI SCSI rev 03.
AHA274x/284x/294x aic 7892.

/dev/log
/dev/gpmctl
/dev/printer
/tmp/...gnome orbit stuff
/tmp/ X related files


During X crash, switched to console and attempted shutdown:

scsi : aborting command due to timeout : pid 24923, scsi0, channel 0, id 0, lun 0 Write (10) 00 00 38 00 6f 00 00 08 00
SCSI host 0 abort (pid .....
...
probably an unrecoverable SCSI bus or device hang.
SCSI host 0 abort (pid 24922) timed out - resetting
SCSI Bus is being reset for host 0 channel 0.
scsi : aborting command due to timeout : pid 24922, sci0....Write...(10) 00 00 39 00 6f 00 00 08 00
....


scsi : aborting command due to timeout : pid 92269, scsi0, channel 0, id 0, lun 0 Read (10) 00 04 14 84 37 00 00 02 00
scsi : aborting command due to timeout : pid 92270, scsi0, channel 0, id 0, lun 0 Write (10) 00 00 38 e2 f7 00 00 08 00
SCSI host 0 abort (pid 92270) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
scsi : aborting command due to timeout : pid 92272, scsi0, channel 0, id 0, lun 0 Write (10) 00 04 12 b5 ad 00 00 10 00
scsi : aborting command due to timeout : pid 92271, scsi0, channel 0, id 0, lun 0 Write (10) 00 04 12 b5 65 00 00 38 00
....
SCSI host 0 channel 0 reset (pid 92270) timed out - trying harder
...
probably an unrecoverable SCSI bus or device hang.
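
The hex bytes after "Write (6)", "Write (10)", and "Read (10)" can be turned
into a block address and transfer length with the sketch below. It assumes the
kernel printed the command name followed by the remaining bytes of the command
block (everything after the opcode), and it uses the standard 6-byte and
10-byte READ/WRITE layouts. The messages were copied by hand, so individual
digits may be off. The function name decode_cdb_tail is made up for this
example.

```python
def decode_cdb_tail(kind, hex_bytes):
    """Decode the bytes printed after the command name.

    kind is 6 or 10; hex_bytes is the printed string, e.g. "00 00 47 10 00".
    Returns (logical_block_address, transfer_length_in_blocks).
    """
    b = [int(x, 16) for x in hex_bytes.split()]
    if kind == 6:
        if len(b) != 5:
            raise ValueError("expected 5 bytes after a 6-byte opcode")
        # 21-bit block address, then a one-byte length where 0 means 256.
        lba = ((b[0] & 0x1F) << 16) | (b[1] << 8) | b[2]
        length = b[3] or 256
        return lba, length
    if kind == 10:
        if len(b) != 9:
            raise ValueError("expected 9 bytes after a 10-byte opcode")
        # Flags byte, 4-byte big-endian address, reserved byte, 2-byte length.
        lba = int.from_bytes(bytes(b[1:5]), "big")
        length = (b[6] << 8) | b[7]
        return lba, length
    raise ValueError("only 6-byte and 10-byte commands are handled")


if __name__ == "__main__":
    print(decode_cdb_tail(6, "00 00 47 10 00"))
    print(decode_cdb_tail(10, "00 00 2c 02 3f 00 00 08 00"))
    print(decode_cdb_tail(10, "00 04 14 84 37 00 00 02 00"))
```

Decoded this way, the timed-out commands are small transfers (2 to 56
blocks) scattered across the disk rather than one large request.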

