Information: 5.4 kernel (2.6.18-164.el5).
I have a vmcore (from kdump). If the developers are interested, let me
know where I can upload the vmcore file.
I used the crash command to do a backtrace.
I have managed to get machines running later 5.4 and 5.5 kernels to panic
the same way.
Machines with Broadcom or Intel NICs panic the same way.
This is an NFS client using NFSv3, mounted with defaults,noatime.
The client was busy writing to NFS-mounted space while the NFS server
was being restarted several times.
So far, if I mount it with the udp option, I have not managed to panic
the machines.
The bad news is that NFSv4 is strictly TCP, if I were to go down that route.
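For anyone trying to reproduce this, the two mount variants compared above
differ only in the transport option. Here is a minimal sketch in Python that
builds both mount command lines. The server name, export path and mount point
are placeholders I made up, and nfsvers=3 is spelled out only to make the
NFSv3 assumption explicit:

```python
# Sketch only: "nfsserver", "/export" and "/mnt/nfs" are hypothetical names.
def nfs_mount_argv(server, export, mountpoint, proto="tcp",
                   extra=("defaults", "noatime")):
    """Build a mount(8) argument list for an NFSv3 mount over tcp or udp."""
    if proto not in ("tcp", "udp"):
        raise ValueError("proto must be 'tcp' or 'udp'")
    # Base options from the report, plus the NFS version and the transport.
    options = list(extra) + ["nfsvers=3", proto]
    return ["mount", "-t", "nfs", "-o", ",".join(options),
            f"{server}:{export}", mountpoint]


if __name__ == "__main__":
    # Print both variants so they can be compared or pasted into a shell.
    for proto in ("tcp", "udp"):
        print(" ".join(nfs_mount_argv("nfsserver", "/export", "/mnt/nfs", proto)))
```

The function only builds the argument list and does not run mount, so it is
safe to try on any machine.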
From the backtrace, it seems the crash is TCP-related. I'll be trying a
couple of Linux TCP settings changes.
It's a possibility that the issues are with TCP in general (not NFS).
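Before changing any TCP settings it may help to record the current values, so
each test run can be tied to a known configuration. A rough sketch in Python
follows. The list of tunables is only my guess at what might be worth
recording, not the settings that will actually be changed:

```python
import os

# Assumption: example TCP tunables only; the report does not name specific ones.
TCP_KEYS = ("tcp_retries2", "tcp_sack", "tcp_timestamps", "tcp_window_scaling")


def snapshot_tcp_settings(base="/proc/sys/net/ipv4", keys=TCP_KEYS):
    """Return {name: value} for each listed tunable that exists under base."""
    settings = {}
    for key in keys:
        path = os.path.join(base, key)
        try:
            with open(path) as handle:
                settings[key] = handle.read().strip()
        except OSError:
            # Tunable missing on this kernel, or not readable: skip it.
            continue
    return settings


if __name__ == "__main__":
    for name, value in sorted(snapshot_tcp_settings().items()):
        print(f"net.ipv4.{name} = {value}")
```

Saving this output next to each vmcore would make it easier to tell which
setting, if any, changes the behaviour.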
I would like to enlist the community's help in further understanding this
and in finding potential workarounds for this TCP issue.
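To make the "TCP-related" reading of the backtrace easy to check, here is a
small sketch in Python that picks out the task that actually faulted from
`bt -a` output. It assumes the line layout seen in the output below, and that
the faulting task is the one whose stack contains crash_kexec or __die. The
sample text is an abbreviated copy of two stacks from this report:

```python
import re

# Patterns assume the "PID: ... CPU: ..." and "#N [addr] func at addr" layout.
HEADER = re.compile(r"^PID:\s*(\d+)\s+TASK:\s*(\S+)\s+CPU:\s*(\d+)\s+COMMAND:\s*(.+)$")
FRAME = re.compile(r"^#(\d+)\s+\[(\w+)\]\s+(\S+)\s+at\s+(\w+)")

SAMPLE = """\
PID: 0 TASK: 810115f11100 CPU: 1 COMMAND: swapper
#0 [810115f38f20] crash_nmi_callback at 8007a3bf
#1 [810115f38f40] do_nmi at 8006585a
PID: 0 TASK: 810115f20080 CPU: 2 COMMAND: swapper
#0 [810115f6bbc0] crash_kexec at 800ac5b9
#1 [810115f6bc80] __die at 80065127
#2 [810115f6bcc0] do_page_fault at 80066da7
#4 [810115f6be78] tcp_transmit_skb at 800217b7
#5 [810115f6bec8] tcp_retransmit_skb at 80250ccd
#6 [810115f6bf08] tcp_write_timer at 80252652
"""


def parse_bt(text):
    """Split bt -a output into tasks, each with its CPU and function names."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            tasks.append({"cpu": int(header.group(3)),
                          "command": header.group(4).strip('"'),
                          "frames": []})
            continue
        frame = FRAME.match(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group(3))
    return tasks


def panicking_tasks(tasks):
    """Keep only tasks whose stack went through the oops/kdump path."""
    return [t for t in tasks
            if "crash_kexec" in t["frames"] or "__die" in t["frames"]]


if __name__ == "__main__":
    for task in panicking_tasks(parse_bt(SAMPLE)):
        tcp_frames = [f for f in task["frames"] if f.startswith("tcp_")]
        print(f"CPU {task['cpu']} ({task['command']}): TCP frames {tcp_frames}")
```

The other CPUs only show crash_nmi_callback because they were stopped by the
NMI sent during the crash, so they are filtered out.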
crash> sys
KERNEL: vmlinux
DUMPFILE: vmcore
CPUS: 4
DATE: Tue Apr 20 15:04:09 2010
UPTIME: 18:55:25
LOAD AVERAGE: 0.13, 0.09, 0.03
TASKS: 340
RELEASE: 2.6.18-164.el5
VERSION: #1 SMP Thu Sep 3 03:28:30 EDT 2009
MACHINE: x86_64 (2660 Mhz)
MEMORY: 23.6 GB
PANIC: Oops: [1] SMP (check log for details)
crash> bt -a
PID: 0 TASK: 802ffae0 CPU: 0 COMMAND: swapper
#0 [8043ef20] crash_nmi_callback at 8007a3bf
#1 [8043ef40] do_nmi at 8006585a
#2 [8043ef50] nmi at 80064ebf
[exception RIP: acpi_processor_idle+579]
RIP: 8019765e RSP: 803f1f48 RFLAGS: 0093
RAX: 0073111a RBX: 0073111a RCX: 0808
RDX: 0815 RSI: 0003 RDI:
RBP: 81063e480100 R8: 803f R9: 804b5e2c
R10: 0046 R11: 0046 R12:
R13: 81063e48 R14: R15:
ORIG_RAX: CS: 0010 SS: 0018
--- exception stack ---
#3 [803f1f48] acpi_processor_idle at 8019765e
#4 [803f1f90] cpu_idle at 8004939e
PID: 0 TASK: 810115f11100 CPU: 1 COMMAND: swapper
#0 [810115f38f20] crash_nmi_callback at 8007a3bf
#1 [810115f38f40] do_nmi at 8006585a
#2 [810115f38f50] nmi at 80064ebf
[exception RIP: acpi_processor_idle+579]
RIP: 8019765e RSP: 810115f2fea8 RFLAGS: 0093
RAX: 00731145 RBX: 00731145 RCX: 0808
RDX: 0815 RSI: 0003 RDI:
RBP: 81063f173900 R8: 810115f2e000 R9: 804b5e2c
R10: 0046 R11: 0046 R12: 00ff
R13: 81063f173800 R14: 0100 R15: 803ea280
ORIG_RAX: CS: 0010 SS: 0018
--- exception stack ---
#3 [810115f2fea8] acpi_processor_idle at 8019765e
#4 [810115f2fef0] cpu_idle at 8004939e
PID: 0 TASK: 810115f20080 CPU: 2 COMMAND: swapper
#0 [810115f6bbc0] crash_kexec at 800ac5b9
#1 [810115f6bc80] __die at 80065127
#2 [810115f6bcc0] do_page_fault at 80066da7
#3 [810115f6bdb0] error_exit at 8005dde9
[exception RIP: pskb_copy+307]
RIP: 8022486b RSP: 810115f6be60 RFLAGS: 00010282
RAX: 81062cd5f540 RBX: 81062cac3980 RCX: 81046fb1e550
RDX: RSI: 81062cd5f550 RDI: 0004
RBP: 810466f54a80 R8: 081f02b4 R9:
R10: 81062cac3980 R11: 00c8 R12: 0220
R13: 810466f54a80 R14: 0002 R15: 803ea2a0
ORIG_RAX: CS: 0010 SS: 0018
#4 [810115f6be78] tcp_transmit_skb at 800217b7
#5 [810115f6bec8] tcp_retransmit_skb at 80250ccd
#6 [810115f6bf08] tcp_write_timer at 80252652
#7 [810115f6bf28] run_timer_softirq at 800968be
#8 [810115f6bf58] __do_softirq at 8001235a
#9 [810115f6bf88] call_softirq at 8005e2fc
#10 [810115f6bfa0] do_softirq at 8006cb14
#11 [810115f6bfb0] apic_timer_interrupt at 8005dc8e
--- IRQ stack ---
#12 [810115f67df8] apic_timer_interrupt at 8005dc8e
[exception RIP: acpi_processor_idle+628]
RIP: 8019768f RSP: 810115f67ea8 RFLAGS: 0282
RAX: 810115f67fd8 RBX: 81063f173100 RCX: 80184973
RDX: 81063f173000 RSI: 0082 RDI: 804b5e2c
RBP: 810115f67ee8 R8: 810115f66000 R9: 810115f67ecc
R10: 0046 R11: 810115f67ee8 R12: 81063f6e1180
R13: 10008040 R14: 81063f6e1180 R15: 81063f6e1180
ORIG_RAX: ff10 CS: 0010 SS: