On Sat, 26 Feb 2011, Robin Humble wrote:
> Hi,

Hi Robin, it's been a while...

> our cluster login nodes get these messages every few days, and on one
> occasion a crash ->
>
> 2011-02-22 15:00:30 do_IRQ: 13.123 No irq handler for vector (irq -1)
> 2011-02-18 12:26:55 do_IRQ: 12.180 No irq handler for vector (irq -1)
> 2011-02-17 22:01:15 do_IRQ: 10.114 No irq handler for vector (irq -1)
> 2011-02-16 12:54:25 do_IRQ: 12.209 No irq handler for vector (irq -1)
> 2011-02-11 16:08:15 do_IRQ: 15.138 No irq handler for vector (irq -1)
> 2011-02-09 15:56:28 do_IRQ: 10.200 No irq handler for vector (irq -1)
> 2011-02-09 09:28:47 do_IRQ: 15.121 No irq handler for vector (irq -1)
> 2011-02-04 10:08:45 do_IRQ: 10.136 No irq handler for vector (irq -1)
> 2011-02-01 21:55:30 do_IRQ: 2.145 No irq handler for vector (irq -1)
> 2011-01-31 21:43:00 do_IRQ: 8.80 No irq handler for vector (irq -1)
>
> unfortunately I haven't been able to find any indication where these
> messages come from.
> however as an experiment I recently changed from using the ixgbe card
> to a built-in igb port, and I got another message, but interestingly
> there is now a netdev watchdog ->

The messages are coming from
http://lxr.linux.no/linux+*/arch/x86/kernel/irq.c#L243
which in later kernels becomes irq_32.c and irq_64.c.

> ...
> do_IRQ: 8.213 No irq handler for vector (irq -1)

These messages are due to irqbalance moving the interrupt, combined with
a kernel bug that does not handle the per-cpu vector bookkeeping
correctly for the irq. I can't seem to find the relevant upstream patch
that might have fixed this, but I'm pretty sure there is one.
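For reference, the two numbers in the message are the CPU number and the
vector number (do_IRQ() prints them as "%d.%d"), so "13.123" means
vector 123 arrived on CPU 13 with no handler registered for it. A quick
shell sketch to split a logged line apart (the sample message is copied
from the log above):

```shell
# Parse "do_IRQ: <cpu>.<vector> No irq handler for vector (irq -1)"
msg="do_IRQ: 13.123 No irq handler for vector (irq -1)"

cpu=${msg#do_IRQ: }   # strip the prefix -> "13.123 No irq handler ..."
cpu=${cpu%%.*}        # keep everything before the first dot -> "13"

vec=${msg#*.}         # strip through the first dot -> "123 No irq ..."
vec=${vec%% *}        # keep the first word -> "123"

echo "cpu=$cpu vector=$vec"   # -> cpu=13 vector=123
```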
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x146/0x1e5()
> Hardware name: SUN FIRE X4170 SERVER
> NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
> Modules linked in: ib_ucm rdma_ucm coretemp hwmon xt_tcpudp xt_multiport
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter
> ip_tables x_tables 8021q binfmt_misc ext3 jbd dm_mirror dm_region_hash dm_log
> dm_multipath scsi_dh dm_mod raid1 video output pci_slot battery ac nvram
> sd_mod crc_t10dif sg sr_mod cdrom joydev usb_storage mptsas rtc_cmos mptscsih
> rtc_core mptbase mdio button scsi_transport_sas rtc_lib ahci libata scsi_mod
> pcspkr ehci_hcd shpchp ioatdma i2c_i801 uhci_hcd pci_hotplug mgc lustre osc
> mdc lov lquota ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_cm ib_addr
> iw_cm ib_umad ib_uverbs ib_ipoib ib_cm ib_sa mlx4_ib mlx4_core ib_mad ib_core
> igb dca [last unloaded: ixgbe]
> Pid: 0, comm: swapper Not tainted 2.6.32.29-1.8.5a #1
> Call Trace:
> <IRQ> [<ffffffff8129013e>] ? dev_watchdog+0x146/0x1e5
> [<ffffffff810425fe>] warn_slowpath_common+0x7c/0x94
> [<ffffffff810426d0>] warn_slowpath_fmt+0xa4/0xa6
> [<ffffffff8103ca67>] ? enqueue_task_fair+0x109/0x116
> [<ffffffff81064fd1>] ? sched_clock_cpu+0x42/0xc7
> [<ffffffff8127cb61>] ? netdev_drivername+0x48/0x4f
> [<ffffffff8129013e>] dev_watchdog+0x146/0x1e5
> [<ffffffff8104edf6>] run_timer_softirq+0x1a9/0x245
> [<ffffffff8128fff8>] ? dev_watchdog+0x0/0x1e5
> [<ffffffff81048e0a>] __do_softirq+0xd6/0x197
> [<ffffffff8100cbdc>] call_softirq+0x1c/0x28
> [<ffffffff8100dfcb>] do_softirq+0x38/0x70
> [<ffffffff81048cf5>] irq_exit+0x3b/0x7a
> [<ffffffff812f979d>] smp_apic_timer_interrupt+0x8e/0x9c
> [<ffffffff8100c5b3>] apic_timer_interrupt+0x13/0x20
> <EOI> [<ffffffff811cfea4>] ? acpi_idle_enter_bm+0x25c/0x288
> [<ffffffff811cfe9a>] ? acpi_idle_enter_bm+0x252/0x288
> [<ffffffff8125b6a9>] ? menu_select+0x15a/0x228
> [<ffffffff8125a880>] ?
> cpuidle_idle_call+0x8a/0xe6
> [<ffffffff8100aad1>] ? cpu_idle+0x57/0x7a
> [<ffffffff812ef1c2>] ? start_secondary+0x195/0x199
> ---[ end trace 4184d428e8ff6f3e ]---
> igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
>
> where the unhandled vector happened 18 seconds before the netdev
> watchdog ->
>
> 2011-02-27 12:42:12 do_IRQ: 8.213 No irq handler for vector (irq -1)
> 2011-02-27 12:42:30 ------------[ cut here ]------------
> ...
>
> I'm using a vanilla latest 2.6.32.29 x86_64 kernel with its igb
> 1.3.16-k2 (and was using ixgbe 2.0.44-k2) and CentOS5.5 userland.
> hw is ->

Can you just pin the interrupts using irq_affinity?

> 01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network
> Connection (rev 02)
> 01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network
> Connection (rev 02)
> 07:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network
> Connection (rev 02)
> 07:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network
> Connection (rev 02)
> 19:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF
> Network Connection (rev 01)
>
> does this irq vector stuff sound like it might be an igb/ixgbe problem
> to you guys?

As per the above, I don't think it is a driver problem.

> were any problems like this fixed in your sf drivers?

No.

> I'm happy to try out those newer drivers if you think it might help...
> but if I'm on completely the wrong track and if it's more likely that
> these login nodes have bad BIOS or hardware or something then please
> let me know.

Would it work to try a newer kernel? That, or try my suggestion of
pinning the interrupts.

Jesse
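The pinning suggestion amounts to stopping irqbalance and writing a CPU
bitmask into /proc/irq/<N>/smp_affinity for each NIC interrupt. A
minimal sketch, assuming example values throughout (IRQ 50 and CPU 2
are placeholders; look up the real IRQ numbers for the interface in
/proc/interrupts):

```shell
# Sketch of pinning a NIC interrupt to one CPU so irqbalance cannot
# migrate it. IRQ 50 and CPU 2 below are hypothetical examples.

cpu=2
mask=$(printf '%x' $((1 << cpu)))   # hex CPU bitmask: CPU 2 -> "4"
echo "affinity mask for CPU $cpu: $mask"

# As root, the pinning itself would look like this (commented out here):
# service irqbalance stop                   # stop irqbalance migrating IRQs
# grep eth1 /proc/interrupts                # find the NIC's IRQ numbers
# echo "$mask" > /proc/irq/50/smp_affinity  # pin IRQ 50 to CPU 2
# cat /proc/irq/50/smp_affinity             # verify the new mask
```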
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired