On Sat, 26 Feb 2011, Robin Humble wrote:

> Hi,

Hi Robin, it's been a while...
 
> our cluster login nodes get these messages every few days, and on one
> occasion a crash ->
> 
>   2011-02-22 15:00:30 do_IRQ: 13.123 No irq handler for vector (irq -1)
>   2011-02-18 12:26:55 do_IRQ: 12.180 No irq handler for vector (irq -1)
>   2011-02-17 22:01:15 do_IRQ: 10.114 No irq handler for vector (irq -1)
>   2011-02-16 12:54:25 do_IRQ: 12.209 No irq handler for vector (irq -1)
>   2011-02-11 16:08:15 do_IRQ: 15.138 No irq handler for vector (irq -1)
>   2011-02-09 15:56:28 do_IRQ: 10.200 No irq handler for vector (irq -1)
>   2011-02-09 09:28:47 do_IRQ: 15.121 No irq handler for vector (irq -1)
>   2011-02-04 10:08:45 do_IRQ: 10.136 No irq handler for vector (irq -1)
>   2011-02-01 21:55:30 do_IRQ: 2.145 No irq handler for vector (irq -1)
>   2011-01-31 21:43:00 do_IRQ: 8.80 No irq handler for vector (irq -1)
> 
> unfortunately I haven't been able to find any indication where these
> messages come from.
> however as an experiment I recently changed from using the ixgbe card
> to a built-in igb port, and I got another message, but interestingly
> there is now a netdev watchdog too ->

The messages are coming from do_IRQ in
http://lxr.linux.no/linux+*/arch/x86/kernel/irq.c#L243

(in earlier kernels the same code lived in irq_32.c and irq_64.c).
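
If you want to double-check against the exact tree you are running, grepping
for the message string will take you straight to it (the source path below is
just an example, adjust it to wherever your kernel sources are unpacked):

  # find where the "No irq handler for vector" message is printed
  grep -rn "No irq handler for vector" /usr/src/linux-2.6.32.29/arch/x86/kernel/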
 
> ...
> do_IRQ: 8.213 No irq handler for vector (irq -1)

These messages are due to irqbalance moving the interrupt, combined with a
kernel bug where the per-cpu vector-to-irq table isn't updated correctly when
the irq migrates, so a CPU can still get an interrupt on a vector that no
longer maps to any irq (hence the irq -1).
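
You can usually see irqbalance doing this by watching which CPU the device's
interrupt counts move to over time and comparing that against the affinity
mask; roughly (the IRQ number below is only an example, take the real one
from /proc/interrupts):

  # which IRQ numbers do the eth1 queues use, and on which CPUs do they fire?
  grep eth1 /proc/interrupts

  # watch the counters; with irqbalance running the active column tends to
  # hop from one CPU to another every so often
  watch -d 'grep eth1 /proc/interrupts'

  # current affinity mask for one of those IRQs (77 is a made-up number)
  cat /proc/irq/77/smp_affinity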

I can't seem to find the relevant upstream patch that may have fixed this,
but I'm pretty sure there is one.

> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x146/0x1e5()
> Hardware name: SUN FIRE X4170 SERVER          
> NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
> Modules linked in: ib_ucm rdma_ucm coretemp hwmon xt_tcpudp xt_multiport 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter 
> ip_tables x_tables 8021q binfmt_misc ext3 jbd dm_mirror dm_region_hash dm_log 
> dm_multipath scsi_dh dm_mod raid1 video output pci_slot battery ac nvram 
> sd_mod crc_t10dif sg sr_mod cdrom joydev usb_storage mptsas rtc_cmos mptscsih 
> rtc_core mptbase mdio button scsi_transport_sas rtc_lib ahci libata scsi_mod 
> pcspkr ehci_hcd shpchp ioatdma i2c_i801 uhci_hcd pci_hotplug mgc lustre osc 
> mdc lov lquota ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_cm ib_addr 
> iw_cm ib_umad ib_uverbs ib_ipoib ib_cm ib_sa mlx4_ib mlx4_core ib_mad ib_core 
> igb dca [last unloaded: ixgbe]
> Pid: 0, comm: swapper Not tainted 2.6.32.29-1.8.5a #1
> Call Trace:
>  <IRQ>  [<ffffffff8129013e>] ? dev_watchdog+0x146/0x1e5
>  [<ffffffff810425fe>] warn_slowpath_common+0x7c/0x94
>  [<ffffffff810426d0>] warn_slowpath_fmt+0xa4/0xa6
>  [<ffffffff8103ca67>] ? enqueue_task_fair+0x109/0x116
>  [<ffffffff81064fd1>] ? sched_clock_cpu+0x42/0xc7
>  [<ffffffff8127cb61>] ? netdev_drivername+0x48/0x4f
>  [<ffffffff8129013e>] dev_watchdog+0x146/0x1e5
>  [<ffffffff8104edf6>] run_timer_softirq+0x1a9/0x245
>  [<ffffffff8128fff8>] ? dev_watchdog+0x0/0x1e5
>  [<ffffffff81048e0a>] __do_softirq+0xd6/0x197
>  [<ffffffff8100cbdc>] call_softirq+0x1c/0x28
>  [<ffffffff8100dfcb>] do_softirq+0x38/0x70
>  [<ffffffff81048cf5>] irq_exit+0x3b/0x7a
>  [<ffffffff812f979d>] smp_apic_timer_interrupt+0x8e/0x9c
>  [<ffffffff8100c5b3>] apic_timer_interrupt+0x13/0x20
>  <EOI>  [<ffffffff811cfea4>] ? acpi_idle_enter_bm+0x25c/0x288
>  [<ffffffff811cfe9a>] ? acpi_idle_enter_bm+0x252/0x288
>  [<ffffffff8125b6a9>] ? menu_select+0x15a/0x228
>  [<ffffffff8125a880>] ? cpuidle_idle_call+0x8a/0xe6
>  [<ffffffff8100aad1>] ? cpu_idle+0x57/0x7a
>  [<ffffffff812ef1c2>] ? start_secondary+0x195/0x199
> ---[ end trace 4184d428e8ff6f3e ]---
> igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
> 
> where the unhandled vector happened 18 seconds before the netdev
> watchdog ->
> 
>   2011-02-27 12:42:12 do_IRQ: 8.213 No irq handler for vector (irq -1)
>   2011-02-27 12:42:30 ------------[ cut here ]------------
>   ...
> 
> I'm using a vanilla latest 2.6.32.29 x86_64 kernel with its igb
> 1.3.16-k2 (and was using ixgbe 2.0.44-k2) and CentOS5.5 userland.
> hw is ->

Can you just pin the interrupts by setting their IRQ affinity
(/proc/irq/<n>/smp_affinity)?
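
That is, stop irqbalance and set the masks by hand, something along these
lines (the IRQ number and CPU mask are only examples, use the numbers from
/proc/interrupts on your login nodes):

  # stop irqbalance so it can't rewrite the masks behind your back
  service irqbalance stop

  # pin IRQ 77 to CPU 2 (the mask is a hex CPU bitmap, 0x4 = CPU 2)
  echo 4 > /proc/irq/77/smp_affinity

  # verify
  cat /proc/irq/77/smp_affinity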

 
>   01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
> Connection (rev 02)
>   01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
> Connection (rev 02)
>   07:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
> Connection (rev 02)
>   07:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
> Connection (rev 02)
>   19:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF 
> Network Connection (rev 01)
> 
> 
> does this irq vector stuff sound like it might be an igb/ixgbe problem
> to you guys?

As per the above, I don't think it is a driver problem.

> were any problems like this fixed in your sf drivers?

no.
 
> I'm happy to try out those newer drivers if you think it might help...
> but if I'm on completely the wrong track and if it's more likely that
> these login nodes have bad BIOS or hardware or something then please
> let me know.

Would it work to try a newer kernel?  That, or try my suggestion of pinning
the interrupts.

Jesse
