Hi,

I've updated all three boxes to 4.1.15. I've just had a link outage again, but this time I got a more detailed backtrace. Not sure, but maybe it could be of some help?

[Jan30 23:53] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[ +0.097285] bond0: link status definitely down for interface eth0, disabling it
[ +0.007695] bond0: first active interface up!
[ +0.000224] ------------[ cut here ]------------
[ +0.000007] WARNING: CPU: 6 PID: 19351 at kernel/softirq.c:150 __local_bh_enable_ip+0x7a/0xb0()
[ +0.000031] Modules linked in: cbc ceph libceph fscache dlm sctp crc32c_intel crc32c_generic libcrc32c configfs netconsole autofs4 sunrpc ipmi_devintf bridge stp llc 8021
[ +0.000002] CPU: 6 PID: 19351 Comm: kworker/u32:1 Not tainted 4.1.15lb6.03 #1
[ +0.000000] Hardware name: Supermicro X10DRW/X10DRW-i, BIOS 1.0c 01/07/2015
[ +0.000005] Workqueue: bond0 bond_mii_monitor [bonding]
[ +0.000002] 0000000000000096 ffff8804c2213798 ffffffff814c104b 0000000000000096
[ +0.000001] 0000000000000000 ffff8804c22137d8 ffffffff810535a5 ffff881036f03e00
[ +0.000002] 0000000000000200 ffff8804c2213830 0000000000000000 ffffffffa05250c0
[ +0.000000] Call Trace:
[ +0.000004] [<ffffffff814c104b>] dump_stack+0x4f/0x74
[ +0.000002] [<ffffffff810535a5>] warn_slowpath_common+0x95/0xe0
[ +0.000002] [<ffffffff8105360a>] warn_slowpath_null+0x1a/0x20
[ +0.000002] [<ffffffff81057b4a>] __local_bh_enable_ip+0x7a/0xb0
[ +0.000003] [<ffffffffa07abc41>] bond_poll_controller+0x111/0x150 [bonding]
[ +0.000003] [<ffffffff814242cc>] netpoll_poll_dev+0x5c/0x1b0
[ +0.000003] [<ffffffff814072be>] ? netif_skb_features+0xfe/0x1f0
[ +0.000001] [<ffffffff81424589>] netpoll_send_skb_on_dev+0x169/0x250
[ +0.000002] [<ffffffffa07d3975>] vlan_dev_hard_start_xmit+0x105/0x120 [8021q]
[ +0.000001] [<ffffffff81423c2c>] netpoll_start_xmit+0x15c/0x1f0
[ +0.000002] [<ffffffff8142456b>] netpoll_send_skb_on_dev+0x14b/0x250
[ +0.000001] [<ffffffff8142492f>] netpoll_send_udp+0x2bf/0x400
[ +0.000002] [<ffffffffa087b234>] write_msg+0xb4/0xf0 [netconsole]
[ +0.000003] [<ffffffff810a2154>] call_console_drivers.clone.1+0xa4/0x120
[ +0.000002] [<ffffffff810a2454>] console_unlock+0x284/0x400
[ +0.000002] [<ffffffff810a2e7b>] vprintk_emit+0x20b/0x4a0
[ +0.000002] [<ffffffff810a312f>] vprintk_default+0x1f/0x30
[ +0.000001] [<ffffffff814c0f39>] printk+0x46/0x48
[ +0.000002] [<ffffffff81402ef6>] __netdev_printk+0x176/0x2e0
[ +0.000002] [<ffffffff814030b3>] netdev_info+0x53/0x60
[ +0.000003] [<ffffffffa07b30f7>] ? bond_3ad_set_carrier+0x57/0xa0 [bonding]
[ +0.000003] [<ffffffffa07ae468>] ? bond_set_carrier+0xb8/0xd0 [bonding]
[ +0.000003] [<ffffffffa07ae5fe>] bond_select_active_slave+0x17e/0x200 [bonding]
[ +0.000002] [<ffffffffa07aeb3f>] bond_mii_monitor+0x4bf/0x700 [bonding]
[ +0.000003] [<ffffffff8106b119>] process_one_work+0x139/0x470
[ +0.000001] [<ffffffff8106b573>] worker_thread+0x123/0x520
[ +0.000002] [<ffffffff8106b450>] ? process_one_work+0x470/0x470
[ +0.000001] [<ffffffff8106b450>] ? process_one_work+0x470/0x470
[ +0.000002] [<ffffffff810707ce>] kthread+0xde/0x100
[ +0.000001] [<ffffffff810706f0>] ? __init_kthread_worker+0x40/0x40
[ +0.000003] [<ffffffff814c6b52>] ret_from_fork+0x42/0x70
[ +0.000001] [<ffffffff810706f0>] ? __init_kthread_worker+0x40/0x40
[ +0.000001] ---[ end trace c168d14d53373934 ]---
[ +1.635277] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Anyway, our next step is a switch firmware update (although there's only one minor update, so I don't expect much).

BR

nik

On Mon, Jan 25, 2016 at 11:08:51AM +0100, Nikola Ciprich wrote:
> Hello netdev readers,
>
> I'd like to consult following problem we're dealing with:
>
> I have a cluster of three nodes connected to stacked Brocade ICX6610
> switches using bonded AOC-STGN-i2S adapters (they're using 82599ES
> chipsets).
>
> The problem is, I see random link failures on practically all
> interfaces. Link always goes down for very short time, then adapter
> is reset and link goes up again.
>
> Here's dmesg snippet:
>
> [Jan22 22:09] ixgbe 0000:03:00.0 eth0: NIC Link is Down
> [ +0.005610] ixgbe 0000:03:00.0 eth0: initiating reset to clear Tx work after link loss
> [ +0.012792] bond0: link status definitely down for interface eth0, disabling it
> [ +1.105826] ixgbe 0000:03:00.0 eth0: Reset adapter
> [ +0.307518] ixgbe 0000:03:00.0 eth0: detected SFP+: 3
> [ +0.145881] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>
> since I'm using bonding, it doesn't disrupt traffic, but I'd still like to
> resolve it. We're using 5m passive SFP cables, we tried replacing one with 3m
> piece, to no avail.
>
> all three boxes are supermicro X10DRW, running vanilla x86_64 4.0.5 kernel
> (I'll upgrade it to 4.1.16 soon)
>
> we were using broadcom adapter before and they were working without such
> problems (except for one particular port, which showed mysterious packet
> drops every few months, thats why we switched to intel-based adapters),
> so I think cables and switches should be fine, but I'm not sure of course
>
> I think I've seen similar problems and they were PM related, but I'm not
> sure..
>
> anyone seen similar problem?
>
> or some tips on how could I debug it?
>
> If I could provide more information, please let me know
>
> BR
>
> nik
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -------------------------------------

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-------------------------------------