[PATCH 5.10 131/663] net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
From: Shyam Sundar S K

[ Upstream commit 186edbb510bd60e748f93975989ccba25ee99c50 ]

The current driver calls netif_carrier_off() late in the link tear down
which can result in a netdev watchdog timeout.

Calling netif_carrier_off() immediately after netif_tx_stop_all_queues()
avoids the warning.

[ cut here ]
NETDEV WATCHDOG: enp3s0f2 (amd-xgbe): transmit queue 0 timed out
WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x20d/0x220
Modules linked in: amd_xgbe(E) amd-xgbe :03:00.2 enp3s0f2: Link is Down
CPU: 3 PID: 0 Comm: swapper/3 Tainted: GE
Hardware name: AMD Bilby-RV2/Bilby-RV2, BIOS RBB1202A 10/18/2019
RIP: 0010:dev_watchdog+0x20d/0x220
Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 c6 e2 c1 00 01 e8 e7 ce fc ff 89 d9 48
RSP: 0018:90cfc28c3e88 EFLAGS: 00010286
RAX: RBX: RCX: 0006
RDX: 0007 RSI: 0086 RDI: 90cfc28d63c0
RBP: 90cfb977845c R08: 0050 R09: 00196018
R10: 90cfc28c3ef8 R11: R12: 90cfb9778000
R13: 0003 R14: 90cfb9778480 R15: 0010
FS: () GS:90cfc28c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f240ff2d9d0 CR3: 0001e3e0a000 CR4: 003406e0
Call Trace:
? pfifo_fast_reset+0x100/0x100
call_timer_fn+0x2b/0x130
run_timer_softirq+0x3e8/0x440
? enqueue_hrtimer+0x39/0x90

Fixes: e722ec82374b ("amd-xgbe: Update the BelFuse quirk to support SGMII")
Co-developed-by: Sudheesh Mavila
Signed-off-by: Sudheesh Mavila
Signed-off-by: Shyam Sundar S K
Acked-by: Tom Lendacky
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  | 1 +
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 2709a2db56577..395eb0b526802 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1368,6 +1368,7 @@ static void xgbe_stop(struct xgbe_prv_data *pdata)
 		return;
 
 	netif_tx_stop_all_queues(netdev);
+	netif_carrier_off(pdata->netdev);
 
 	xgbe_stop_timers(pdata);
 	flush_workqueue(pdata->dev_workqueue);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index 93ef5a30cb8d9..19ee4db0156d6 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -1396,7 +1396,6 @@ static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 	pdata->phy_if.phy_impl.stop(pdata);
 
 	pdata->phy.link = 0;
-	netif_carrier_off(pdata->netdev);
 
 	xgbe_phy_adjust_link(pdata);
 }
-- 
2.27.0
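The reasoning in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct model_dev, model_time_after(), model_watchdog_fires(), model_demo() and the tick values are hypothetical names and numbers invented for illustration. The condition is a simplified version of what dev_watchdog() in net/sched/sch_generic.c checks, namely that a timeout is only reported for a running device whose carrier is on and whose stopped queue has not transmitted within the timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few pieces of net device state that matter here. */
struct model_dev {
	bool running;                 /* interface is up */
	bool carrier_ok;              /* carrier is still reported as present */
	bool queue_stopped;           /* transmit queue has been stopped */
	unsigned long trans_start;    /* tick of the last transmit */
	unsigned long watchdog_timeo; /* timeout, in ticks */
};

/* Wraparound-safe "a is later than b", the same idea as the kernel's time_after(). */
bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Simplified watchdog condition: nothing is checked once the carrier is off. */
bool model_watchdog_fires(const struct model_dev *dev, unsigned long now)
{
	if (!dev->running || !dev->carrier_ok)
		return false;
	return dev->queue_stopped &&
	       model_time_after(now, dev->trans_start + dev->watchdog_timeo);
}

/* Walks through the two orderings described in the commit message. */
void model_demo(void)
{
	/* Old ordering: queues already stopped, carrier still reported on. */
	struct model_dev dev = {
		.running = true, .carrier_ok = true, .queue_stopped = true,
		.trans_start = 1000, .watchdog_timeo = 500,
	};
	assert(model_watchdog_fires(&dev, 2000));

	/* Patched ordering: carrier dropped right after stopping the queues. */
	dev.carrier_ok = false;
	assert(!model_watchdog_fires(&dev, 2000));
}
```

In this model the only difference between the two cases is the carrier flag, which is why moving netif_carrier_off() next to netif_tx_stop_all_queues() is enough to keep the watchdog quiet during tear down.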
[PATCH 5.11 135/775] net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
From: Shyam Sundar S K

[ Upstream commit 186edbb510bd60e748f93975989ccba25ee99c50 ]

The current driver calls netif_carrier_off() late in the link tear down
which can result in a netdev watchdog timeout.

Calling netif_carrier_off() immediately after netif_tx_stop_all_queues()
avoids the warning.

[ cut here ]
NETDEV WATCHDOG: enp3s0f2 (amd-xgbe): transmit queue 0 timed out
WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x20d/0x220
Modules linked in: amd_xgbe(E) amd-xgbe :03:00.2 enp3s0f2: Link is Down
CPU: 3 PID: 0 Comm: swapper/3 Tainted: GE
Hardware name: AMD Bilby-RV2/Bilby-RV2, BIOS RBB1202A 10/18/2019
RIP: 0010:dev_watchdog+0x20d/0x220
Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 c6 e2 c1 00 01 e8 e7 ce fc ff 89 d9 48
RSP: 0018:90cfc28c3e88 EFLAGS: 00010286
RAX: RBX: RCX: 0006
RDX: 0007 RSI: 0086 RDI: 90cfc28d63c0
RBP: 90cfb977845c R08: 0050 R09: 00196018
R10: 90cfc28c3ef8 R11: R12: 90cfb9778000
R13: 0003 R14: 90cfb9778480 R15: 0010
FS: () GS:90cfc28c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f240ff2d9d0 CR3: 0001e3e0a000 CR4: 003406e0
Call Trace:
? pfifo_fast_reset+0x100/0x100
call_timer_fn+0x2b/0x130
run_timer_softirq+0x3e8/0x440
? enqueue_hrtimer+0x39/0x90

Fixes: e722ec82374b ("amd-xgbe: Update the BelFuse quirk to support SGMII")
Co-developed-by: Sudheesh Mavila
Signed-off-by: Sudheesh Mavila
Signed-off-by: Shyam Sundar S K
Acked-by: Tom Lendacky
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  | 1 +
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 2709a2db56577..395eb0b526802 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1368,6 +1368,7 @@ static void xgbe_stop(struct xgbe_prv_data *pdata)
 		return;
 
 	netif_tx_stop_all_queues(netdev);
+	netif_carrier_off(pdata->netdev);
 
 	xgbe_stop_timers(pdata);
 	flush_workqueue(pdata->dev_workqueue);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index 93ef5a30cb8d9..19ee4db0156d6 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -1396,7 +1396,6 @@ static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 	pdata->phy_if.phy_impl.stop(pdata);
 
 	pdata->phy.link = 0;
-	netif_carrier_off(pdata->netdev);
 
 	xgbe_phy_adjust_link(pdata);
 }
-- 
2.27.0
[PATCH 4.19 069/247] net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
From: Shyam Sundar S K

[ Upstream commit 186edbb510bd60e748f93975989ccba25ee99c50 ]

The current driver calls netif_carrier_off() late in the link tear down
which can result in a netdev watchdog timeout.

Calling netif_carrier_off() immediately after netif_tx_stop_all_queues()
avoids the warning.

[ cut here ]
NETDEV WATCHDOG: enp3s0f2 (amd-xgbe): transmit queue 0 timed out
WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x20d/0x220
Modules linked in: amd_xgbe(E) amd-xgbe :03:00.2 enp3s0f2: Link is Down
CPU: 3 PID: 0 Comm: swapper/3 Tainted: GE
Hardware name: AMD Bilby-RV2/Bilby-RV2, BIOS RBB1202A 10/18/2019
RIP: 0010:dev_watchdog+0x20d/0x220
Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 c6 e2 c1 00 01 e8 e7 ce fc ff 89 d9 48
RSP: 0018:90cfc28c3e88 EFLAGS: 00010286
RAX: RBX: RCX: 0006
RDX: 0007 RSI: 0086 RDI: 90cfc28d63c0
RBP: 90cfb977845c R08: 0050 R09: 00196018
R10: 90cfc28c3ef8 R11: R12: 90cfb9778000
R13: 0003 R14: 90cfb9778480 R15: 0010
FS: () GS:90cfc28c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f240ff2d9d0 CR3: 0001e3e0a000 CR4: 003406e0
Call Trace:
? pfifo_fast_reset+0x100/0x100
call_timer_fn+0x2b/0x130
run_timer_softirq+0x3e8/0x440
? enqueue_hrtimer+0x39/0x90

Fixes: e722ec82374b ("amd-xgbe: Update the BelFuse quirk to support SGMII")
Co-developed-by: Sudheesh Mavila
Signed-off-by: Sudheesh Mavila
Signed-off-by: Shyam Sundar S K
Acked-by: Tom Lendacky
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  | 1 +
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 5519eff584417..80cf6af822f72 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1444,6 +1444,7 @@ static void xgbe_stop(struct xgbe_prv_data *pdata)
 		return;
 
 	netif_tx_stop_all_queues(netdev);
+	netif_carrier_off(pdata->netdev);
 
 	xgbe_stop_timers(pdata);
 	flush_workqueue(pdata->dev_workqueue);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index 8a3a60bb26888..4d5506d928973 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -1396,7 +1396,6 @@ static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 	pdata->phy_if.phy_impl.stop(pdata);
 
 	pdata->phy.link = 0;
-	netif_carrier_off(pdata->netdev);
 
 	xgbe_phy_adjust_link(pdata);
 }
-- 
2.27.0
[PATCH 5.4 066/340] net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
From: Shyam Sundar S K

[ Upstream commit 186edbb510bd60e748f93975989ccba25ee99c50 ]

The current driver calls netif_carrier_off() late in the link tear down
which can result in a netdev watchdog timeout.

Calling netif_carrier_off() immediately after netif_tx_stop_all_queues()
avoids the warning.

[ cut here ]
NETDEV WATCHDOG: enp3s0f2 (amd-xgbe): transmit queue 0 timed out
WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x20d/0x220
Modules linked in: amd_xgbe(E) amd-xgbe :03:00.2 enp3s0f2: Link is Down
CPU: 3 PID: 0 Comm: swapper/3 Tainted: GE
Hardware name: AMD Bilby-RV2/Bilby-RV2, BIOS RBB1202A 10/18/2019
RIP: 0010:dev_watchdog+0x20d/0x220
Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 c6 e2 c1 00 01 e8 e7 ce fc ff 89 d9 48
RSP: 0018:90cfc28c3e88 EFLAGS: 00010286
RAX: RBX: RCX: 0006
RDX: 0007 RSI: 0086 RDI: 90cfc28d63c0
RBP: 90cfb977845c R08: 0050 R09: 00196018
R10: 90cfc28c3ef8 R11: R12: 90cfb9778000
R13: 0003 R14: 90cfb9778480 R15: 0010
FS: () GS:90cfc28c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f240ff2d9d0 CR3: 0001e3e0a000 CR4: 003406e0
Call Trace:
? pfifo_fast_reset+0x100/0x100
call_timer_fn+0x2b/0x130
run_timer_softirq+0x3e8/0x440
? enqueue_hrtimer+0x39/0x90

Fixes: e722ec82374b ("amd-xgbe: Update the BelFuse quirk to support SGMII")
Co-developed-by: Sudheesh Mavila
Signed-off-by: Sudheesh Mavila
Signed-off-by: Shyam Sundar S K
Acked-by: Tom Lendacky
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c  | 1 +
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 3bd20f7651207..da8c2c4aca7ef 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1443,6 +1443,7 @@ static void xgbe_stop(struct xgbe_prv_data *pdata)
 		return;
 
 	netif_tx_stop_all_queues(netdev);
+	netif_carrier_off(pdata->netdev);
 
 	xgbe_stop_timers(pdata);
 	flush_workqueue(pdata->dev_workqueue);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index 8a3a60bb26888..4d5506d928973 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -1396,7 +1396,6 @@ static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 	pdata->phy_if.phy_impl.stop(pdata);
 
 	pdata->phy.link = 0;
-	netif_carrier_off(pdata->netdev);
 
 	xgbe_phy_adjust_link(pdata);
 }
-- 
2.27.0
Re: NETDEV WATCHDOG: WARNING: at net/sched/sch_generic.c:442 dev_watchdog
On Wed, 19 Aug 2020 10:29:09 -0700
Jesse Brandeburg wrote:

> What I don't understand in the stack trace is this:
> > > [ 107.654661] Call Trace:
> > > [ 107.657735]
> > > [ 107.663155] ? ftrace_graph_caller+0xc0/0xc0
> > > [ 107.667929] call_timer_fn+0x3b/0x1b0
> > > [ 107.672238] ? netif_carrier_off+0x70/0x70
> > > [ 107.61] ? netif_carrier_off+0x70/0x70
> > > [ 107.682656] ? ftrace_graph_caller+0xc0/0xc0
> > > [ 107.687379] run_timer_softirq+0x3e8/0xa10
> > > [ 107.694653] ? call_timer_fn+0x1b0/0x1b0
> > > [ 107.699382] ? trace_event_raw_event_softirq+0xdd/0x150
> > > [ 107.706768] ? ring_buffer_unlock_commit+0xf5/0x210
> > > [ 107.712213] ? call_timer_fn+0x1b0/0x1b0
> > > [ 107.716625] ? __do_softirq+0x155/0x467
>
>
> If the carrier was turned off by something, that could cause the stack
> to timeout since it appears the driver didn't call this itself after
> finishing all transmits like it normally would have.
>
> Is the trace above correct? Usually the ? indicate unsure backtrace due
> to missing symbols, right?

The "?" means that there wasn't a stack frame to confirm that this was
the true call stack. What happens is that the scan of the stack will
look for any address in the stack that is for a function. If it finds
one, it will print it and add a "?" to that address.

Basically, those functions with the "?" are just addresses found in the
stack but was not part of a stack frame link.

-- Steve
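When reading a long trace, it can help to separate the entries marked "?" from the ones backed by a stack frame, following the explanation above. The helper below is a minimal sketch, assuming unquoted dmesg-style lines with an optional "[ timestamp]" prefix; frame_is_unreliable() and frame_demo() are hypothetical names and are not part of any kernel tool.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true when a call-trace line carries the "?" marker, meaning the
 * address was only found by scanning the stack and is not confirmed by a
 * stack frame link. Mail quoting prefixes ("> ") are not handled.
 */
bool frame_is_unreliable(const char *line)
{
	const char *p = line;

	while (isspace((unsigned char)*p))
		p++;

	/* Skip an optional "[ 107.663155]" style timestamp. */
	if (*p == '[') {
		const char *end = strchr(p, ']');

		if (end)
			p = end + 1;
	}

	while (isspace((unsigned char)*p))
		p++;

	return p[0] == '?' && (p[1] == ' ' || p[1] == '\t');
}

/* Two entries taken from the trace quoted in this thread. */
void frame_demo(void)
{
	assert(frame_is_unreliable("[ 107.663155] ? ftrace_graph_caller+0xc0/0xc0"));
	assert(!frame_is_unreliable("[ 107.667929] call_timer_fn+0x3b/0x1b0"));
}
```

Applied to the trace in this thread, only call_timer_fn and run_timer_softirq in the quoted excerpt would be treated as confirmed frames; the netif_carrier_off entries are marked "?" and so say nothing about whether that function was actually called.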
Re: NETDEV WATCHDOG: WARNING: at net/sched/sch_generic.c:442 dev_watchdog
Steven Rostedt wrote:

> On Wed, 19 Aug 2020 17:01:06 +0530
> Naresh Kamboju wrote:
>
> > kernel warning noticed on x86_64 while running LTP tracing
> > ftrace-stress-test
> > case. started noticing on the stable-rc linux-5.8.y branch.
> >
> > This device booted with KASAN config and DYNAMIC tracing configs and more.
> > This reported issue is not easily reproducible.
> >
> > metadata:
> > git branch: linux-5.8.y
> > git repo:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > git commit: ad8c735b1497520df959f675718f39dca8cb8019
> > git describe: v5.8.2
> > make_kernelversion: 5.8.2
> > kernel-config:
> > https://builds.tuxbuild.com/bOz0eAwkcraRiWALTW9D3Q/kernel.config
> >
> >
> > [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait,
> > stat_blocked and stat_runtime require the kernel parameter
> > schedstats=enable or kernel.sched_schedstats=1
> > [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait,
> > stat_blocked and stat_runtime require the kernel parameter
> > schedstats=enable or kernel.sched_schedstats=1
> > [ 107.507991] [ cut here ]
> > [ 107.513103] NETDEV WATCHDOG: eth0 (igb): transmit queue 2 timed out
> > [ 107.519973] WARNING: CPU: 1 PID: 331 at net/sched/sch_generic.c:442
> > dev_watchdog+0x4c7/0x4d0
> > [ 107.528907] Modules linked in: x86_pkg_temp_thermal
> > [ 107.534541] CPU: 1 PID: 331 Comm: systemd-journal Not tainted 5.8.2 #1
> > [ 107.541480] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.2 05/23/2018
> > [ 107.549314] RIP: 0010:dev_watchdog+0x4c7/0x4d0
> > [ 107.554226] Code: ff ff 48 8b 5d c8 c6 05 6d f7 94 01 01 48 89 df
> > e8 9e b4 f8 ff 44 89 e9 48 89 de 48 c7 c7 20 49 51 9c 48 89 c2 e8 91
> > 7e e9 fe <0f> 0b e9 03 ff ff ff 66 90 e8 9b 23 db fe 55 48 89 e5 41 57

> I've triggered this myself in my testing, and I assumed that adding the
> overhead of tracing and here KASAN too, made some watchdog a bit
> unhappy. By commenting out the warning, I've seen no ill effects.
>
> Perhaps this is something we need to dig a bit deeper into.

Looked into it a little, igb uses a timeout of 5 seconds, and the stack
prints the warning if we haven't completed the transmit in that time.

What I don't understand in the stack trace is this:
> > [ 107.654661] Call Trace:
> > [ 107.657735]
> > [ 107.663155] ? ftrace_graph_caller+0xc0/0xc0
> > [ 107.667929] call_timer_fn+0x3b/0x1b0
> > [ 107.672238] ? netif_carrier_off+0x70/0x70
> > [ 107.61] ? netif_carrier_off+0x70/0x70
> > [ 107.682656] ? ftrace_graph_caller+0xc0/0xc0
> > [ 107.687379] run_timer_softirq+0x3e8/0xa10
> > [ 107.694653] ? call_timer_fn+0x1b0/0x1b0
> > [ 107.699382] ? trace_event_raw_event_softirq+0xdd/0x150
> > [ 107.706768] ? ring_buffer_unlock_commit+0xf5/0x210
> > [ 107.712213] ? call_timer_fn+0x1b0/0x1b0
> > [ 107.716625] ? __do_softirq+0x155/0x467

If the carrier was turned off by something, that could cause the stack
to timeout since it appears the driver didn't call this itself after
finishing all transmits like it normally would have.

Is the trace above correct? Usually the ? indicate unsure backtrace due
to missing symbols, right?
Re: NETDEV WATCHDOG: WARNING: at net/sched/sch_generic.c:442 dev_watchdog
On Wed, 19 Aug 2020 17:01:06 +0530 Naresh Kamboju wrote: > kernel warning noticed on x86_64 while running LTP tracing ftrace-stress-test > case. started noticing on the stable-rc linux-5.8.y branch. > > This device booted with KASAN config and DYNAMIC tracing configs and more. > This reported issue is not easily reproducible. > > metadata: > git branch: linux-5.8.y > git repo: > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > git commit: ad8c735b1497520df959f675718f39dca8cb8019 > git describe: v5.8.2 > make_kernelversion: 5.8.2 > kernel-config: > https://builds.tuxbuild.com/bOz0eAwkcraRiWALTW9D3Q/kernel.config > > > [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait, > stat_blocked and stat_runtime require the kernel parameter > schedstats=enable or kernel.sched_schedstats=1 > [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait, > stat_blocked and stat_runtime require the kernel parameter > schedstats=enable or kernel.sched_schedstats=1 > [ 107.507991] [ cut here ] > [ 107.513103] NETDEV WATCHDOG: eth0 (igb): transmit queue 2 timed out > [ 107.519973] WARNING: CPU: 1 PID: 331 at net/sched/sch_generic.c:442 > dev_watchdog+0x4c7/0x4d0 > [ 107.528907] Modules linked in: x86_pkg_temp_thermal > [ 107.534541] CPU: 1 PID: 331 Comm: systemd-journal Not tainted 5.8.2 #1 > [ 107.541480] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS > 2.2 05/23/2018 > [ 107.549314] RIP: 0010:dev_watchdog+0x4c7/0x4d0 > [ 107.554226] Code: ff ff 48 8b 5d c8 c6 05 6d f7 94 01 01 48 89 df > e8 9e b4 f8 ff 44 89 e9 48 89 de 48 c7 c7 20 49 51 9c 48 89 c2 e8 91 > 7e e9 fe <0f> 0b e9 03 ff ff ff 66 90 e8 9b 23 db fe 55 48 89 e5 41 57 I've triggered this myself in my testing, and I assumed that adding the overhead of tracing and here KASAN too, made some watchdog a bit unhappy. By commenting out the warning, I've seen no ill effects. Perhaps this is something we need to dig a bit deeper into. 
-- Steve > 41 56 > [ 107.573476] RSP: 0018:888230889d88 EFLAGS: 00010286 > [ 107.579264] RAX: RBX: 88822bbb RCX: > dc00 > [ 107.586928] RDX: 111046114c99 RSI: 9a7e4dbe RDI: > 9b7a6da7 > [ 107.594473] RBP: 888230889de0 R08: 9a7e4dd3 R09: > ed1044de2529 > [ 107.602101] R10: 888226f12943 R11: ed1044de2528 R12: > 88822bbb0440 > [ 107.609648] R13: 0002 R14: 88822bbb0388 R15: > 88822bbb0380 > [ 107.617197] FS: 7f8b471bb480() GS:88823088() > knlGS: > [ 107.625698] CS: 0010 DS: ES: CR0: 80050033 > [ 107.631944] CR2: 0008 CR3: 000226a64001 CR4: > 003606e0 > [ 107.639496] DR0: DR1: DR2: > > [ 107.647092] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 107.654661] Call Trace: > [ 107.657735] > [ 107.663155] ? ftrace_graph_caller+0xc0/0xc0 > [ 107.667929] call_timer_fn+0x3b/0x1b0 > [ 107.672238] ? netif_carrier_off+0x70/0x70 > [ 107.61] ? netif_carrier_off+0x70/0x70 > [ 107.682656] ? ftrace_graph_caller+0xc0/0xc0 > [ 107.687379] run_timer_softirq+0x3e8/0xa10 > [ 107.694653] ? call_timer_fn+0x1b0/0x1b0 > [ 107.699382] ? trace_event_raw_event_softirq+0xdd/0x150 > [ 107.706768] ? ring_buffer_unlock_commit+0xf5/0x210 > [ 107.712213] ? call_timer_fn+0x1b0/0x1b0 > [ 107.716625] ? __do_softirq+0x155/0x467 > Aug 22 04:21:44 intel-corei7-64 [ 107.721972] ? run_timer_softirq+0x5/0xa10 > user.warn kernel[ 107.727997] ? asm_call_on_stack+0x12/0x20 > : [ 107.507991] [ c[ 107.735546] ? > ftrace_graph_caller+0xc0/0xc0 > ut here ]---[ 107.740453] __do_softirq+0x160/0x467 > - > [ 107.745737] ? hrtimer_interrupt+0x5/0x340 > [ 107.753961] asm_call_on_stack+0x12/0x20 > [ 107.758672] > [ 107.761555] do_softirq_own_stack+0x3f/0x50 > [ 107.766521] ? ftrace_graph_caller+0xc0/0xc0 > [ 107.771246] irq_exit_rcu+0xff/0x110 > [ 107.776116] ? 
ftrace_graph_caller+0xc0/0xc0 > [ 107.780808] sysvec_apic_timer_interrupt+0x38/0x90 > [ 107.786971] asm_sysvec_apic_timer_interrupt+0x12/0x20 > [ 107.792598] RIP: 0010:profile_graph_return+0x111/0x1d0 > [ 107.798204] Code: 75 e1 48 8b 45 d0 f6 c4 02 75 16 50 9d e8 f7 ff > 02 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 c3 fb 02 00 ff > 75 d0 9d <48> 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 8d 7b 20 e8 > 77 78 > [ 107.817416] RSP: 0018:8882269b73a0 EFLAGS: 0286 > [ 107.823201] RAX: 8882269b73d8 RBX: 8882269b7428 RCX: > dc00 > [ 107.830785] RDX: dc000
NETDEV WATCHDOG: WARNING: at net/sched/sch_generic.c:442 dev_watchdog
kernel warning noticed on x86_64 while running LTP tracing ftrace-stress-test case. started noticing on the stable-rc linux-5.8.y branch. This device booted with KASAN config and DYNAMIC tracing configs and more. This reported issue is not easily reproducible. metadata: git branch: linux-5.8.y git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git commit: ad8c735b1497520df959f675718f39dca8cb8019 git describe: v5.8.2 make_kernelversion: 5.8.2 kernel-config: https://builds.tuxbuild.com/bOz0eAwkcraRiWALTW9D3Q/kernel.config [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1 [ 88.139387] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1 [ 107.507991] [ cut here ] [ 107.513103] NETDEV WATCHDOG: eth0 (igb): transmit queue 2 timed out [ 107.519973] WARNING: CPU: 1 PID: 331 at net/sched/sch_generic.c:442 dev_watchdog+0x4c7/0x4d0 [ 107.528907] Modules linked in: x86_pkg_temp_thermal [ 107.534541] CPU: 1 PID: 331 Comm: systemd-journal Not tainted 5.8.2 #1 [ 107.541480] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 107.549314] RIP: 0010:dev_watchdog+0x4c7/0x4d0 [ 107.554226] Code: ff ff 48 8b 5d c8 c6 05 6d f7 94 01 01 48 89 df e8 9e b4 f8 ff 44 89 e9 48 89 de 48 c7 c7 20 49 51 9c 48 89 c2 e8 91 7e e9 fe <0f> 0b e9 03 ff ff ff 66 90 e8 9b 23 db fe 55 48 89 e5 41 57 41 56 [ 107.573476] RSP: 0018:888230889d88 EFLAGS: 00010286 [ 107.579264] RAX: RBX: 88822bbb RCX: dc00 [ 107.586928] RDX: 111046114c99 RSI: 9a7e4dbe RDI: 9b7a6da7 [ 107.594473] RBP: 888230889de0 R08: 9a7e4dd3 R09: ed1044de2529 [ 107.602101] R10: 888226f12943 R11: ed1044de2528 R12: 88822bbb0440 [ 107.609648] R13: 0002 R14: 88822bbb0388 R15: 88822bbb0380 [ 107.617197] FS: 7f8b471bb480() GS:88823088() knlGS: [ 107.625698] CS: 0010 DS: ES: CR0: 
80050033 [ 107.631944] CR2: 0008 CR3: 000226a64001 CR4: 003606e0 [ 107.639496] DR0: DR1: DR2: [ 107.647092] DR3: DR6: fffe0ff0 DR7: 0400 [ 107.654661] Call Trace: [ 107.657735] [ 107.663155] ? ftrace_graph_caller+0xc0/0xc0 [ 107.667929] call_timer_fn+0x3b/0x1b0 [ 107.672238] ? netif_carrier_off+0x70/0x70 [ 107.61] ? netif_carrier_off+0x70/0x70 [ 107.682656] ? ftrace_graph_caller+0xc0/0xc0 [ 107.687379] run_timer_softirq+0x3e8/0xa10 [ 107.694653] ? call_timer_fn+0x1b0/0x1b0 [ 107.699382] ? trace_event_raw_event_softirq+0xdd/0x150 [ 107.706768] ? ring_buffer_unlock_commit+0xf5/0x210 [ 107.712213] ? call_timer_fn+0x1b0/0x1b0 [ 107.716625] ? __do_softirq+0x155/0x467 Aug 22 04:21:44 intel-corei7-64 [ 107.721972] ? run_timer_softirq+0x5/0xa10 user.warn kernel[ 107.727997] ? asm_call_on_stack+0x12/0x20 : [ 107.507991] [ c[ 107.735546] ? ftrace_graph_caller+0xc0/0xc0 ut here ]---[ 107.740453] __do_softirq+0x160/0x467 - [ 107.745737] ? hrtimer_interrupt+0x5/0x340 [ 107.753961] asm_call_on_stack+0x12/0x20 [ 107.758672] [ 107.761555] do_softirq_own_stack+0x3f/0x50 [ 107.766521] ? ftrace_graph_caller+0xc0/0xc0 [ 107.771246] irq_exit_rcu+0xff/0x110 [ 107.776116] ? ftrace_graph_caller+0xc0/0xc0 [ 107.780808] sysvec_apic_timer_interrupt+0x38/0x90 [ 107.786971] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 107.792598] RIP: 0010:profile_graph_return+0x111/0x1d0 [ 107.798204] Code: 75 e1 48 8b 45 d0 f6 c4 02 75 16 50 9d e8 f7 ff 02 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 c3 fb 02 00 ff 75 d0 9d <48> 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 8d 7b 20 e8 77 78 [ 107.817416] RSP: 0018:8882269b73a0 EFLAGS: 0286 [ 107.823201] RAX: 8882269b73d8 RBX: 8882269b7428 RCX: dc00 [ 107.830785] RDX: dc00 RSI: 9a7e4dbe RDI: 9a7a955d [ 107.838411] RBP: 8882269b73d8 R08: 9a7e4dd3 R09: ed1044de2529 [ 107.846072] R10: 888226f12943 R11: ed1044de2528 R12: 8882308a67c0 [ 107.853621] R13: 888226f12930 R14: 8882308a67c8 R15: 88822c7e4000 [ 107.863449] ? 
ftrace_return_to_handler+0x1a3/0x230 Aug 22 04:21:44 [ 107.869545] ? ftrace_return_to_handler+0x18e/0x230 intel-corei7-64 [ 107.875178] ? profile_graph_return+0x10d/0x1d0 user.info kernel: [ 107.513103][ 107.882521] ? unwind_dump+0x100/0x100 NETDEV WATCHDOG: eth0 (igb): tr[ 107.889054] ? unwind_next_frame.part.0+0xe0/0x360 ansmit queue 2 t[ 107.895638] ftrace_return_to_handler+0x18e/0x230 imed out [ 107.902594] ? function_graph_enter+0x2d0/0x2d0 [ 107.907616] ? unwind_next_fra
Re: stable-rc 4.19: NETDEV WATCHDOG: eth0 (asix): transmit queue 0 timed out - net/sched/sch_generic.c:466 dev_watchdog
While running selftests bpf test_sysctl on stable rc 5.6 branch kernel
on arm64 hikey device. The following warning was noticed.

[ 1097.207013] NETDEV WATCHDOG: eth0 (asix): transmit queue 0 timed out
[ 1097.387913] WARNING: CPU: 0 PID: 206 at /usr/src/kernel/net/sched/sch_generic.c:443 dev_watchdog+0x438/0x470
[ 1097.479820] Modules linked in: cls_bpf sch_fq sch_ingress test_bpf algif_hash af_alg wl18xx wlcore mac80211 libarc4 cfg80211 hci_uart snd_soc_audio_graph_card snd_soc_simple_card_utils btqca crct10dif_ce btbcm adv7511 wlcore_sdio bluetooth cec ecdh_generic ecc lima rfkill kirin_drm gpu_sched drm_kms_helper dw_drm_dsi drm fuse [last unloaded: trace_printk]
[ 1097.684705] CPU: 0 PID: 206 Comm: jbd2/mmcblk0p9- Not tainted 5.6.13-rc1 #1
[ 1097.776526] Hardware name: HiKey Development Board (DT)
[ 1097.865766] pstate: 6005 (nZCv daif -PAN -UAO)
[ 1097.954668] pc : dev_watchdog+0x438/0x470
[ 1098.042508] lr : dev_watchdog+0x438/0x470

ref:
https://qa-reports.linaro.org/lkft/linux-stable-rc-5.6-oe/build/v5.6.12-119-gf1d28d1c7608/testrun/1430360/log

On Tue, 5 May 2020 at 17:01, Naresh Kamboju wrote:
>
> While running selftests bpf test_sysctl on stable rc 4.19 branch kernel
> on arm64 hikey device. The following warning was noticed.
>
> [ 118.957395] test_bpf: #296 BPF_MAXINSNS: exec all MSH
> [ 148.966435] [ cut here ]
> [ 148.988349] NETDEV WATCHDOG: eth0 (asix): transmit queue 0 timed out
> [ 149.000832] WARNING: CPU: 0 PID: 0 at
> /usr/src/kernel/net/sched/sch_generic.c:466 dev_watchdog+0x2b4/0x2c0
> [ 149.016470] Modules linked in: test_bpf(+) wl18xx wlcore mac80211
> cfg80211 crc32_ce hci_uart crct10dif_ce btbcm snd_soc_audio_graph_card
> bluetooth snd_soc_simple_card_utils adv7511 cec wlcore_sdio kirin_drm
> dw_drm_dsi rfkill drm_kms_helper drm drm_panel_orientation_quirks fuse
> [last unloaded: test_bpf]
> [ 149.056507] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.121-rc1 #1
> [ 149.069594] Hardware name: HiKey Development Board (DT)
> [ 149.081514] pstate: 8005 (Nzcv daif -PAN -UAO)
> [ 149.093062] pc : dev_watchdog+0x2b4/0x2c0
> [ 149.103862] lr : dev_watchdog+0x2b4/0x2c0
> [ 149.114575] sp : 08003d10
> [ 149.124613] x29: 08003d10 x28: 0002
> [ 149.136698] x27: 0001 x26:
> [ 149.148810] x25: 0180 x24: 800074c654b8
> [ 149.160891] x23: 800074c65460 x22: 8000748dd680
> [ 149.172993] x21: 0974a000 x20: 800074c65000
> [ 149.185065] x19: x18:
> [ 149.197172] x17: x16:
> [ 149.209243] x15: 0001 x14: 09062cd8
> [ 149.221234] x13: 45a6fc2a x12: 0975b630
> [ 149.233166] x11: x10: 0974fa48
> [ 149.245023] x9 : 097e3000 x8 : 0974fa48
> [ 149.256818] x7 : 08173694 x6 : 800077ee62d0
> [ 149.268639] x5 : 800077ee62d0 x4 :
> [ 149.280412] x3 : 800077eef6c8 x2 : 0103
> [ 149.292120] x1 : d13523b333b73d00 x0 :
> [ 149.303783] Call trace:
> [ 149.312481] dev_watchdog+0x2b4/0x2c0
> [ 149.322463] call_timer_fn+0xbc/0x3f0
> [ 149.332463] expire_timers+0x104/0x220
> [ 149.342493] run_timer_softirq+0xec/0x1a8
> [ 149.352784] __do_softirq+0x114/0x554
> [ 149.362668] irq_exit+0x144/0x150
> [ 149.372235] __handle_domain_irq+0x6c/0xc0
> [ 149.382633] gic_handle_irq+0x60/0xb0
> [ 149.392606] el1_irq+0xb4/0x130
> [ 149.402031] cpuidle_enter_state+0xbc/0x3f0
> [ 149.412572] cpuidle_enter+0x34/0x48
> [ 149.422539] call_cpuidle+0x44/0x78
> [ 149.432410] do_idle+0x228/0x2a8
> [ 149.441959] cpu_startup_entry+0x2c/0x30
> [ 149.452185] rest_init+0x25c/0x270
> [ 149.461821] start_kernel+0x468/0x494
> [ 149.471659] irq event stamp: 5706193
> [ 149.481376] hardirqs last enabled at (5706192):
> [] console_unlock+0x424/0x638
> [ 149.496628] hardirqs last disabled at (5706193):
> [] do_debug_exception+0xf8/0x1d0
> [ 149.512207] softirqs last enabled at (5706160):
> [] _local_bh_enable+0x28/0x48
> [ 149.527590] softirqs last disabled at (5706161):
> [] irq_exit+0x144/0x150
> [ 149.542410] ---[ end trace 4c7bd8e08a6a3d65 ]---
> [ 177.828500] jited:1 1366234 PASS
>
> ref:
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.19-oe/build/v4.19.120-38-g2e3613309d93/testrun/1415357/log
>
> metadata:
> git branch: linux-4.19.y
> git repo:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> make_kernelversion: 4.19.121-rc1
> kernel-config:
> http://snapshots.linaro.org/openembedded/lkft/lkft/sumo/hikey/lkft/linux-stable-rc-4.19/530/config
>

-- 
Linaro LKFT
https://lkft.linaro.org
stable-rc 4.19: NETDEV WATCHDOG: eth0 (asix): transmit queue 0 timed out - net/sched/sch_generic.c:466 dev_watchdog
While running selftests bpf test_sysctl on stable rc 4.19 branch kernel
on arm64 hikey device. The following warning was noticed.

[ 118.957395] test_bpf: #296 BPF_MAXINSNS: exec all MSH
[ 148.966435] [ cut here ]
[ 148.988349] NETDEV WATCHDOG: eth0 (asix): transmit queue 0 timed out
[ 149.000832] WARNING: CPU: 0 PID: 0 at /usr/src/kernel/net/sched/sch_generic.c:466 dev_watchdog+0x2b4/0x2c0
[ 149.016470] Modules linked in: test_bpf(+) wl18xx wlcore mac80211 cfg80211 crc32_ce hci_uart crct10dif_ce btbcm snd_soc_audio_graph_card bluetooth snd_soc_simple_card_utils adv7511 cec wlcore_sdio kirin_drm dw_drm_dsi rfkill drm_kms_helper drm drm_panel_orientation_quirks fuse [last unloaded: test_bpf]
[ 149.056507] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.121-rc1 #1
[ 149.069594] Hardware name: HiKey Development Board (DT)
[ 149.081514] pstate: 8005 (Nzcv daif -PAN -UAO)
[ 149.093062] pc : dev_watchdog+0x2b4/0x2c0
[ 149.103862] lr : dev_watchdog+0x2b4/0x2c0
[ 149.114575] sp : 08003d10
[ 149.124613] x29: 08003d10 x28: 0002
[ 149.136698] x27: 0001 x26:
[ 149.148810] x25: 0180 x24: 800074c654b8
[ 149.160891] x23: 800074c65460 x22: 8000748dd680
[ 149.172993] x21: 0974a000 x20: 800074c65000
[ 149.185065] x19: x18:
[ 149.197172] x17: x16:
[ 149.209243] x15: 0001 x14: 09062cd8
[ 149.221234] x13: 45a6fc2a x12: 0975b630
[ 149.233166] x11: x10: 0974fa48
[ 149.245023] x9 : 097e3000 x8 : 0974fa48
[ 149.256818] x7 : 08173694 x6 : 800077ee62d0
[ 149.268639] x5 : 800077ee62d0 x4 :
[ 149.280412] x3 : 800077eef6c8 x2 : 0103
[ 149.292120] x1 : d13523b333b73d00 x0 :
[ 149.303783] Call trace:
[ 149.312481] dev_watchdog+0x2b4/0x2c0
[ 149.322463] call_timer_fn+0xbc/0x3f0
[ 149.332463] expire_timers+0x104/0x220
[ 149.342493] run_timer_softirq+0xec/0x1a8
[ 149.352784] __do_softirq+0x114/0x554
[ 149.362668] irq_exit+0x144/0x150
[ 149.372235] __handle_domain_irq+0x6c/0xc0
[ 149.382633] gic_handle_irq+0x60/0xb0
[ 149.392606] el1_irq+0xb4/0x130
[ 149.402031] cpuidle_enter_state+0xbc/0x3f0
[ 149.412572] cpuidle_enter+0x34/0x48
[ 149.422539] call_cpuidle+0x44/0x78
[ 149.432410] do_idle+0x228/0x2a8
[ 149.441959] cpu_startup_entry+0x2c/0x30
[ 149.452185] rest_init+0x25c/0x270
[ 149.461821] start_kernel+0x468/0x494
[ 149.471659] irq event stamp: 5706193
[ 149.481376] hardirqs last enabled at (5706192): [] console_unlock+0x424/0x638
[ 149.496628] hardirqs last disabled at (5706193): [] do_debug_exception+0xf8/0x1d0
[ 149.512207] softirqs last enabled at (5706160): [] _local_bh_enable+0x28/0x48
[ 149.527590] softirqs last disabled at (5706161): [] irq_exit+0x144/0x150
[ 149.542410] ---[ end trace 4c7bd8e08a6a3d65 ]---
[ 177.828500] jited:1 1366234 PASS

ref:
https://qa-reports.linaro.org/lkft/linux-stable-rc-4.19-oe/build/v4.19.120-38-g2e3613309d93/testrun/1415357/log

metadata:
git branch: linux-4.19.y
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
make_kernelversion: 4.19.121-rc1
kernel-config: http://snapshots.linaro.org/openembedded/lkft/lkft/sumo/hikey/lkft/linux-stable-rc-4.19/530/config

-- 
Linaro LKFT
https://lkft.linaro.org
Re: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
On Tue, Mar 20, 2018 at 11:41:06AM +0530, Satish Baddipadige wrote:
> Can you please test the attached patch?

Well, the network connection just died with it. It didn't fire the
netdev watchdog but I still had to down and up eth0 in order to continue
using it. ssh connection into the box survived so I didn't have to login
again but it still died intermittently.

I'll keep playing with it to see if I'll catch some sort of splat...

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
On Tue, Mar 20, 2018 at 11:41:06AM +0530, Satish Baddipadige wrote:
> Can you please test the attached patch?

Sure, will do when I get back next week.

Thx.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply. Srsly.
Re: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
On Wed, Feb 28, 2018 at 7:40 PM, Siva Reddy Kallam wrote:
> On Sat, Feb 24, 2018 at 3:48 PM, Borislav Petkov wrote:
>> Hi,
>>
>> this didn't happen before but after 4.16-rc1 my tg3 nic stops for
>> whatever reason and the connection to the machine is dead. It didn't show
>> anything in dmesg until today.
>>
>> The IO pagefaults look like it is trying to access something it
>> shouldn't and maybe that's why it times out.
>>
>> It triggers pretty quickly so I'd call it a reliable reproducer and thus
>> I can test patches... :-)
>>
>> Thx.
> Thanks for reporting this. Somehow, this mail moved to my spam folder.
> Hence, delay in response.
> Looks like this is similar to below issue and it was reported some time back.
> https://www.spinics.net/lists/netdev/msg482757.html
> We are actively working on this. We will soon provide you an update on this.

Hi Borislav,

Can you please test the attached patch?

Thanks,
Satish

tg3_5762_clock_override.patch
Description: Binary data
Re: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
On Sat, Feb 24, 2018 at 3:48 PM, Borislav Petkov wrote:
> Hi,
>
> this didn't happen before but after 4.16-rc1 my tg3 nic stops for
> whatever reason and the connection to the machine is dead. It didn't show
> anything in dmesg until today.
>
> The IO pagefaults look like it is trying to access something it
> shouldn't and maybe that's why it times out.
>
> It triggers pretty quickly so I'd call it a reliable reproducer and thus
> I can test patches... :-)
>
> Thx.

Thanks for reporting this. Somehow, this mail moved to my spam folder. Hence, delay in response.
Looks like this is similar to below issue and it was reported some time back.
https://www.spinics.net/lists/netdev/msg482757.html
We are actively working on this. We will soon provide you an update on this.
NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Hi,

this didn't happen before but after 4.16-rc1 my tg3 nic stops for whatever reason and the connection to the machine is dead. It didn't show anything in dmesg until today.

The IO pagefaults look like it is trying to access something it shouldn't and maybe that's why it times out.

It triggers pretty quickly so I'd call it a reliable reproducer and thus I can test patches... :-)

Thx.

...
[ 15.916840] random: crng init done
[ 44.792699] tg3 :01:00.0 eth0: Link is up at 100 Mbps, full duplex
[ 44.793024] tg3 :01:00.0 eth0: Flow control is on for TX and on for RX
[ 44.793315] tg3 :01:00.0 eth0: EEE is disabled
[ 44.793395] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 58.216474] tg3 :01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x0001f0c0 flags=0x]
[ 58.216943] tg3 :01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x0001f100 flags=0x]
[ 58.217395] tg3 :01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x0001f140 flags=0x]
[ 58.217844] tg3 :01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x0001f180 flags=0x]
[ 64.992145] ------------[ cut here ]------------
[ 64.992406] NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
[ 64.992742] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:464 dev_watchdog+0x1fe/0x210
[ 64.992744] Modules linked in: arc4 iwlmvm mac80211 amdgpu kvm_amd kvm iwlwifi irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel snd_hda_codec_conexant snd_hda_codec_hdmi snd_hda_codec_generic aesni_intel sha256_generic aes_x86_64 crypto_simd snd_hda_intel cryptd glue_helper tg3 snd_hda_codec pcspkr snd_hwdep cfg80211 joydev psmouse ptp snd_hda_core hp_wmi pps_core snd_pcm ehci_pci chash tpm_infineon rfkill libphy i2c_piix4 snd_timer fam15h_power xhci_pci ehci_hcd snd sg gpu_sched k10temp soundcore xhci_hcd tpm_tis tpm_tis_core video tpm battery button ac acpi_cpufreq evdev input_leds serio_raw sd_mod thermal pinctrl_amd
[ 64.993216] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-rc1+ #2
[ 64.993222] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.08 01/28/2016
[ 64.996048] RIP: 0010:dev_watchdog+0x1fe/0x210
[ 64.996050] RSP: 0018:88043dc83e88 EFLAGS: 00010282
[ 64.996052] RAX: RBX: RCX: 0103
[ 64.996054] RDX: 8103 RSI: 0086 RDI:
[ 64.996055] RBP: 88042b86e39c R08: 81c0a400 R09: 0001
[ 64.996057] R10: 035a R11: R12: 88042b86e3b0
[ 64.996058] R13: 88042b86e000 R14: 0005 R15: 88042a0ced80
[ 64.996061] FS: () GS:88043dc8() knlGS:
[ 64.996063] CS: 0010 DS: ES: CR0: 80050033
[ 64.996065] CR2: 7f98ed87eb00 CR3: 000428ea CR4: 001406e0
[ 64.996068] Call Trace:
[ 64.996074]
[ 64.996082] ? qdisc_reset+0xe0/0xe0
[ 64.996085] ? qdisc_reset+0xe0/0xe0
[ 64.996092] call_timer_fn+0x2b/0x150
[ 64.996097] run_timer_softirq+0x415/0x460
[ 64.996101] ? tick_sched_timer+0x42/0x90
[ 64.996106] ? _raw_spin_lock_irq+0x1a/0x40
[ 64.996110] ? __hrtimer_run_queues+0x113/0x2d0
[ 64.996114] __do_softirq+0xeb/0x2d5
[ 64.996121] irq_exit+0xaa/0xb0
[ 64.996125] smp_apic_timer_interrupt+0x73/0x150
[ 64.996128] apic_timer_interrupt+0x7d/0x90
[ 64.996131]
[ 64.996136] RIP: 0010:cpuidle_enter_state+0xa3/0x2f0
[ 64.996138] RSP: 0018:c900019c3ea8 EFLAGS: 0246 ORIG_RAX: ff12
[ 64.996141] RAX: 88043dc8 RBX: 000f21d4b954 RCX: 001f
[ 64.996142] RDX: 000f21d4b954 RSI: 81da4ca1 RDI: 81db2a9e
[ 64.996144] RBP: 88042a39a200 R08: 0005a0b5 R09: 000585fa
[ 64.996145] R10: 0018 R11: 00049370 R12: 0002
[ 64.996146] R13: 82095db8 R14: R15: 000f0b23994e
[ 64.996157] ? cpuidle_enter_state+0x93/0x2f0
[ 65.003171] do_idle+0x19a/0x1f0
[ 65.003176] cpu_startup_entry+0x6f/0x80
[ 65.003181] start_secondary+0x1a5/0x200
[ 65.003185] secondary_startup_64+0xa5/0xb0
[ 65.003189] Code: 00 49 63 4c 24 f0 eb 93 4c 89 ef c6 05 5b 10 af 00 01 e8 b6 67 fd ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 20 f6 df 81 e8 e2 8d a7 ff <0f> ff eb be 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[ 65.003234] ---[ end trace b191673f18a75f41 ]---
[ 65.003243] tg3 :01:00.0 eth0: transmit timed out, resetting
[ 67.679695] tg3 :01:00.0 eth0: 0x: 0x168714e4, 0x10100406, 0x0210, 0x
[ 67.680053] tg3 :01:00.0 eth0: 0x0010: 0xd082000c, 0x, 0xd081000c, 0x
[ 67.680406] tg3 :01:00.0 eth0: 0x0020: 0xd08c, 0x, 0x, 0x807e103c
[ 67.680419] tg3 :01:00.0 eth0: 0x0030: 0x00
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Hi Folks,

I'm running Slackware Linux 14 (32-bit) as a firewall/IPsec gateway with Linux 3.18.0.

I got this error after just 12 hours of uptime:

WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0xee/0x174()
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Modules linked in: xfrm4_tunnel af_key authenc xfrm4_mode_tunnel deflate zlib_deflate zlib_inflate ctr twofish_generic twofish_i586 twofish_common serpent_sse2_i586 serpent_generic glue_helper blowfish_generic blowfish_common cbc ecb sha512_generic hmac tunnel4 ipcomp xfrm_ipcomp esp4 xts lrw gf128mul ablk_helper cryptd aes_i586 des_generic md5 sha1_generic sha256_generic nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_ftp nf_conntrack_ftp xt_helper xt_mark xt_statistic xt_nat xt_multiport xt_limit xt_tcpudp xt_policy xt_conntrack iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables uhci_hcd ehci_pci tg3 ehci_hcd 8250 ptp usbcore pps_core serial_core libphy rtc_cmos r8169 usb_common mii processor thermal_sys hwmon loop [last unloaded: af_key]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0 #1
Hardware name: Hewlett-Packard HP xw4400 Workstation/0A68h, BIOS 786D7 v02.07 10/28/2010
 c125fe2a f580bf6c c10282cb c1202130 f5904000 02b1e170 c1028346 0009 f580bf6c c1302f9c f580bf84 c1202130 c1302fd5 012f c1302f9c f5904000 f812e2e0 8100 c1202042 f580bfbc
Call Trace:
 [<c125fe2a>] ? dump_stack+0x3e/0x4e
 [<c10282cb>] ? warn_slowpath_common+0x61/0x74
 [<c1202130>] ? dev_watchdog+0xee/0x174
 [<c1028346>] ? warn_slowpath_fmt+0x29/0x2d
 [<c1202130>] ? dev_watchdog+0xee/0x174
 [<c1202042>] ? pfifo_fast_dequeue+0xa2/0xa2
 [<c1051446>] ? call_timer_fn.isra.30+0xf/0x5a
 [<c10516b1>] ? run_timer_softirq+0x126/0x16f
 [<c102a10c>] ? __do_softirq+0x8f/0x16c
 [<c102a07d>] ? __hrtimer_tasklet_trampoline+0x13/0x13
 [<c1002e71>] ? do_softirq_own_stack+0x1a/0x1f
 <IRQ>
 [<c102a313>] ? irq_exit+0x31/0x70
 [<c101cbec>] ? smp_apic_timer_interrupt+0x30/0x39
 [<c1263609>] ? apic_timer_interrupt+0x2d/0x34
 [<c1007502>] ? default_idle+0x2/0x3
 [<c1007a0f>] ? arch_cpu_idle+0x6/0x7
 [<c1046686>] ? cpu_startup_entry+0xeb/0x211
 [<c135a95c>] ? start_kernel+0x2d2/0x2d5
---[ end trace 71b9cfb317b62846 ]---
r8169 :10:00.0 eth0: link up

Is this error related to ASPM?

Any responses are welcome.

TIA

Here is the dmesg output:

Linux version 3.18.0 (root@Pleiadi) (gcc version 4.7.1 (GCC) ) #1 SMP Tue Dec 9 16:46:26 CET 2014
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x-0x0009fbff] usable
BIOS-e820: [mem 0x0009fc00-0x0009] reserved
BIOS-e820: [mem 0x000e8000-0x000f] reserved
BIOS-e820: [mem 0x0010-0xbffc62ff] usable
BIOS-e820: [mem 0xbffc6300-0xbfff] reserved
BIOS-e820: [mem 0xf000-0xf3ff] reserved
BIOS-e820: [mem 0xfec0-0xfed3] reserved
BIOS-e820: [mem 0xfed45000-0x] reserved
Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
SMBIOS 2.4 present.
DMI: Hewlett-Packard HP xw4400 Workstation/0A68h, BIOS 786D7 v02.07 10/28/2010
e820: update [mem 0x-0x0fff] usable ==> reserved
e820: remove [mem 0x000a-0x000f] usable
e820: last_pfn = 0xbffc6 max_arch_pfn = 0x10
MTRR default type: uncachable
MTRR fixed ranges enabled:
 0-9 write-back
 A-B uncachable
 C-E3FFF write-protect
 E4000-E write-back
 F-F write-protect
MTRR variable ranges enabled:
 0 base 0 mask F8000 write-back
 1 base 08000 mask FC000 write-back
 2 disabled
 3 disabled
 4 disabled
 5 disabled
 6 disabled
 7 disabled
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
found SMP MP-table at [mem 0x000f9bf0-0x000f9bff] mapped at [c00f9bf0]
initial memory mapped: [mem 0x-0x017f]
Base memory trampoline at [c009b000] 9b000 size 16384
init_memory_mapping: [mem 0x-0x000f]
 [mem 0x-0x000f] page 4k
init_memory_mapping: [mem 0x3700-0x373f]
 [mem 0x3700-0x373f] page 2M
init_memory_mapping: [mem 0x3000-0x36ff]
 [mem 0x3000-0x36ff] page 2M
init_memory_mapping: [mem 0x0010-0x2fff]
 [mem 0x0010-0x003f] page 4k
 [mem 0x0040-0x2fff] page 2M
init_memory_mapping: [mem 0x3740-0x377fdfff]
 [mem 0x3740-0x377fdfff] page 4k
BRK [0x0141c000, 0x0141cfff] PGTABLE
ACPI: Early table checksum verification disabled
ACPI: RSDP 0x000E7810 14 (v00 COMPAQ)
ACPI: RSDT 0xBFFC6340 44 (v01 HPQOEM SLIC-WKS 20101028 )
ACPI: FACP 0xBFFC63EC 74 (v01 COMPAQ GLENWOOD 0001 )
ACPI: DSDT 0xBFFC6763 00037B (v01 COMPAQ DSDT_PRJ 0001 MSFT 010E)
ACPI: FACS 0xBFFC6300 40
ACPI: SSDT 0xBFFC6ADE 008C6C (v01 COMPAQ DSDT_HW 0001 MSFT 010E)
ACPI: APIC 0xBFFC6460 84 (v01 COMPAQ GLENWOOD 0001 )
ACPI: ASF! 0xBFFC64E4 63 (v32 COMPAQ GLENWOOD 0001 )
ACPI: MCFG 0xBFFC6547 3C (v01 COMPAQ GLENW
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
Hi,

I still haven't managed to crack this problem. I've made sure the skbs mentioned below look the same as the other ones: a linear buffer with the header, and the rest aggregated into frags. Using the skb destructor, I've also checked that these packets are all freed before the TX hang happens. So the only difference from current upstream is that the pages are grant mapped into Dom0 instead of being grant copied to a local page.

I've also found some of my older notes about this issue, where I managed to reproduce this on igb, and in that particular case the TX hang could be solved with ifconfig down/up. Do the "Detected Tx Unit Hang" messages give any hint to igb developers?

Nov 26 04:18:34 localhost kernel: [ 7814.197868] ------------[ cut here ]------------
Nov 26 04:18:34 localhost kernel: [ 7814.197889] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x165/0x220()
Nov 26 04:18:34 localhost kernel: [ 7814.197892] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
Nov 26 04:18:34 localhost kernel: [ 7814.197894] Modules linked in: tun nfsv3 nfs_acl nfs fscache dm_multipath scsi_dh lockd sunrpc openvswitch ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables nls_utf8 isofs dm_mirror video backlight sbs sbshc hed acpi_ipmi ipmi_msghandler nvram sg psmouse serio_raw igb i2c_algo_bit ptp pps_core hpilo tpm_tis tpm tpm_bios lpc_ich mfd_core ehci_pci crc32_pclmul aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul dm_region_hash dm_log dm_mod shpchp hpsa sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd fbcon font tileblit bitblit softcursor [last unloaded: microcode]
Nov 26 04:18:34 localhost kernel: [ 7814.197957] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.10.11-0.xs1.8.50.127.377543 #1
Nov 26 04:18:34 localhost kernel: [ 7814.197959] Hardware name: HP ProLiant BL420c Gen8, BIOS I30 12/14/2012
Nov 26 04:18:34 localhost kernel: [ 7814.197962] e5cd9e10 c13e4c55 e5cd9ddc c1278546 e5cd9e00 c1047fd3 c1643220 e5cd9e2c
Nov 26 04:18:34 localhost kernel: [ 7814.197969] 00ff c13e4c55 e1fa8700 0007 04e2 e5cd9e18 c1048093 0009
Nov 26 04:18:34 localhost kernel: [ 7814.197975] e5cd9e10 c1643220 e5cd9e2c e5cd9e50 c13e4c55 c163fe6b 00ff c1643220
Nov 26 04:18:34 localhost kernel: [ 7814.197982] Call Trace:
Nov 26 04:18:34 localhost kernel: [ 7814.197988] [<c13e4c55>] ? dev_watchdog+0x165/0x220
Nov 26 04:18:34 localhost kernel: [ 7814.197994] [<c1278546>] dump_stack+0x16/0x20
Nov 26 04:18:34 localhost kernel: [ 7814.198000] [<c1047fd3>] warn_slowpath_common+0x63/0x80
Nov 26 04:18:34 localhost kernel: [ 7814.198003] [<c13e4c55>] ? dev_watchdog+0x165/0x220
Nov 26 04:18:34 localhost kernel: [ 7814.198007] [<c1048093>] warn_slowpath_fmt+0x33/0x40
Nov 26 04:18:34 localhost kernel: [ 7814.198011] [<c13e4c55>] dev_watchdog+0x165/0x220
Nov 26 04:18:34 localhost kernel: [ 7814.198017] [<c13e4af0>] ? dev_activate+0x110/0x110
Nov 26 04:18:34 localhost kernel: [ 7814.198020] [<c1055c18>] call_timer_fn+0x58/0xe0
Nov 26 04:18:34 localhost kernel: [ 7814.198024] [<c1056ce8>] run_timer_softirq+0x1a8/0x1f0
Nov 26 04:18:34 localhost kernel: [ 7814.198028] [<c12fb61d>] ? info_for_irq+0xd/0x20
Nov 26 04:18:34 localhost kernel: [ 7814.198031] [<c12fbb6c>] ? evtchn_from_irq+0x3c/0x50
Nov 26 04:18:34 localhost kernel: [ 7814.198034] [<c13e4af0>] ? dev_activate+0x110/0x110
Nov 26 04:18:34 localhost kernel: [ 7814.198038] [<c104fcb9>] __do_softirq+0xd9/0x1e0
Nov 26 04:18:34 localhost kernel: [ 7814.198041] [<c12fc045>] ? __xen_evtchn_do_upcall+0x245/0x280
Nov 26 04:18:34 localhost kernel: [ 7814.198045] [<c104fe41>] irq_exit+0x41/0x80
Nov 26 04:18:34 localhost kernel: [ 7814.198048] [<c12fc0e5>] xen_evtchn_do_upcall+0x25/0x30
Nov 26 04:18:34 localhost kernel: [ 7814.198053] [<c147b287>] xen_do_upcall+0x7/0xc
Nov 26 04:18:34 localhost kernel: [ 7814.198058] [<c10c00d8>] ? rcu_process_gp_end+0x58/0x70
Nov 26 04:18:34 localhost kernel: [ 7814.198061] [<c10013a7>] ? xen_hypercall_sched_op+0x7/0x20
Nov 26 04:18:34 localhost kernel: [ 7814.198066] [<c1007ef2>] ? xen_safe_halt+0x12/0x20
Nov 26 04:18:34 localhost kernel: [ 7814.198070] [<c1015be6>] default_idle+0x56/0xb0
Nov 26 04:18:34 localhost kernel: [ 7814.198074] [<c10158e7>] arch_cpu_idle+0x17/0x30
Nov 26 04:18:34 localhost kernel: [ 7814.198078] [<c108e2ae>] cpu_startup_entry+0x15e/0x1d0
Nov 26 04:18:34 localhost kernel: [ 7814.198085] [<c1464282>] cpu_bringup_and_idle+0x12/0x20
Nov 26 04:18:34 localhost kernel: [ 7814.198088] ---[ end trace d8c0d3f5c187aa6b ]---

And the recovery:

Nov 26 21:47:54 localhost kernel: [70773.950715] ------------[ cut here ]------------
Nov 26 21:47:54 localhost kernel: [70773.950747] WARNING: at net/core/dev.c:4201 net_rx_action+0xfd/0x1c0()
Nov 26 21:47:54 localhost kernel: [70773.950751] Modules linked in: tun nfsv3 nfs_acl nfs fscache dm_multipath scsi_dh lockd sunrpc openvswitch ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables nls_utf8 isofs dm_mirror video backlight sbs sbshc hed acpi_ipmi ipmi_msghandler nvram sg psmouse serio_raw igb i2c_algo_bit pt
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 05/02/14 20:43, Andrew Cooper wrote:
> On 05/02/2014 20:23, Zoltan Kiss wrote:
>> On 04/02/14 19:47, Michael Chan wrote:
>>> On Fri, 2014-01-31 at 14:29 +0100, Zoltan Kiss wrote:
>>>> [ 5417.275472] WARNING: at net/sched/sch_generic.c:255
>>>> dev_watchdog+0x156/0x1f0()
>>>> [ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out
>>>
>>> The dump shows an internal IRQ pending on MSIX vector 2 which matches
>>> the queue number that is timing out. I don't know what happened to
>>> the MSIX and why the driver is not seeing it. Do you see an IRQ error
>>> message from the kernel a few seconds before the tx timeout message?
>>
>> I haven't seen any IRQ related error message. Note, this is on Xen
>> 4.3.1. Now I have new results with a reworked version of the patch,
>> unfortunately it still has this issue. Here is a bnx2 dump, lspci
>> output and some Xen debug output (MSI and interrupt bindings, I have
>> more if needed).
>
> You need debug-keys 'Q' as well to map between the PCI devices and Xen IRQs
>
> ~Andrew

I could only get it after a reboot:

(XEN) [2014-02-06 09:44:34] :02:00.0 - dom 0 - MSIs < 64 65 66 67 68 69 >

So the relevant MSI information:

(XEN) [2014-02-05 20:15:20] MSI-X 64 vec=d7 fixed edge assert physcpu dest=0022 mask=1/0/0
(XEN) [2014-02-05 20:15:20] MSI-X 65 vec=ba fixed edge assert physcpu dest= mask=1/0/0
(XEN) [2014-02-05 20:15:20] MSI-X 66 vec=92 fixed edge assert physcpu dest=0022 mask=1/0/0
(XEN) [2014-02-05 20:15:20] MSI-X 67 vec=3a fixed edge assert physcpu dest=0021 mask=1/0/0
(XEN) [2014-02-05 20:15:20] MSI-X 68 vec=b8 fixed edge assert physcpu dest=0022 mask=1/0/0
(XEN) [2014-02-05 20:15:20] MSI-X 69 vec=2a fixed edge assert physcpu dest=0020 mask=1/1/1
...
(XEN) [2014-02-05 20:15:22]IRQ: 64 affinity:0004 vec:d7 type=PCI-MSI/-X status=0030 in-flight=0 domain-list=0:304(---),
(XEN) [2014-02-05 20:15:22]IRQ: 65 affinity:0100 vec:ba type=PCI-MSI/-X status=0010 in-flight=0 domain-list=0:303(---),
(XEN) [2014-02-05 20:15:22]IRQ: 66 affinity:0004 vec:92 type=PCI-MSI/-X status=0010 in-flight=0 domain-list=0:302(---),
(XEN) [2014-02-05 20:15:22]IRQ: 67 affinity:0002 vec:3a type=PCI-MSI/-X status=0010 in-flight=0 domain-list=0:301(---),
(XEN) [2014-02-05 20:15:22]IRQ: 68 affinity:0004 vec:b8 type=PCI-MSI/-X status=0030 in-flight=0 domain-list=0:300(---),
(XEN) [2014-02-05 20:15:22]IRQ: 69 affinity:0001 vec:2a type=PCI-MSI/-X status=0002 mapped, unbound

Zoli
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
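For anyone wanting to line up the MSI-X entries in Xen output like the above against the queue that timed out, here is a minimal user-space sketch that pulls the IRQ number, vector and the three mask fields out of one "MSI-X" line. The struct and function names are made up for illustration, and the parser assumes the exact field layout shown in this message (numeric mask fields); it does not interpret what the fields mean.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the fields of one "(XEN) ... MSI-X N vec=.." line. */
struct msix_entry_info {
    unsigned int irq;      /* number printed after "MSI-X" */
    unsigned int vec;      /* hex value printed after "vec=" */
    unsigned int mask[3];  /* the three numbers printed after "mask=" */
};

/*
 * Parse one line of the Xen MSI dump as it appears in this thread.
 * Returns 1 on success, 0 if the line does not have the expected shape.
 */
static int parse_xen_msix_line(const char *line, struct msix_entry_info *out)
{
    const char *entry;
    const char *mask;

    assert(line != NULL && out != NULL);

    entry = strstr(line, "MSI-X ");
    if (entry == NULL ||
        sscanf(entry, "MSI-X %u vec=%x", &out->irq, &out->vec) != 2)
        return 0;

    mask = strstr(entry, "mask=");
    if (mask == NULL ||
        sscanf(mask, "mask=%u/%u/%u",
               &out->mask[0], &out->mask[1], &out->mask[2]) != 3)
        return 0;

    return 1;
}
```

Feeding each "(XEN)" line through parse_xen_msix_line() and printing the entries whose mask fields differ from the others makes IRQ 69 (mask=1/1/1) stand out from IRQs 64 to 68 (mask=1/0/0) in the listing above.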
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 05/02/2014 20:23, Zoltan Kiss wrote:
> On 04/02/14 19:47, Michael Chan wrote:
>> On Fri, 2014-01-31 at 14:29 +0100, Zoltan Kiss wrote:
>>> [ 5417.275472] WARNING: at net/sched/sch_generic.c:255
>>> dev_watchdog+0x156/0x1f0()
>>> [ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out
>>
>> The dump shows an internal IRQ pending on MSIX vector 2 which matches
>> the queue number that is timing out. I don't know what happened to
>> the MSIX and why the driver is not seeing it. Do you see an IRQ error
>> message from the kernel a few seconds before the tx timeout message?
>
> I haven't seen any IRQ related error message. Note, this is on Xen
> 4.3.1. Now I have new results with a reworked version of the patch,
> unfortunately it still has this issue. Here is a bnx2 dump, lspci
> output and some Xen debug output (MSI and interrupt bindings, I have
> more if needed).

You need debug-keys 'Q' as well to map between the PCI devices and Xen IRQs

~Andrew
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 05/02/14 20:23, Zoltan Kiss wrote:
> On 04/02/14 19:47, Michael Chan wrote:
>> On Fri, 2014-01-31 at 14:29 +0100, Zoltan Kiss wrote:
>>> [ 5417.275472] WARNING: at net/sched/sch_generic.c:255
>>> dev_watchdog+0x156/0x1f0()
>>> [ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out
>>
>> The dump shows an internal IRQ pending on MSIX vector 2 which matches
>> the queue number that is timing out. I don't know what happened to
>> the MSIX and why the driver is not seeing it. Do you see an IRQ error
>> message from the kernel a few seconds before the tx timeout message?
>
> I haven't seen any IRQ related error message. Note, this is on Xen
> 4.3.1. Now I have new results with a reworked version of the patch,
> unfortunately it still has this issue. Here is a bnx2 dump, lspci
> output and some Xen debug output (MSI and interrupt bindings, I have
> more if needed).

And here is the watchdog message and the first dump, if it matters:

[10118.282007] ------------[ cut here ]------------
[10118.282018] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x156/0x1f0()
[10118.282021] NETDEV WATCHDOG: eth0 (bnx2): transmit queue 0 timed out
[10118.282024] Modules linked in: tun nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc ipv6 openvswitch(O) frag_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables sr_mod cdrom nls_utf8 isofs dm_multipath scsi_dh dm_mod usb_storage lk_helper cryptd lrw aes_i586 xts gf128mul coretemp microcode hid_generic lpc_ich mfd_core ehci_pci ehci_hcd i7core_edac edac_core bnx2 sg hed u scsi_transport_sas raid_class scsi_mod
[10118.282083] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G O 3.10.11-0.xs1.8.50.175.377583 #1
[10118.282086] Hardware name: Dell Inc. PowerEdge M710HD/05GGXD, BIOS 2.0.0 01/31/2011
[10118.282089] 00ff ee0a5dd0 c1488cd3 ee0a5df8 c1046664 c1658a88 ee0a5e24 00ff
[10118.282097] c13fc1c6 c13fc1c6 ec778000 00256a1c ee0a5e10 c1046723 0009
[10118.282104] ee0a5e08 c1658a88 ee0a5e24 ee0a5e48 c13fc1c6 c16556e1 00ff c1658a88
[10118.282112] Call Trace:
[10118.282118] [c1488cd3] dump_stack+0x16/0x1b
[10118.282125] [c1046664] warn_slowpath_common+0x64/0x80
[10118.282129] [c13fc1c6] ? dev_watchdog+0x156/0x1f0
[10118.282133] [c13fc1c6] ? dev_watchdog+0x156/0x1f0
[10118.282137] [c1046723] warn_slowpath_fmt+0x33/0x40
[10118.282141] [c13fc1c6] dev_watchdog+0x156/0x1f0
[10118.282149] [c10549ce] call_timer_fn+0x3e/0xf0
[10118.282155] [c10013a7] ? xen_hypercall_sched_op+0x7/0x20
[10118.282159] [c13fc070] ? __netdev_watchdog_up+0x60/0x60
[10118.282164] [c1055c1b] run_timer_softirq+0x1ab/0x210
[10118.282169] [c10be4fd] ? irq_get_irq_data+0xd/0x10
[10118.282176] [c130fb2d] ? info_for_irq+0xd/0x20
[10118.282180] [c13fc070] ? __netdev_watchdog_up+0x60/0x60
[10118.282184] [c104e3f4] __do_softirq+0xc4/0x200
[10118.282189] [c1312316] ? evtchn_fifo_handle_events+0xf6/0x120
[10118.282193] [c104e5bd] irq_exit+0x3d/0x90
[10118.282198] [c130fe55] xen_evtchn_do_upcall+0x25/0x40
[10118.282203] [c14935c7] xen_do_upcall+0x7/0xc
[10118.282207] [c10013a7] ? xen_hypercall_sched_op+0x7/0x20
[10118.282213] [c1007f12] ? xen_safe_halt+0x12/0x20
[10118.282218] [c1015eff] default_idle+0x3f/0xb0
[10118.28] [c1015a17] arch_cpu_idle+0x17/0x30
[10118.282229] [c108f591] cpu_startup_entry+0x141/0x1f0
[10118.282234] [c147d11b] cpu_bringup_and_idle+0x12/0x14
[10118.282237] ---[ end trace 25ed24391f6c7acd ]---
[10118.282242] bnx2 :02:00.0 eth0: <--- start FTQ dump --->
[10118.282267] bnx2 :02:00.0 eth0: RV2P_PFTQ_CTL 0001
[10118.282277] bnx2 :02:00.0 eth0: RV2P_TFTQ_CTL 0002
[10118.282288] bnx2 :02:00.0 eth0: RV2P_MFTQ_CTL 4000
[10118.282298] bnx2 :02:00.0 eth0: TBDR_FTQ_CTL 4002
[10118.282309] bnx2 :02:00.0 eth0: TDMA_FTQ_CTL 00010002
[10118.282319] bnx2 :02:00.0 eth0: TXP_FTQ_CTL 01810002
[10118.282330] bnx2 :02:00.0 eth0: TXP_FTQ_CTL 01810002
[10118.282340] bnx2 :02:00.0 eth0: TPAT_FTQ_CTL 00010002
[10118.282372] bnx2 :02:00.0 eth0: RXP_CFTQ_CTL 8000
[10118.282383] bnx2 :02:00.0 eth0: RXP_FTQ_CTL 0010
[10118.282392] bnx2 :02:00.0 eth0: COM_COMXQ_FTQ_CTL 0001
[10118.282403] bnx2 :02:00.0 eth0: COM_COMTQ_FTQ_CTL 0002
[10118.282414] bnx2 :02:00.0 eth0: COM_COMQ_FTQ_CTL 0001
[10118.282425] bnx2 :02:00.0 eth0: CP_CPQ_FTQ_CTL 4000
[10118.282434] bnx2 :02:00.0 eth0: CPU states:
[10118.282449] bnx2 :02:00.0 eth0: 045000 mode b84c state 80001000 evt_mask 500 pc 8000844 pc 80012bc instr a0e00012
[10118.282471] bnx2 :02:00.0 eth0: 085000 mode b84c state 80001000 evt_mask 500 pc 8000a50 pc 8000ac4 instr 38420001
[10118.282493] bnx2 :02:00.0 eth0: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c18 instr 32070001
[10118.282515] bnx2 :02:00.0 eth0: 105000 mode b8cc state 8000 evt_mask 500 pc 8000a9c pc 8000b28 instr 8c53
[10118.282537] bnx2 :02:00.0 eth0: 145000 mode b880 state 8000 evt_mask 500 pc 800d1a8 pc 800af74 instr 441010a
[10118.282560] bnx2 :02:00.0 eth0: 185000 mode b8cc state 8000 evt_mask 500 pc 8000918 pc 8000928 instr 8f
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 04/02/14 19:47, Michael Chan wrote:
> On Fri, 2014-01-31 at 14:29 +0100, Zoltan Kiss wrote:
>> [ 5417.275472] WARNING: at net/sched/sch_generic.c:255
>> dev_watchdog+0x156/0x1f0()
>> [ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out
>
> The dump shows an internal IRQ pending on MSIX vector 2 which matches
> the queue number that is timing out. I don't know what happened to
> the MSIX and why the driver is not seeing it. Do you see an IRQ error
> message from the kernel a few seconds before the tx timeout message?

I haven't seen any IRQ related error message. Note, this is on Xen 4.3.1. Now I have new results with a reworked version of the patch, unfortunately it still has this issue. Here is a bnx2 dump, lspci output and some Xen debug output (MSI and interrupt bindings, I have more if needed).

[82099.288743] bnx2 :02:00.0 eth0: <--- start FTQ dump --->
[82099.288767] bnx2 :02:00.0 eth0: RV2P_PFTQ_CTL 00010002
[82099.288779] bnx2 :02:00.0 eth0: RV2P_TFTQ_CTL 0002
[82099.288790] bnx2 :02:00.0 eth0: RV2P_MFTQ_CTL 4000
[82099.288801] bnx2 :02:00.0 eth0: TBDR_FTQ_CTL 00404002
[82099.288812] bnx2 :02:00.0 eth0: TDMA_FTQ_CTL 00010002
[82099.288823] bnx2 :02:00.0 eth0: TXP_FTQ_CTL 00810002
[82099.288834] bnx2 :02:00.0 eth0: TXP_FTQ_CTL 01010002
[82099.288845] bnx2 :02:00.0 eth0: TPAT_FTQ_CTL 00010002
[82099.288878] bnx2 :02:00.0 eth0: RXP_CFTQ_CTL 8000
[82099.29] bnx2 :02:00.0 eth0: RXP_FTQ_CTL 0012
[82099.288899] bnx2 :02:00.0 eth0: COM_COMXQ_FTQ_CTL 0001
[82099.288911] bnx2 :02:00.0 eth0: COM_COMTQ_FTQ_CTL 0002
[82099.288923] bnx2 :02:00.0 eth0: COM_COMQ_FTQ_CTL 0001
[82099.288934] bnx2 :02:00.0 eth0: CP_CPQ_FTQ_CTL 4000
[82099.288944] bnx2 :02:00.0 eth0: CPU states:
[82099.288960] bnx2 :02:00.0 eth0: 045000 mode b84c state 80005000 evt_mask 500 pc 8001284 pc 8000cb8 instr 35690100
[82099.288984] bnx2 :02:00.0 eth0: 085000 mode b84c state 80001000 evt_mask 500 pc 8000a58 pc 8000a4c instr 38420001
[82099.289007] bnx2 :02:00.0 eth0: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c14 instr 32050003
[82099.289030] bnx2 :02:00.0 eth0: 105000 mode b8cc state 8000 evt_mask 500 pc 8000a94 pc 8000a94 instr 8c420020
[82099.289063] bnx2 :02:00.0 eth0: 145000 mode b880 state 8000 evt_mask 500 pc 800d244 pc 8008aac instr 8c46
[82099.289087] bnx2 :02:00.0 eth0: 185000 mode b8cc state 8000 evt_mask 500 pc 8000c6c pc 8000c6c instr 3c056000
[82099.289103] bnx2 :02:00.0 eth0: <--- end FTQ dump --->
[82099.289112] bnx2 :02:00.0 eth0: <--- start TBDC dump --->
[82099.289124] bnx2 :02:00.0 eth0: TBDC free cnt: 31
[82099.289133] bnx2 :02:00.0 eth0: LINE CID BIDX CMD VALIDS
[82099.289148] bnx2 :02:00.0 eth0: 00000800 a3b8 00[1]
[82099.289163] bnx2 :02:00.0 eth0: 01001100 1b58 00[0]
[82099.289178] bnx2 :02:00.0 eth0: 02000800 a390 00[0]
[82099.289193] bnx2 :02:00.0 eth0: 03000800 a370 00[0]
[82099.289217] bnx2 :02:00.0 eth0: 04000800 a378 00[0]
[82099.289232] bnx2 :02:00.0 eth0: 05000800 a388 00[0]
[82099.289247] bnx2 :02:00.0 eth0: 06000800 a398 00[0]
[82099.289262] bnx2 :02:00.0 eth0: 07000800 a3a8 00[0]
[82099.289277] bnx2 :02:00.0 eth0: 08000800 a3b0 00[0]
[82099.289291] bnx2 :02:00.0 eth0: 09000800 a3b8 00[0]
[82099.289306] bnx2 :02:00.0 eth0: 0a000800 8c10 00[0]
[82099.289321] bnx2 :02:00.0 eth0: 0b000800 eaf0 00[0]
[82099.289336] bnx2 :02:00.0 eth0: 0c000800 eaf8 00[0]
[82099.289351] bnx2 :02:00.0 eth0: 0d001100 5e60 00[0]
[82099.289365] bnx2 :02:00.0 eth0: 0e001100 5e68 00[0]
[82099.289380] bnx2 :02:00.0 eth0: 0f001100 5e70 00[0]
[82099.289395] bnx2 :02:00.0 eth0: 10001100 5e88 00[0]
[82099.289410] bnx2 :02:00.0 eth0: 11001100 5e90 00[0]
[82099.289425] bnx2 :02:00.0 eth0: 12001100 5ee8 00[0]
[82099.289440] bnx2 :02:00.0 eth0: 13001100 5ef8 00[0]
[82099.289454] bnx2 :02:00.0 eth0: 14001100 5e00 00[0]
[82099.289470] bnx2 :02:00.0 eth0: 15001100 5a20 00[0]
[82099.289485] bnx2 :02:00.0 eth0: 16001100 59a8 00[0]
[82099.289499] bnx2 :02:00.0 eth0: 17001100 59b0 00[0]
[82099.289514] bnx2 :02:00.0 eth0: 18001100 59b8 00[0]
[82099.289529] bnx2 :02:00.0 eth0: 19001100 5a28 00[0]
[82099.289544] bnx2 :02:00.0 eth0: 1a001100 5a30 00[0]
[82099.289559] bnx2 :02:00.0 eth0: 1b000800 8c58 00[0]
[82099.289573] bnx2 :02:00.0 eth0: 1c000800 8c60 00[0]
[82099.289588] bnx2 :02:00.0 eth0: 1d055e80 dca8 fb[0]
[82099.289603] bnx2 :02:00.0 eth0: 1e1cf780 f7b8 af[0]
[82099
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 31/01/14 18:56, Wei Liu wrote:
> On Thu, Jan 30, 2014 at 07:08:11PM +, Zoltan Kiss wrote:
>> Hi,
>>
>> I've experienced some queue timeout problems mentioned in the
>> subject with igb and bnx2 cards. I haven't seen them on other cards
>> so far. I'm using XenServer with 3.10 Dom0 kernel (however igb was
>> already updated to the latest version), and there are Windows guests
>> sending data through these cards. I noticed these problems in XenRT
>> test runs, and I know that they usually mean some lost interrupt
>> problem or other hardware error, but in my case they started to
>> appear more often, and they are likely connected to my netback grant
>> mapping patches. These patches cause skbs with huge (~64KB)
>> linear buffers to appear more often.
>> The reason for that is an old problem in the ring protocol:
>> originally the maximum number of slots was linked to MAX_SKB_FRAGS,
>> as every slot ended up as a frag of the skb. When this value was
>> changed, netback had to cope with the situation by coalescing the
>> packets into fewer frags.
>> My patch series takes a different approach: the leftover slots
>> (pages) are assigned to a new skb's frags, and that skb is
>> stashed on the frag_list of the first one. Then, before sending it
>> off to the stack, it calls skb = skb_copy_expand(skb, 0, 0,
>> GFP_ATOMIC | __GFP_NOWARN), which basically creates a new skb and
>> copies all the data into it. As far as I understand, it puts
>> everything into the linear buffer, which can amount to 64KB at most.
>> The original skb is then freed, and the new one is sent to the
>> stack.
>
> Just my two cents, if it is this case, you can try to call
> skb_copy_expand on every SKB netback receives to manually create SKBs
> with ~64KB linear buffer to see how it goes...

I've tried it, and it broke everything in a similar way, so that's a strong clue that the problem lies here. I've rewritten that part of my patches to make fewer modifications, based on Malcolm's idea: netback pulls the first frag into the linear buffer, then moves a frag from the frag_list skb into the first one. That seems to help, but so far I have only one relevant test result; I'm waiting for more.

Zoli
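To make the slot arithmetic in the quoted description concrete, here is a small user-space model, not netback code. It assumes 4 KiB pages and a frag limit of 17 (both values are assumptions for illustration, as are all names below), and it ignores the header bytes that a real driver pulls into the linear area. It shows how many slots overflow onto a frag_list skb and how large the linear buffer gets if every slot is then copied into one flat skb.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this model only: 4 KiB pages, frag limit 65536/4096 + 1. */
#define MODEL_PAGE_SIZE     4096u
#define MODEL_MAX_SKB_FRAGS 17u

struct slot_split {
    unsigned int first_skb_frags;  /* slots kept as frags of the first skb */
    unsigned int frag_list_frags;  /* leftover slots parked on the frag_list skb */
};

/* Split a packet's ring slots between the first skb and the frag_list skb. */
static struct slot_split split_slots(unsigned int nr_slots)
{
    struct slot_split s;

    s.first_skb_frags = nr_slots < MODEL_MAX_SKB_FRAGS
                      ? nr_slots : MODEL_MAX_SKB_FRAGS;
    s.frag_list_frags = nr_slots - s.first_skb_frags;
    return s;
}

/*
 * Length of the linear buffer if every slot is copied into one flat skb,
 * which is the effect the message attributes to skb_copy_expand().
 */
static size_t linear_len_after_full_copy(const size_t *slot_len,
                                         unsigned int nr_slots)
{
    size_t total = 0;
    unsigned int i;

    for (i = 0; i < nr_slots; i++) {
        assert(slot_len[i] <= MODEL_PAGE_SIZE);  /* one slot is at most a page */
        total += slot_len[i];
    }
    return total;
}
```

Under these assumptions a packet of 18 slots leaves one slot for the frag_list skb, and sixteen full pages copied flat give a 65536-byte linear buffer, which is the "~64KB" case discussed above.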
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On Fri, 2014-01-31 at 14:29 +0100, Zoltan Kiss wrote:
> [ 5417.275472] WARNING: at net/sched/sch_generic.c:255
> dev_watchdog+0x156/0x1f0()
> [ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out

The dump shows an internal IRQ pending on MSIX vector 2 which matches the queue number that is timing out. I don't know what happened to the MSIX and why the driver is not seeing it. Do you see an IRQ error message from the kernel a few seconds before the tx timeout message?
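As background for the warning being discussed, here is a simplified user-space model of the condition dev_watchdog() checks before it prints "transmit queue N timed out": the device is present, running and has carrier, and some TX queue is stopped with its last transmit older than the watchdog timeout. The struct and function names are invented for this sketch; the real code in net/sched/sch_generic.c works on struct net_device and jiffies and also calls the driver's tx timeout handler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one TX queue. */
struct model_txq {
    bool stopped;               /* the driver has stopped this queue */
    unsigned long trans_start;  /* tick of the last transmit on this queue */
};

/* Hypothetical stand-in for the device state the watchdog looks at. */
struct model_netdev {
    bool present;
    bool running;
    bool carrier_ok;
    unsigned long watchdog_timeo;   /* timeout in ticks */
    size_t num_tx_queues;
    const struct model_txq *txq;
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Return the index of the first timed-out queue, or -1 if there is none. */
static int first_timed_out_queue(const struct model_netdev *dev,
                                 unsigned long now)
{
    size_t i;

    /* No check at all unless the device is up and has carrier. */
    if (!dev->present || !dev->running || !dev->carrier_ok)
        return -1;

    for (i = 0; i < dev->num_tx_queues; i++) {
        const struct model_txq *q = &dev->txq[i];

        if (q->stopped &&
            tick_after(now, q->trans_start + dev->watchdog_timeo))
            return (int)i;
    }
    return -1;
}
```

In this model a queue that stays stopped because its completion interrupt never arrives is reported once the timeout elapses, which fits the lost-MSI-X theory above; a device with carrier off is never reported.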
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On Thu, Jan 30, 2014 at 07:08:11PM +, Zoltan Kiss wrote:
> Hi,
>
> I've experienced some queue timeout problems mentioned in the
> subject with igb and bnx2 cards. I haven't seen them on other cards
> so far. I'm using XenServer with 3.10 Dom0 kernel (however igb were
> already updated to latest version), and there are Windows guests
> sending data through these cards. I noticed these problems in XenRT
> test runs, and I know that they usually mean some lost interrupt
> problem or other hardware error, but in my case they started to
> appear more often, and they are likely connected to my netback grant
> mapping patches. These patches causing skb's with huge (~64kb)
> linear buffers to appear more often.
> The reason for that is an old problem in the ring protocol:
> originally the maximum amount of slots were linked to MAX_SKB_FRAGS,
> as every slot ended up as a frag of the skb. When this value were
> changed, netback had to cope with the situation by coalescing the
> packets into fewer frags.
> My patch series take a different approach: the leftover slots
> (pages) were assigned to a new skb's frags, and that skb were
> stashed to the frag_list of the first one. Then, before sending it
> off to the stack it calls skb = skb_copy_expand(skb, 0, 0,
> GFP_ATOMIC, __GFP_NOWARN), which basically creates a new skb and
> copied all the data into it. As far as I understood, it put
> everything into the linear buffer, which can amount to 64KB at most.
> The original skb are freed then, and this new one were sent to the
> stack.

Just my two cents: if this is the case, you can try calling skb_copy_expand on every SKB netback receives, to manually create SKBs with a ~64KB linear buffer and see how it goes...

Wei.

> I suspect that this is the problem as it only happens when guests
> send too much slots. Does anyone familiar with these drivers have
> seen such issue before? (when these kind of skb's get stucked in the
> queue)
>
> Regards,
>
> Zoltan Kiss
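As a rough illustration of the experiment Wei suggests, the following userspace sketch copies a fragmented packet into a single linear buffer. It is only a model: struct model_pkt, model_pkt_len() and model_pkt_linearize_copy() are hypothetical names, not kernel APIs, and the 16 x 4096-byte fragment layout is an assumption chosen so that a full packet adds up to the ~64KB figure from the thread.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a fragmented packet: a linear
 * head plus page-sized fragments. This is NOT the kernel's sk_buff. */
#define MODEL_PAGE_SIZE 4096u
#define MODEL_MAX_FRAGS 16u   /* assumption: 16 * 4096 = 64 KiB */

struct model_frag {
    const unsigned char *data;
    size_t len;
};

struct model_pkt {
    unsigned char *head;                       /* linear area */
    size_t head_len;                           /* bytes in the linear area */
    struct model_frag frags[MODEL_MAX_FRAGS];  /* paged data */
    size_t nr_frags;
};

/* Total payload: linear bytes plus all fragment bytes. */
size_t model_pkt_len(const struct model_pkt *p)
{
    size_t total = p->head_len;

    for (size_t i = 0; i < p->nr_frags; i++)
        total += p->frags[i].len;
    return total;
}

/* Build a new packet whose data sits entirely in one linear buffer,
 * which is the effect the thread attributes to skb_copy_expand().
 * Returns NULL on allocation failure; the caller frees head and packet. */
struct model_pkt *model_pkt_linearize_copy(const struct model_pkt *src)
{
    size_t total = model_pkt_len(src);
    size_t off = 0;
    struct model_pkt *dst = calloc(1, sizeof(*dst));

    if (!dst)
        return NULL;
    dst->head = malloc(total ? total : 1);
    if (!dst->head) {
        free(dst);
        return NULL;
    }
    if (src->head_len) {
        memcpy(dst->head, src->head, src->head_len);
        off = src->head_len;
    }
    for (size_t i = 0; i < src->nr_frags; i++) {
        if (src->frags[i].len) {
            memcpy(dst->head + off, src->frags[i].data, src->frags[i].len);
            off += src->frags[i].len;
        }
    }
    dst->head_len = total;   /* everything is linear now */
    dst->nr_frags = 0;
    return dst;
}
```

In this model a packet with 16 full fragments ends up with a 65536-byte linear area and no fragments, which is the shape of buffer the thread suspects the igb and bnx2 transmit paths handle badly.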
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On 30/01/14 21:34, Michael Chan wrote:
> On Thu, 2014-01-30 at 19:08 +, Zoltan Kiss wrote:
>> I've experienced some queue timeout problems mentioned in the subject
>> with igb and bnx2 cards.
>
> Please provide the full tx timeout dmesg. bnx2 dumps some diagnostic
> information during tx timeout that may be useful. Thanks.

Hi,

Here is some:

[ 5417.275463] ------------[ cut here ]------------
[ 5417.275472] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x156/0x1f0()
[ 5417.275474] NETDEV WATCHDOG: eth1 (bnx2): transmit queue 2 timed out
[ 5417.275476] Modules linked in: tun nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc ipv6 openvswitch(O) ipt_REJECT nf_conntrack_ipv rack xt_tcpudp iptable_filter ip_tables x_tables nls_utf8 isofs dm_multipath scsi_dh dm_mod dcdbas coretemp microcode psmouse serio_raw lpc_ich mfd_core hid_generic ehci_p sg hed bnx2 usbhid hid sr_mod cdrom sd_mod pata_acpi ata_generic ata_piix libata uhci_hcd mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 5417.275517] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G O 3.10.11-0.xs1.8.50.170.377582 #1
[ 5417.275518] Hardware name: Dell Inc. PowerEdge R710/00W9X3, BIOS 1.2.6 07/17/2009
[ 5417.275520] 00ff f008be08 c1488c53 f008be30 c1046664 c1658a88 f008be5c 00ff
[ 5417.275525] c13fc146 c13fc146 ee96a000 0002 00137d44 f008be48 c1046723 0009
[ 5417.275530] f008be40 c1658a88 f008be5c f008be80 c13fc146 c16556e1 00ff c1658a88
[ 5417.275535] Call Trace:
[ 5417.275539] [c1488c53] dump_stack+0x16/0x1b
[ 5417.275544] [c1046664] warn_slowpath_common+0x64/0x80
[ 5417.275546] [c13fc146] ? dev_watchdog+0x156/0x1f0
[ 5417.275549] [c13fc146] ? dev_watchdog+0x156/0x1f0
[ 5417.275551] [c1046723] warn_slowpath_fmt+0x33/0x40
[ 5417.275554] [c13fc146] dev_watchdog+0x156/0x1f0
[ 5417.275559] [c10549ce] call_timer_fn+0x3e/0xf0
[ 5417.275563] [c107293e] ? finish_task_switch+0x4e/0xb0
[ 5417.275565] [c13fbff0] ? __netdev_watchdog_up+0x60/0x60
[ 5417.275568] [c1055c1b] run_timer_softirq+0x1ab/0x210
[ 5417.275571] [c13fbff0] ? __netdev_watchdog_up+0x60/0x60
[ 5417.275574] [c104e3f4] __do_softirq+0xc4/0x200
[ 5417.275577] [c1493547] ? xen_do_upcall+0x7/0xc
[ 5417.275579] [c104e550] run_ksoftirqd+0x20/0x50
[ 5417.275582] [c106f182] smpboot_thread_fn+0x142/0x150
[ 5417.275586] [c1067a2b] kthread+0x9b/0xa0
[ 5417.275589] [c106f040] ? smpboot_create_threads+0x60/0x60
[ 5417.275591] [c107] ? cpu_rt_runtime_read+0x40/0x80
[ 5417.275594] [c1492f77] ret_from_kernel_thread+0x1b/0x28
[ 5417.275596] [c1067990] ? kthread_freezable_should_stop+0x60/0x60
[ 5417.275599] ---[ end trace 691f572d388226ca ]---
[ 5417.275602] bnx2 :01:00.1 eth1: <--- start FTQ dump --->
[ 5417.275622] bnx2 :01:00.1 eth1: RV2P_PFTQ_CTL 0001
[ 5417.275629] bnx2 :01:00.1 eth1: RV2P_TFTQ_CTL 0002
[ 5417.275636] bnx2 :01:00.1 eth1: RV2P_MFTQ_CTL 4000
[ 5417.275643] bnx2 :01:00.1 eth1: TBDR_FTQ_CTL 4002
[ 5417.275650] bnx2 :01:00.1 eth1: TDMA_FTQ_CTL 00010002
[ 5417.275657] bnx2 :01:00.1 eth1: TXP_FTQ_CTL 0001
[ 5417.275663] bnx2 :01:00.1 eth1: TXP_FTQ_CTL 0001
[ 5417.275670] bnx2 :01:00.1 eth1: TPAT_FTQ_CTL 0001
[ 5417.275677] bnx2 :01:00.1 eth1: RXP_CFTQ_CTL 8000
[ 5417.275684] bnx2 :01:00.1 eth1: RXP_FTQ_CTL 0010
[ 5417.275690] bnx2 :01:00.1 eth1: COM_COMXQ_FTQ_CTL 0001
[ 5417.275698] bnx2 :01:00.1 eth1: COM_COMTQ_FTQ_CTL 0002
[ 5417.275705] bnx2 :01:00.1 eth1: COM_COMQ_FTQ_CTL 0001
[ 5417.275712] bnx2 :01:00.1 eth1: CP_CPQ_FTQ_CTL 4000
[ 5417.275718] bnx2 :01:00.1 eth1: CPU states:
[ 5417.275730] bnx2 :01:00.1 eth1: 045000 mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
[ 5417.275746] bnx2 :01:00.1 eth1: 085000 mode b84c state 80005000 evt_mask 500 pc 8000a54 pc 8000a5c instr 10400016
[ 5417.275785] bnx2 :01:00.1 eth1: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c20 pc 8004c20 instr 32050003
[ 5417.275801] bnx2 :01:00.1 eth1: 105000 mode b8cc state 8000 evt_mask 500 pc 8000a8c pc 8000a94 instr 8c420020
[ 5417.275817] bnx2 :01:00.1 eth1: 145000 mode b880 state 8000 evt_mask 500 pc 8000ab0 pc 800d1e8 instr 27bd0020
[ 5417.275834] bnx2 :01:00.1 eth1: 185000 mode b8cc state 8000 evt_mask 500 pc 8000cb0 pc 8000930 instr 8ce800e8
[ 5417.275845] bnx2 :01:00.1 eth1: <--- end FTQ dump --->
[ 5417.275851] bnx2 :01:00.1 eth1: <--- start TBDC dump --->
[ 5417.275858] bnx2 :01:00.1 eth1: TBDC free cnt: 32
[ 5417.275864] bnx2 :01:00.1 eth1: LINE     CID  BIDX   CMD  VALIDS
[ 5417.275875] bnx2 :01:00.1 eth1: 00    001080  17c8   00   [0]
[ 5417.275886] bnx2 :01:00.1 eth1: 01    001080  17e0   00   [0]
[ 5417.275897] bnx2 :01:00.1 eth1: 02    001080  17e8   00   [0]
[ 5417.275907] bnx2 :01:00.1 eth1: 03    001080  17f8   00   [0]
[ 5417.275918] bnx2 :01:00.1 eth1: 04    001080  1800   00   [0]
[ 5417.275929] bnx2 :01:00.1 eth1: 05    001080  17d0   00   [0]
[ 5417.275940] bnx2 :01:00.1 eth1: 06    001080  17d8   00   [0]
[ 5417.275951]
Re: igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
On Thu, 2014-01-30 at 19:08 +, Zoltan Kiss wrote:
> I've experienced some queue timeout problems mentioned in the subject
> with igb and bnx2 cards.

Please provide the full tx timeout dmesg. bnx2 dumps some diagnostic information during tx timeout that may be useful. Thanks.
igb and bnx2: "NETDEV WATCHDOG: transmit queue timed out" when skb has huge linear buffer
Hi,

I've experienced the queue timeout problems mentioned in the subject with igb and bnx2 cards. I haven't seen them on other cards so far. I'm using XenServer with a 3.10 Dom0 kernel (although igb was already updated to the latest version), and there are Windows guests sending data through these cards. I noticed these problems in XenRT test runs, and I know that they usually mean a lost interrupt or some other hardware error, but in my case they started to appear more often, and they are likely connected to my netback grant mapping patches. These patches cause skbs with huge (~64KB) linear buffers to appear more often.

The reason for that is an old problem in the ring protocol: originally the maximum number of slots was linked to MAX_SKB_FRAGS, as every slot ended up as a frag of the skb. When this value was changed, netback had to cope with the situation by coalescing the packets into fewer frags.

My patch series takes a different approach: the leftover slots (pages) are assigned to a new skb's frags, and that skb is stashed on the frag_list of the first one. Then, before sending it off to the stack, it calls skb = skb_copy_expand(skb, 0, 0, GFP_ATOMIC | __GFP_NOWARN), which basically creates a new skb and copies all the data into it. As far as I understand, it puts everything into the linear buffer, which can amount to 64KB at most. The original skb is then freed, and the new one is sent to the stack.

I suspect that this is the problem, as it only happens when guests send too many slots. Has anyone familiar with these drivers seen such an issue before (when this kind of skb gets stuck in the queue)?

Regards,

Zoltan Kiss
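For readers unfamiliar with where the "transmit queue timed out" warning in the subject comes from, here is a minimal userspace sketch of the condition the generic netdev watchdog tests: a queue that the driver has stopped and that has not transmitted for longer than the device's watchdog timeout. The struct and function names (model_txq, model_netdev, model_watchdog_check) are hypothetical and only model the idea; the real logic lives in dev_watchdog() in net/sched/sch_generic.c and differs in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct model_txq {
    bool stopped;               /* driver stopped the queue, e.g. ring full */
    unsigned long trans_start;  /* tick of the last transmit on this queue */
};

struct model_netdev {
    bool present;
    bool running;
    bool carrier_ok;
    unsigned long watchdog_timeo;   /* allowed stall, in ticks */
    struct model_txq *txq;
    size_t num_tx_queues;
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Return the index of the first timed-out queue, or -1 if there is none. */
int model_watchdog_check(const struct model_netdev *dev, unsigned long now)
{
    /* A device that is absent, down, or without carrier is never flagged. */
    if (!dev->present || !dev->running || !dev->carrier_ok)
        return -1;

    for (size_t i = 0; i < dev->num_tx_queues; i++) {
        const struct model_txq *q = &dev->txq[i];

        /* Stopped and silent for longer than the timeout: report it. */
        if (q->stopped &&
            model_time_after(now, q->trans_start + dev->watchdog_timeo))
            return (int)i;
    }
    return -1;
}
```

Under this model, a lost completion interrupt and a packet the hardware never finishes sending look the same to the watchdog: the queue stays stopped, trans_start stops advancing, and the timeout fires.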
Re: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Nick,

You could try 7.3.21-k8-NAPI in tree, or the out-of-tree version as Bjorn mentioned. Reading and debugging an old driver version is not an interesting thing for somebody to do.

Thanks,
Ethan

On Tue, Dec 3, 2013 at 9:33 PM, Nick Pegg wrote:
> On Mon, Dec 2, 2013 at 10:51 PM, Ethan Zhao wrote:
>> Bjorn,
>>    Seems not the same bug as http://sourceforge.net/p/e1000/bugs/367/
>> , Nick is not running his kernel on bare metal, per the error log,
>> he runs his kernel as HVM DomU guest or Dom0 on XEN ? so just a check
>> of NULL will not fix that.
>>
>
> Sorry, I neglected to say in my original email that the kernel is
> running as a Xen Dom0. Per Todd's request, I've opened a bug report on
> sourceforge and will follow up with this issue there:
> https://sourceforge.net/p/e1000/bugs/385/
>
> Thanks,
> Nick
Re: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
On Mon, Dec 2, 2013 at 10:51 PM, Ethan Zhao wrote:
> Bjorn,
>    Seems not the same bug as http://sourceforge.net/p/e1000/bugs/367/
> , Nick is not running his kernel on bare metal, per the error log,
> he runs his kernel as HVM DomU guest or Dom0 on XEN ? so just a check
> of NULL will not fix that.
>

Sorry, I neglected to say in my original email that the kernel is running as a Xen Dom0. Per Todd's request, I've opened a bug report on sourceforge and will follow up with this issue there:
https://sourceforge.net/p/e1000/bugs/385/

Thanks,
Nick
Re: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Bjorn,

This does not seem to be the same bug as http://sourceforge.net/p/e1000/bugs/367/. Nick is not running his kernel on bare metal; per the error log, is he running it as an HVM DomU guest or as Dom0 on Xen? If so, just a NULL check will not fix that.

Thanks,
Ethan

On Thu, Nov 21, 2013 at 5:22 AM, Bjorn Helgaas wrote:
> [+cc e1000-devel]
>
> On Wed, Nov 20, 2013 at 11:44 AM, Nick Pegg wrote:
>> Hello,
>>
>> I've been seeing some servers hit a condition where they receive a
>> large number of packets (over 500,000 per second, for example) which
>> causes a kernel panic due to a null pointer dereference. I've included
>> the tracebacks below.
>>
>> I have not been able to reproduce this in my lab, but out in the field
>> I've seen this happen with kernel versions 3.7.6 through 3.9.2 (e1000e
>> driver versions 2.1.4-k through 2.2.14-k), running with Intel 82574L
>> NICs.
>>
>> I've seen previous posts to this mailing list suggesting that this is
>> a hardware issue (the mitigation being turning TSO/GSO off), however
>> those tracebacks didn't show the interface getting unexpectedly reset,
>> causing the null pointer dereference. Is this possibly a problem with
>> the e1000e driver where it's not gracefully handling the reset?
>>
>> Let me know if more information is needed. And please CC me in replies
>> since I'm not subscribed to this list. Thanks!
>
> Intel maintains newer drivers out-of-tree at
> http://sourceforge.net/projects/e1000/, and it's possible this is some
> bug that has already been fixed. The current version there looks like
> e1000e-2.5.4, released 2013-09-05.
>
> Possible similar report: http://sourceforge.net/p/e1000/bugs/367/ (no
> real data there).
>
>>
>> Nov 16 07:03:19 rx ------------[ cut here ]------------
>> Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x25b/0x270()
>> Nov 16 07:03:19 rx Hardware name: X8DT6
>> Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
>> Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding ebtable_filter 8021q mrp e1000e ptp pps_core
>> Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
>> Nov 16 07:03:19 rx Call Trace:
>> Nov 16 07:03:19 rx [] warn_slowpath_common+0x7a/0xc0
>> Nov 16 07:03:19 rx [] warn_slowpath_fmt+0x41/0x50
>> Nov 16 07:03:19 rx [] dev_watchdog+0x25b/0x270
>> Nov 16 07:03:19 rx [] ? __netdev_watchdog_up+0x80/0x80
>> Nov 16 07:03:19 rx [] call_timer_fn+0x44/0x120
>> Nov 16 07:03:19 rx [] run_timer_softirq+0x241/0x2b0
>> Nov 16 07:03:19 rx [] ? __netdev_watchdog_up+0x80/0x80
>> Nov 16 07:03:19 rx [] __do_softirq+0xef/0x270
>> Nov 16 07:03:19 rx [] irq_exit+0xb5/0xc0
>> Nov 16 07:03:19 rx [] xen_evtchn_do_upcall+0x2f/0x40
>> Nov 16 07:03:19 rx [] xen_do_hypervisor_callback+0x1e/0x30
>> Nov 16 07:03:19 rx [] ? delay_tsc+0x32/0x80
>> Nov 16 07:03:19 rx [] ? delay_tsc+0x4a/0x80
>> Nov 16 07:03:19 rx [] ? __const_udelay+0x28/0x30
>> Nov 16 07:03:19 rx [] ? e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
>> Nov 16 07:03:19 rx [] ? e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
>> Nov 16 07:03:19 rx [] ? e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
>> Nov 16 07:03:19 rx [] ? e1000e_flush_descriptors+0x96/0x270 [e1000e]
>> Nov 16 07:03:19 rx [] ? e1000_check_phy_82574+0x27/0x60 [e1000e]
>> Nov 16 07:03:19 rx [] ? e1000_watchdog_task+0x648/0x830 [e1000e]
>> Nov 16 07:03:19 rx [] ? __schedule+0x3a7/0x7c0
>> Nov 16 07:03:19 rx [] ? process_one_work+0x16e/0x430
>> Nov 16 07:03:19 rx [] ? worker_thread+0x11c/0x410
>> Nov 16 07:03:19 rx [] ? manage_workers+0x360/0x360
>> Nov 16 07:03:19 rx [] ? kthread+0xc6/0xd0
>> Nov 16 07:03:19 rx [] ? xen_end_context_switch+0x19/0x20
>> Nov 16 07:03:19 rx [] ? kthread_freezable_should_stop+0x70/0x70
>> Nov 16 07:03:19 rx [] ? ret_from_fork+0x7c/0xb0
>> Nov 16 07:03:19 rx [] ? kthread_freezable_should_stop+0x70/0x70
>> Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
>> Nov 16 07:03:19 rx e1000e :03:00.0 eth0: Reset adapter unexpectedly
>> Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer dereference at 00d0
>> Nov 16 07:03:19 rx IP: [] e1000_clean_rx_irq+0x101/0x490 [e1000e]
>> Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
>> Nov 16 07:03:19 rx Oops: [#1] SMP
>> Nov 16 07:03:19 rx
RE: [E1000-devel] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
I'm having difficulty following this issue, most likely because of our email system. Can you file a new bug on sourceforge?

Thanks.

Todd Fujinaka
Software Application Engineer
Networking Division (ND)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565

-----Original Message-----
From: Nick Pegg [mailto:n...@nickpegg.com]
Sent: Monday, December 02, 2013 2:57 PM
To: linux-kernel@vger.kernel.org; e1000-de...@lists.sourceforge.net
Subject: Re: [E1000-devel] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

> Intel maintains newer drivers out-of-tree at
> http://sourceforge.net/projects/e1000/, and it's possible this is some
> bug that has already been fixed. The current version there looks like
> e1000e-2.5.4, released 2013-09-05.
>
> Possible similar report: http://sourceforge.net/p/e1000/bugs/367/ (no
> real data there).

I've looked through the existing bug reports and version changelogs and didn't see anything that seemed very relevant.

I was able to debug the e1000e object file and get the specific code that's bugging out after the interface is unexpectedly reset:

(gdb) l *e1000_clean_rx_irq+0x101
0x19d81 is in e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000e/netdev.c:933).
928             rmb(); /* read descriptor and rx_buffer_info after status DD */
929
930             skb = buffer_info->skb;
931             buffer_info->skb = NULL;
932
933             prefetch(skb->data - NET_IP_ALIGN);
934
935             i++;
936             if (i == rx_ring->count)
937                     i = 0;

The above code is from kernel version 3.9.2 and e1000e driver version 2.2.14-k. Should there be a check here to see if skb is NULL? I checked the latest e1000e release (2.5.4) and there is no check there either (near netdev.c:994).
Re: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
> Intel maintains newer drivers out-of-tree at
> http://sourceforge.net/projects/e1000/, and it's possible this is some
> bug that has already been fixed. The current version there looks like
> e1000e-2.5.4, released 2013-09-05.
>
> Possible similar report: http://sourceforge.net/p/e1000/bugs/367/ (no
> real data there).

I've looked through the existing bug reports and version changelogs and didn't see anything that seemed very relevant.

I was able to debug the e1000e object file and get the specific code that's bugging out after the interface is unexpectedly reset:

(gdb) l *e1000_clean_rx_irq+0x101
0x19d81 is in e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000e/netdev.c:933).
928             rmb(); /* read descriptor and rx_buffer_info after status DD */
929
930             skb = buffer_info->skb;
931             buffer_info->skb = NULL;
932
933             prefetch(skb->data - NET_IP_ALIGN);
934
935             i++;
936             if (i == rx_ring->count)
937                     i = 0;

The above code is from kernel version 3.9.2 and e1000e driver version 2.2.14-k. Should there be a check here to see if skb is NULL? I checked the latest e1000e release (2.5.4) and there is no check there either (near netdev.c:994).
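To make the question concrete, here is a minimal userspace sketch of a ring-cleaning loop that skips a completed slot whose buffer pointer is NULL instead of dereferencing it. It is not the e1000e code and not a proposed patch: model_buf, model_rx_slot, model_rx_ring and model_clean_rx() are hypothetical names, and the "done" flag merely stands in for the descriptor-done status bit. A guard like this only avoids the dereference; it says nothing about why a completed slot has no buffer after the unexpected reset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of an rx ring; not the e1000e structures. */
struct model_buf {
    unsigned char *data;
};

struct model_rx_slot {
    struct model_buf *buf;  /* may be NULL, e.g. after a reset raced with us */
    bool done;              /* stands in for the descriptor-done status bit */
};

struct model_rx_ring {
    struct model_rx_slot *slots;
    unsigned int count;
    unsigned int next_to_clean;
};

/* Walk completed slots starting at next_to_clean. Returns how many
 * buffers were processed; *skipped counts completed slots that had
 * no buffer attached and were passed over instead of dereferenced. */
unsigned int model_clean_rx(struct model_rx_ring *ring, unsigned int budget,
                            unsigned int *skipped)
{
    unsigned int i = ring->next_to_clean;
    unsigned int cleaned = 0;

    *skipped = 0;
    while (ring->slots[i].done && cleaned < budget) {
        struct model_buf *buf = ring->slots[i].buf;

        ring->slots[i].buf = NULL;
        ring->slots[i].done = false;

        if (buf == NULL) {
            /* The guard under discussion: do not touch buf->data. */
            (*skipped)++;
        } else {
            (void)buf->data;    /* where the real code would prefetch */
            cleaned++;
        }

        i++;
        if (i == ring->count)
            i = 0;
    }
    ring->next_to_clean = i;
    return cleaned;
}
```

Counting the skipped slots, rather than silently ignoring them, keeps the anomaly visible, which matters if the underlying cause is the reset path rather than the cleanup loop.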
Re: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Bjorn,

This does not seem to be the same bug as http://sourceforge.net/p/e1000/bugs/367/. Nick is not running his kernel on bare metal: per the error log, he runs it as an HVM DomU guest or as Dom0 on Xen (?), so just a NULL check will not fix that.

Thanks,
Ethan

On Thu, Nov 21, 2013 at 5:22 AM, Bjorn Helgaas <bhelg...@google.com> wrote:
> [+cc e1000-devel]
>
> On Wed, Nov 20, 2013 at 11:44 AM, Nick Pegg <n...@nickpegg.com> wrote:
>> Hello,
>>
>> I've been seeing some servers hit a condition where they receive a
>> large number of packets (over 500,000 per second, for example) which
>> causes a kernel panic due to a null pointer dereference. I've included
>> the tracebacks below.
>>
>> I have not been able to reproduce this in my lab, but out in the field
>> I've seen this happen with kernel versions 3.7.6 through 3.9.2 (e1000e
>> driver versions 2.1.4-k through 2.2.14-k), running with Intel 82574L
>> NICs.
>>
>> I've seen previous posts to this mailing list suggesting that this is a
>> hardware issue (the mitigation being turning TSO/GSO off), however
>> those tracebacks didn't show the interface getting unexpectedly reset,
>> causing the null pointer dereference. Is this possibly a problem with
>> the e1000e driver where it's not gracefully handling the reset?
>>
>> Let me know if more information is needed. And please CC me in replies
>> since I'm not subscribed to this list.
>>
>> Thanks!
>
> Intel maintains newer drivers out-of-tree at
> http://sourceforge.net/projects/e1000/, and it's possible this is some
> bug that has already been fixed. The current version there looks like
> e1000e-2.5.4, released 2013-09-05.
>
> Possible similar report: http://sourceforge.net/p/e1000/bugs/367/ (no
> real data there).
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Hello,

I've been seeing some servers hit a condition where they receive a large number of packets (over 500,000 per second, for example) which causes a kernel panic due to a null pointer dereference. I've included the tracebacks below.

I have not been able to reproduce this in my lab, but out in the field I've seen this happen with kernel versions 3.7.6 through 3.9.2 (e1000e driver versions 2.1.4-k through 2.2.14-k), running with Intel 82574L NICs.

I've seen previous posts to this mailing list suggesting that this is a hardware issue (the mitigation being turning TSO/GSO off), however those tracebacks didn't show the interface getting unexpectedly reset, causing the null pointer dereference. Is this possibly a problem with the e1000e driver where it's not gracefully handling the reset?

Let me know if more information is needed. And please CC me in replies since I'm not subscribed to this list.

Thanks!
-Nick

Nov 16 07:03:19 rx ------------[ cut here ]------------
Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x25b/0x270()
Nov 16 07:03:19 rx Hardware name: X8DT6
Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
Nov 16 07:03:19 rx Call Trace:
Nov 16 07:03:19 rx <IRQ> [8105070a] warn_slowpath_common+0x7a/0xc0
Nov 16 07:03:19 rx [810507f1] warn_slowpath_fmt+0x41/0x50
Nov 16 07:03:19 rx [8168e2ab] dev_watchdog+0x25b/0x270
Nov 16 07:03:19 rx [8168e050] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx [81060464] call_timer_fn+0x44/0x120
Nov 16 07:03:19 rx [81060a71] run_timer_softirq+0x241/0x2b0
Nov 16 07:03:19 rx [8168e050] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx [8105893f] __do_softirq+0xef/0x270
Nov 16 07:03:19 rx [81058bc5] irq_exit+0xb5/0xc0
Nov 16 07:03:19 rx [81435a8f] xen_evtchn_do_upcall+0x2f/0x40
Nov 16 07:03:19 rx [817a24fe] xen_do_hypervisor_callback+0x1e/0x30
Nov 16 07:03:19 rx <EOI> [813b3722] ? delay_tsc+0x32/0x80
Nov 16 07:03:19 rx [813b373a] ? delay_tsc+0x4a/0x80
Nov 16 07:03:19 rx [813b36e8] ? __const_udelay+0x28/0x30
Nov 16 07:03:19 rx [a0025b0e] ? e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
Nov 16 07:03:19 rx [a0018bd0] ? e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
Nov 16 07:03:19 rx [a0027a75] ? e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
Nov 16 07:03:19 rx [a00330d6] ? e1000e_flush_descriptors+0x96/0x270 [e1000e]
Nov 16 07:03:19 rx [a00181b7] ? e1000_check_phy_82574+0x27/0x60 [e1000e]
Nov 16 07:03:19 rx [a0034178] ? e1000_watchdog_task+0x648/0x830 [e1000e]
Nov 16 07:03:19 rx [81797877] ? __schedule+0x3a7/0x7c0
Nov 16 07:03:19 rx [8106c74e] ? process_one_work+0x16e/0x430
Nov 16 07:03:19 rx [8106ea3c] ? worker_thread+0x11c/0x410
Nov 16 07:03:19 rx [8106e920] ? manage_workers+0x360/0x360
Nov 16 07:03:19 rx [810738f6] ? kthread+0xc6/0xd0
Nov 16 07:03:19 rx [81003869] ? xen_end_context_switch+0x19/0x20
Nov 16 07:03:19 rx [81073830] ? kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx [817a10bc] ? ret_from_fork+0x7c/0xb0
Nov 16 07:03:19 rx [81073830] ? kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
Nov 16 07:03:19 rx e1000e :03:00.0 eth0: Reset adapter unexpectedly
Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer dereference at 00d0
Nov 16 07:03:19 rx IP: [a0031d51] e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
Nov 16 07:03:19 rx Oops: [#1] SMP
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx CPU 0
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Tainted: GW 3.9.2-1 #1 Supermicro X8DT6/X8DT6
Nov 16 07:03:19 rx RIP: e030:[a0031d51] [a0031d51] e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx RSP: e02b:88008ea03d98 EFLAGS: 00010202
Nov 16 07:03:19 rx RAX: 001a RBX: c900115f9000 RCX: 88008ea03e64
Nov 16 07:03:19 rx RDX: 88008ea03e64 RSI: 880002dc6e00 RDI: 8800844026c0
Nov 16 07:03:19 rx RBP: 88008ea03e38 R08: 000169e0 R09: ea0b7180
Nov 16 07:03:19 rx R10: 00020063 R11: R12:
Nov 16 07:03:19 rx R13: 880081fb6000 R14: 88008016c700 R15: 880081fb6000
Nov 16 07:03:19 rx FS: 7fde5acb8700() GS:88008ea0() knlGS:
Nov 16 07:03:19 rx CS: e033 DS: ES: CR0: 8005003b
Nov 16 07:03:19 rx CR2: 00d0 CR3: 71e8 CR4: 2660
Nov 16 07:03:19 rx DR0: DR1: DR2:
Nov 16 07:03:19 rx DR3: DR6: 0ff0 DR7: 0400
Nov 16 07:03:19 rx Process kworker/0:0 (pid: 14268, threadinfo 880045f8a000, task 88005c9c)
Re: niu lock-up (Transmit timed out, resetting) and NETDEV WATCHDOG
Hi

> On 26 March 2013 13:44, Andrew Brooks wrote:
>> Using niu driver for this card: Oracle/SUN Multithreaded 10-Gigabit
>> Ethernet Network Controller and after a period the interface will hang
>> with errors every 5 seconds
>> "niu: xxx: eth2: Transmit timed out, resetting"

Here's more information about the problem:

When the interface hangs we see these messages from the driver:

[3408740.816032] niu: niu_interrupt() ldg[8807141d16d0](18) v0[80] v1[0] v2[0]
[3408740.816036] niu :09:00.0: eth2: niu_txchan_intr() cs[b860b86c000]
[3408740.816038] niu :09:00.0: eth2: niu_poll_core() v0[0080]
[3408740.816040] niu :09:00.0: eth2: niu_tx_work() pkt_cnt[0] cons[119]
[3408740.816042] niu: niu_interrupt() ldg[8807141d16d0](18) v0[80] v1[0] v2[0]
[3408740.820004] [sched_delayed] sched: RT throttling activated
[3408740.824021] niu :09:00.0: eth2: Disable interrupts
[3408740.824044] niu :09:00.0: eth2: Disable RX MAC
[3408740.824048] niu :09:00.0: eth2: Disable IPP
[3408740.824054] niu :09:00.0: eth2: Stop TX channels
[3408740.824641] niu :09:00.0: eth2: Stop RX channels
[3408740.824652] niu :09:00.0: eth2: Reset TX channels
[3408740.825212] niu :09:00.0: eth2: Reset RX channels
[3408740.825999] niu :09:00.0: eth2: Initialize TXC
[3408740.826002] niu :09:00.0: eth2: Initialize TX channels

However the interface doesn't recover :-(

Are there any clues there?

Thanks!
Re: niu lock-up (Transmit timed out, resetting) and NETDEV WATCHDOG
On 26 March 2013 13:44, Andrew Brooks wrote:
>
> Using niu driver for this card: Oracle/SUN Multithreaded 10-Gigabit
> Ethernet Network Controller and after a period the interface will hang
> with errors every 5 seconds
> "niu: xxx: eth2: Transmit timed out, resetting"
>
> Sometimes also in syslog are messages
> WARNING: at sch_generic:255 dev_watchdog
> NETDEV WATCHDOG: eth2 (niu): transmit queue 10 timed out

Do you think this could be caused by a problem I've seen reported by other machines on the network
"received unsolicited ack for DL_UNITDATA_REQ on nxge0" ?

Is there some bad packet flying around that causes the niu driver to lock up the kernel?
niu lock-up (Transmit timed out, resetting) and NETDEV WATCHDOG
Hello

Using niu driver for this card: Oracle/SUN Multithreaded 10-Gigabit Ethernet Network Controller
after a period (often less than 24 hours) the interface will hang with errors every 5 seconds
"niu: xxx: eth2: Transmit timed out, resetting"

Sometimes also in syslog are messages
WARNING: at sch_generic:255 dev_watchdog
NETDEV WATCHDOG: eth2 (niu): transmit queue 10 timed out

Does anyone know which driver revision has fixed this problem or if it's still buggy?

Thanks!
Andrew

P.S. My guess is the commit on 2012-10-02 ??

2013-02-04  ethernet: Remove unnecessary alloc/OOM messages, alloc cleanups  Joe Perches  1  -1/+1
2013-01-09  remove init of dev->perm_addr in drivers  Jiri Pirko  1  -26/+20
2012-12-07  drivers/net: fix up function prototypes after __dev* removals  Greg Kroah-Hartman  1  -26/+17
2012-12-03  net/sun: remove __dev* attributes  Bill Pemberton  1  -45/+45
2012-10-07  drivers/net/ethernet/sun/niu.c: fix error return code  Peter Senna Tschudin  1  -0/+1
2012-10-02  Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  Linus Torvalds  1  -1/+1
            [flush_work_sync is now flush_work]
2012-08-23  niu: Use PCI Express Capability accessors  Jiang Liu  1  -12/+7
2012-08-20  workqueue: deprecate flush[_delayed]_work_sync()  Tejun Heo  1  -1/+1
2012-07-23  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value  Shuah Khan  1  -1/+1
2012-07-23  niu: Fix to check for dma mapping errors.
2012-06-08  Revert "niu: Add support for byte queue limits."  David S. Miller  1  -11/+1
2012-05-03  net/niu: remove one superfluous dma mask check  Sebastian Andrzej Siewior  1  -1/+1
2012-02-23  ethernet: unify return value of .ndo_set_mac_address if address is invalid  Danny Kukawka  1  -1/+1
2012-01-31  drivers/net: Remove alloc_etherdev error messages  Joe Perches  1  -3/+1
2011-12-02  niu: Fix typo in comment.  David S. Miller  1  -1/+1
2011-12-02  niu: Add support for byte queue limits.  David S. Miller  1  -1/+11
2011-12-02  niu: Remove redundant PHY ID test.  David S. Miller  1  -2/+4
2011-11-21  net: Change mii to ethtool advertisement function names  Matt Carlson  1  -2/+2
2011-11-16  net: Add ethtool to mii advertisment conversion helpers  Matt Carlson  1  -13/+2
2011-11-14  Sweep additional floors of strcpy in .get_drvinfo routines
2011-10-19  net: add skb frag size accessors  Eric Dumazet  1  -3/+3
2011-10-14  niu: fix skb truesize underestimation  Eric Dumazet  1  -8/+4
2011-09-16  ethtool: Update ethtool_rxnfc::rule_cnt on return from ETHTOOL_GRXCLSRLALL  Ben Hutchings  1  -0/+2
2011-09-16  ethtool: Clean up definitions of rule location arrays in RX NFC  Ben Hutchings  1  -2/+2
2011-09-15  niu: convert to SKB paged frag API.  Ian Campbell  1  -5/+2
2011-08-18  net: introduce IFF_UNICAST_FLT private flag  Jiri Pirko  1  -1/+4
2011-08-11  cassini/niu/sun*: Move the Sun drivers
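For readers unfamiliar with where the "transmit queue N timed out" line comes from: it is printed by the kernel's generic TX watchdog (dev_watchdog() in net/sched/sch_generic.c), not by the niu driver itself. Roughly, the watchdog complains when a transmit queue has stayed stopped for longer than the device's watchdog_timeo while the interface is running with carrier up, and it then calls the driver's timeout handler. The sketch below only models that condition in user space; the `mock_*` names are hypothetical stand-ins, and the real function also handles locking, device presence and re-arming the timer.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel's structures. */
struct mock_txq {
    bool stopped;               /* driver has stopped this queue */
    unsigned long trans_start;  /* tick of the last transmit on this queue */
};

struct mock_netdev {
    bool running;
    bool carrier_ok;
    unsigned long watchdog_timeo;  /* timeout, in ticks */
    unsigned int num_tx_queues;
    struct mock_txq *txq;
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool mock_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Return the index of the first queue the watchdog would flag,
 * or -1 if nothing has timed out (or the check is skipped because
 * the device is down or has no carrier).
 */
static int mock_watchdog_check(const struct mock_netdev *dev, unsigned long now)
{
    unsigned int i;

    if (!dev->running || !dev->carrier_ok)
        return -1;

    for (i = 0; i < dev->num_tx_queues; i++) {
        const struct mock_txq *q = &dev->txq[i];

        if (q->stopped &&
            mock_time_after(now, q->trans_start + dev->watchdog_timeo))
            return (int)i;
    }
    return -1;
}
```

In this model a queue that is stopped but still within its timeout is not flagged, and nothing is flagged once carrier is reported off, which is why the order in which a driver stops its queues and drops carrier can matter.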
BUG: ipheth - NETDEV WATCHDOG: eth1 (ipheth): transmit queue 0 timed out
Hello,

ipheth does not work for me at all with iPhone5 and iOS 6.

Setup:
Thinkpad T60 (32 bit) with openSUSE 12.1 (with current kernel repo)
iPhone5 with iOS 6.1.2
USB cable between laptop and iPhone

The moment I enable Hotspot with the USB cable connected to phone and laptop, I get the following messages:

[Wed Mar 20 04:51:24 2013] usb 1-2: new high-speed USB device number 4 using ehci-pci
[Wed Mar 20 04:51:24 2013] usb 1-2: New USB device found, idVendor=05ac, idProduct=12a8
[Wed Mar 20 04:51:24 2013] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[Wed Mar 20 04:51:24 2013] usb 1-2: Product: iPhone
[Wed Mar 20 04:51:24 2013] usb 1-2: Manufacturer: Apple Inc.
[Wed Mar 20 04:51:24 2013] usb 1-2: SerialNumber: XXX
[Wed Mar 20 04:51:24 2013] ipheth 1-2:4.2: Apple iPhone USB Ethernet device attached
[Wed Mar 20 04:51:24 2013] usbcore: registered new interface driver ipheth
[Wed Mar 20 04:52:46 2013] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[Wed Mar 20 04:55:28 2013] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[Wed Mar 20 04:55:33 2013] ------------[ cut here ]------------
[Wed Mar 20 04:55:33 2013] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.8.3/linux-3.8/net/sched/sch_generic.c:254 dev_watchdog+0x1e0/0x1f0()
[Wed Mar 20 04:55:33 2013] Hardware name: 200763G
[Wed Mar 20 04:55:33 2013] NETDEV WATCHDOG: eth1 (ipheth): transmit queue 0 timed out
[Wed Mar 20 04:55:33 2013] Modules linked in: ipheth md4 md5 nls_utf8 cifs fscache ppdev parport_pc lp parport af_packet rfcomm bnep binfmt_misc tp_smapi(OF) thinkpad_ec(OF) cpufreq_conservative cpufreq_userspace cpufreq_powersave sha256_generic cbc dm_crypt dm_mod snd_hda_codec_analog snd_hda_intel btusb snd_hda_codec bluetooth acpi_cpufreq arc4 pcmcia iwl3945 mperf iwlegacy coretemp snd_hwdep snd_pcm mac80211 kvm_intel iTCO_wdt iTCO_vendor_support snd_timer kvm sg lpc_ich snd_page_alloc mfd_core yenta_socket cfg80211 thinkpad_acpi pcmcia_rsrc pcmcia_core sr_mod cdrom i2c_i801 rfkill snd joydev irda tpm_tis e1000e tpm tpm_bios crc_ccitt microcode video battery soundcore ac button autofs4 radeon ttm drm_kms_helper drm i2c_algo_bit scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh edd fan ata_generic ata_piix thermal processor thermal_sys
[Wed Mar 20 04:55:33 2013] Pid: 0, comm: swapper/0 Tainted: GF O 3.8.3-1-desktop #1
[Wed Mar 20 04:55:33 2013] Call Trace:
[Wed Mar 20 04:55:33 2013] [] try_stack_unwind+0x199/0x1b0
[Wed Mar 20 04:55:33 2013] [] dump_trace+0x47/0xf0
[Wed Mar 20 04:55:33 2013] [] show_trace_log_lvl+0x4b/0x60
[Wed Mar 20 04:55:33 2013] [] show_trace+0x18/0x20
[Wed Mar 20 04:55:33 2013] [] dump_stack+0x6d/0x72
[Wed Mar 20 04:55:33 2013] [] warn_slowpath_common+0x78/0xb0
[Wed Mar 20 04:55:33 2013] [] warn_slowpath_fmt+0x33/0x40
[Wed Mar 20 04:55:33 2013] [] dev_watchdog+0x1e0/0x1f0
[Wed Mar 20 04:55:33 2013] [] call_timer_fn+0x24/0x120
[Wed Mar 20 04:55:33 2013] [] run_timer_softirq+0x1a2/0x240
[Wed Mar 20 04:55:33 2013] [] __do_softirq+0x99/0x1e0
[Wed Mar 20 04:55:33 2013] [] do_softirq+0x76/0xb0
[Wed Mar 20 04:55:33 2013] [<000f>] 0xe
[Wed Mar 20 04:55:33 2013] ---[ end trace dd5a449765340b63 ]---
[Wed Mar 20 04:55:33 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:38 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:48 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:58 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:56:08 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:56:18 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout

The bug also occurs
- when I turn on the hotspot first and plug in the cable afterwards.
- with the vanilla kernel from the same repo (and without loading the tp_smapi module which taints the kernel in the log above).
- for others (https://bugzilla.novell.com/show_bug.cgi?id=779643). That link also contains messages from another system with an earlier kernel, another iPhone model and an earlier iOS.

What other information is required to get this problem resolved?

Thanks
Jörg
--
Joerg Mayer
We are stuck with technology when what we really want is just stuff that works. Some say that should read Microsoft instead of technology.
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jörg Otte : [...] > To summarize: Two net-regressions were introduced in v3.8 (driver r8169): > > 1) NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out > was introduced by commit > e0c075577965d1c01b30038d38bf637b027a1df3 > ("r8169: enable ALDPS for power saving") Hayes Wang authored it. You should ask him why commit e0c075577965d1c01b30038d38bf637b027a1df3 sometimes chokes with the 8168evl. And you can ask him if there is a chance that the non-8168evl that are handled by the patch (mis-)behave the same too. -- Ueimor
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/2/3 Jörg Otte : > 2013/1/6 Jörg Otte : >> 2013/1/5 Francois Romieu : >>> Can you check if things improve with v3.8-rc2 after removing : >>> >>> 2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7 >>>r8169: enable internal ASPM and clock request settings >> >> this fixes a second issue for me: >> In 3.7.1 at startup the link came up after 15 sec.: >> grep r8169 dmesg.3.7.1 >> [1.956842] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded >> [1.957059] r8169 :02:00.0: irq 42 for MSI/MSI-X >> [1.957161] r8169 :02:00.0 eth0: RTL8168evl/8111evl at.. >> [1.957163] r8169 :02:00.0 eth0: jumbo features [frames.. >> [ 13.575452] r8169 :02:00.0 eth0: link down >> [ 13.575475] r8169 :02:00.0 eth0: link down >> [ 15.181317] r8169 :02:00.0 eth0: link up >> >> In 3.8rc the time increased to 24 seconds: >> grep r8169 dmesg.3.8.0 >> [1.852546] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded >> [1.852765] r8169 :02:00.0: irq 42 for MSI/MSI-X >> [1.852872] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... >> [1.852874] r8169 :02:00.0 eth0: jumbo features [frames... >> [ 14.150212] r8169 :02:00.0 eth0: link down >> [ 14.150229] r8169 :02:00.0 eth0: link down >> [ 24.140263] r8169 :02:00.0 eth0: link up >> >> But with this revert I get the old performance: >> dmesg | grep r8169 >> [1.816613] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded >> [1.816832] r8169 :02:00.0: irq 42 for MSI/MSI-X >> [1.816947] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... >> [1.816948] r8169 :02:00.0 eth0: jumbo features [frames... >> [ 13.986401] r8169 :02:00.0 eth0: link down >> [ 13.986422] r8169 :02:00.0 eth0: link down >> [ 15.623631] r8169 :02:00.0 eth0: link up >> >> >>> 3. e0c075577965d1c01b30038d38bf637b027a1df3 >>>r8169: enable ALDPS for power saving >> >> That's it! This fixes the problem for me! >> >> Thanks, Jörg > > > We are closely before v3.8 and I didn't see a solution > so far. > What is the plan regarding this issue(s)? 
> > Thanks, Jörg No response, so I Cc to Linus: To summarize: Two net-regressions were introduced in v3.8 (driver r8169): 1) NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out was introduced by commit e0c075577965d1c01b30038d38bf637b027a1df3 ("r8169: enable ALDPS for power saving") 2) Boot-time increased from 15sec (V3.7) to 24sec (V3.8) by commit: d64ec841517a25f6d468bde9f67e5b4cffdc67c7 ("r8169: enable internal ASPM and clock request settings") Reverting the commits resolves the problems entirely. As long as the issues are not fixed the commits should be reverted. Thanks, Jörg
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/1/6 Jörg Otte : > 2013/1/5 Francois Romieu : >> Can you check if things improve with v3.8-rc2 after removing : >> >> 1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda >>r8169: workaround for missing extended GigaMAC registers >> 2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7 >>r8169: enable internal ASPM and clock request settings > > Doesn't help for this problem. > > However this fixes a second issue for me: > In 3.7.1 at startup the link came up after 15 sec.: > grep r8169 dmesg.3.7.1 > [1.956842] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded > [1.957059] r8169 :02:00.0: irq 42 for MSI/MSI-X > [1.957161] r8169 :02:00.0 eth0: RTL8168evl/8111evl at.. > [1.957163] r8169 :02:00.0 eth0: jumbo features [frames.. > [ 13.575452] r8169 :02:00.0 eth0: link down > [ 13.575475] r8169 :02:00.0 eth0: link down > [ 15.181317] r8169 :02:00.0 eth0: link up > > In 3.8rc the time increased to 24 seconds: > grep r8169 dmesg.3.8.0 > [1.852546] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded > [1.852765] r8169 :02:00.0: irq 42 for MSI/MSI-X > [1.852872] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... > [1.852874] r8169 :02:00.0 eth0: jumbo features [frames... > [ 14.150212] r8169 :02:00.0 eth0: link down > [ 14.150229] r8169 :02:00.0 eth0: link down > [ 24.140263] r8169 :02:00.0 eth0: link up > > But with this revert I get the old performance: > dmesg | grep r8169 > [1.816613] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded > [1.816832] r8169 :02:00.0: irq 42 for MSI/MSI-X > [1.816947] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... > [1.816948] r8169 :02:00.0 eth0: jumbo features [frames... > [ 13.986401] r8169 :02:00.0 eth0: link down > [ 13.986422] r8169 :02:00.0 eth0: link down > [ 15.623631] r8169 :02:00.0 eth0: link up > > Thus I recommend to revert this too. > >> 3. e0c075577965d1c01b30038d38bf637b027a1df3 >>r8169: enable ALDPS for power saving > > That's it! This fixes the problem for me! > We are closely before v3.8 and I didn't see a solution so far. 
What is the plan regarding this issue(s)? Thanks, Jörg
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/1/5 Francois Romieu : > Can you check if things improve with v3.8-rc2 after removing : > > 1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda >r8169: workaround for missing extended GigaMAC registers > 2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7 >r8169: enable internal ASPM and clock request settings Doesn't help for this problem. However this fixes a second issue for me: In 3.7.1 at startup the link came up after 15 sec.: grep r8169 dmesg.3.7.1 [1.956842] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.957059] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.957161] r8169 :02:00.0 eth0: RTL8168evl/8111evl at.. [1.957163] r8169 :02:00.0 eth0: jumbo features [frames.. [ 13.575452] r8169 :02:00.0 eth0: link down [ 13.575475] r8169 :02:00.0 eth0: link down [ 15.181317] r8169 :02:00.0 eth0: link up In 3.8rc the time increased to 24 seconds: grep r8169 dmesg.3.8.0 [1.852546] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.852765] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.852872] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... [1.852874] r8169 :02:00.0 eth0: jumbo features [frames... [ 14.150212] r8169 :02:00.0 eth0: link down [ 14.150229] r8169 :02:00.0 eth0: link down [ 24.140263] r8169 :02:00.0 eth0: link up But with this revert I get the old performance: dmesg | grep r8169 [1.816613] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.816832] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.816947] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... [1.816948] r8169 :02:00.0 eth0: jumbo features [frames... [ 13.986401] r8169 :02:00.0 eth0: link down [ 13.986422] r8169 :02:00.0 eth0: link down [ 15.623631] r8169 :02:00.0 eth0: link up Thus I recommend to revert this too. > 3. e0c075577965d1c01b30038d38bf637b027a1df3 >r8169: enable ALDPS for power saving That's it! This fixes the problem for me! 
Thanks, Jörg
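The timing comparison in the message above is simple arithmetic on the dmesg timestamps. As a rough sketch only (Python, not part of the original thread; the helper name `link_up_delay` and the variable names are made up for illustration), the gap between the last "link down" and the following "link up" can be computed from the three excerpts quoted above. Note that this measures the down-to-up gap, which is a slightly different view from the absolute boot timestamps (15 s vs. 24 s) cited in the mail.

```python
import re

# Matches a bracketed dmesg timestamp followed by an r8169 link state message.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+r8169\b.*\blink (up|down)")


def link_up_delay(dmesg_text):
    """Seconds between the last 'link down' and the first 'link up' after it."""
    last_down = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        timestamp, state = float(match.group(1)), match.group(2)
        if state == "down":
            last_down = timestamp
        elif last_down is not None:
            return round(timestamp - last_down, 3)
    return None  # no down/up pair found


# Link state lines copied from the three dmesg excerpts in the message.
V3_7_1 = """\
[ 13.575452] r8169 :02:00.0 eth0: link down
[ 13.575475] r8169 :02:00.0 eth0: link down
[ 15.181317] r8169 :02:00.0 eth0: link up
"""
V3_8_RC = """\
[ 14.150212] r8169 :02:00.0 eth0: link down
[ 14.150229] r8169 :02:00.0 eth0: link down
[ 24.140263] r8169 :02:00.0 eth0: link up
"""
V3_8_RC_REVERTED = """\
[ 13.986401] r8169 :02:00.0 eth0: link down
[ 13.986422] r8169 :02:00.0 eth0: link down
[ 15.623631] r8169 :02:00.0 eth0: link up
"""

if __name__ == "__main__":
    for name, text in [("3.7.1", V3_7_1), ("3.8-rc", V3_8_RC),
                       ("3.8-rc + revert", V3_8_RC_REVERTED)]:
        print(name, link_up_delay(text))
```

On these excerpts the gap is roughly 1.6 s for 3.7.1 and for the reverted 3.8-rc build, against roughly 10 s for unpatched 3.8-rc, which matches the reporter's observation that the revert restores the old behaviour.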
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/1/5 Francois Romieu rom...@fr.zoreil.com: Can you check if things improve with v3.8-rc2 after removing : 1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda r8169: workaround for missing extended GigaMAC registers 2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7 r8169: enable internal ASPM and clock request settings Doesn't help for this problem. However this fixes a second issue for me: In 3.7.1 at startup the link came up after 15 sec.: grep r8169 dmesg.3.7.1 [1.956842] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.957059] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.957161] r8169 :02:00.0 eth0: RTL8168evl/8111evl at.. [1.957163] r8169 :02:00.0 eth0: jumbo features [frames.. [ 13.575452] r8169 :02:00.0 eth0: link down [ 13.575475] r8169 :02:00.0 eth0: link down [ 15.181317] r8169 :02:00.0 eth0: link up In 3.8rc the time increased to 24 seconds: grep r8169 dmesg.3.8.0 [1.852546] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.852765] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.852872] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... [1.852874] r8169 :02:00.0 eth0: jumbo features [frames... [ 14.150212] r8169 :02:00.0 eth0: link down [ 14.150229] r8169 :02:00.0 eth0: link down [ 24.140263] r8169 :02:00.0 eth0: link up But with this revert I get the old performance: dmesg | grep r8169 [1.816613] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded [1.816832] r8169 :02:00.0: irq 42 for MSI/MSI-X [1.816947] r8169 :02:00.0 eth0: RTL8168evl/8111evl at... [1.816948] r8169 :02:00.0 eth0: jumbo features [frames... [ 13.986401] r8169 :02:00.0 eth0: link down [ 13.986422] r8169 :02:00.0 eth0: link down [ 15.623631] r8169 :02:00.0 eth0: link up Thus I recommend to revert this too. 3. e0c075577965d1c01b30038d38bf637b027a1df3 r8169: enable ALDPS for power saving That's it! This fixes the problem for me! 
Thanks, Jörg -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jörg Otte : [...] > jojo@ahorn:~$ dmesg | grep XID > [1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at > 0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42 Can you check if things improve with v3.8-rc2 after removing : 1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda r8169: workaround for missing extended GigaMAC registers 2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7 r8169: enable internal ASPM and clock request settings 3. e0c075577965d1c01b30038d38bf637b027a1df3 r8169: enable ALDPS for power saving (you can directly try v3.7 r8169.c with v3.8-rc2 if it worked for you so far) If the regression is still there, please apply the patch below to both v3.8-rc2 unpatched and a known working version then send me their dmesg after you 'ip link set dev eth0 up'. diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index ed96f30..3d2d2446 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -90,10 +90,28 @@ static const int multicast_filter_limit = 32; #define RTL8169_TX_TIMEOUT (6*HZ) #define RTL8169_PHY_TIMEOUT(10*HZ) +static void rw8(void __iomem *ioaddr, u8 b) +{ + printk(KERN_DEBUG PFX "w %p %02x\n", ioaddr, b); + writeb(b, ioaddr); +} + +static void rw16(void __iomem *ioaddr, u16 w) +{ + printk(KERN_DEBUG PFX "w %p %04x\n", ioaddr, w); + writew(w, ioaddr); +} + +static void rw32(void __iomem *ioaddr, u32 d) +{ + printk(KERN_DEBUG PFX "w %p %08x\n", ioaddr, d); + writel(d, ioaddr); +} + /* write/read MMIO register */ -#define RTL_W8(reg, val8) writeb ((val8), ioaddr + (reg)) -#define RTL_W16(reg, val16)writew ((val16), ioaddr + (reg)) -#define RTL_W32(reg, val32)writel ((val32), ioaddr + (reg)) +#define RTL_W8(reg, val8) rw8(ioaddr + (reg), (val8)) +#define RTL_W16(reg, val16)rw16(ioaddr + (reg), (val16)) +#define RTL_W32(reg, val32)rw32(ioaddr + (reg), (val32)) #define RTL_R8(reg)readb (ioaddr + (reg)) #define RTL_R16(reg) readw (ioaddr + (reg)) #define RTL_R32(reg) readl (ioaddr + (reg)) -- 
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/1/5 Francois Romieu : > Jörg Otte : > [...] >> It's a regression, it never happened before 3.8-rc. > > Please check that 'dmesg | grep XID' exhibits a 8168evl. jojo@ahorn:~$ dmesg | grep XID [1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at 0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42 jojo@ahorn:~$ > > I'll showe and dig it. It's epidemic. > Thanks, Jörg
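The check requested above amounts to reading the chip name and XID out of the driver's probe line. A minimal sketch of doing that programmatically (Python, not from the thread; `parse_probe_line` and the regular expression are illustrative assumptions based only on the single line quoted in this message, not on a documented log format):

```python
import re

# Assumed layout, taken from the one probe line quoted in the reply:
#   "... ethN: <chip> at <mmio>, <mac>, XID <hex> IRQ <n>"
PROBE_RE = re.compile(r"eth\d+: (?P<chip>\S+) at .*XID (?P<xid>[0-9a-fA-F]+)")


def parse_probe_line(line):
    """Return (chip_name, xid) from an r8169 probe line, or None if absent."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    return match.group("chip"), match.group("xid")


# The probe line as it appears in the message.
PROBE_LINE = ("[1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at "
              "0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42")

if __name__ == "__main__":
    chip, xid = parse_probe_line(PROBE_LINE)
    print(chip, xid, "8168evl" in chip)
```

A plain `dmesg | grep XID`, as used in the thread, is sufficient for a one-off check; a parser like this is only worthwhile when collecting reports from several machines.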
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jörg Otte : [...] > It's a regression, it never happened before 3.8-rc. Please check that 'dmesg | grep XID' exhibits a 8168evl. I'll showe and dig it. It's epidemic. -- Ueimor
[3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
I frequently see the following in the syslog: [ 184.552914] [ cut here ] [ 184.552927] WARNING: at /data/kernel/linux/net/sched/sch_generic.c:254 dev_watchdog+0xf2/0x151() [ 184.552929] Hardware name: LIFEBOOK AH532 [ 184.552932] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [ 184.552937] Pid: 0, comm: swapper/1 Not tainted 3.8.0-rc2-b11-00221-gd1c3ed6 #15 [ 184.552939] Call Trace: [ 184.552941] IRQ [8138d4a2] ? dev_watchdog+0xf2/0x151 [ 184.552953] [81025c8a] ? warn_slowpath_common+0x73/0x87 [ 184.552956] [8138d3b0] ? netif_tx_unlock+0x49/0x49 [ 184.552961] [81025d02] ? warn_slowpath_fmt+0x45/0x4a [ 184.552967] [8138d332] ? netif_tx_lock+0x40/0x75 [ 184.552971] [8138d4a2] ? dev_watchdog+0xf2/0x151 [ 184.552977] [8102f1a1] ? call_timer_fn.isra.32+0x1d/0x73 [ 184.552981] [8102f34b] ? run_timer_softirq+0x154/0x194 [ 184.552988] [8104cb84] ? timekeeping_get_ns.constprop.6+0xd/0x31 [ 184.552992] [8102b4a5] ? __do_softirq+0x96/0x139 [ 184.552997] [8146b00c] ? call_softirq+0x1c/0x26 [ 184.553002] [81003cf4] ? do_softirq+0x2e/0x62 [ 184.553006] [8102b615] ? irq_exit+0x3d/0x98 [ 184.553011] [810184ad] ? smp_apic_timer_interrupt+0x73/0x80 [ 184.553018] [8146aa0a] ? apic_timer_interrupt+0x6a/0x70 [ 184.553020] EOI [81326f2b] ? cpuidle_wrap_enter+0x38/0x69 [ 184.553033] [81326f27] ? cpuidle_wrap_enter+0x34/0x69 [ 184.553039] [81326d81] ? cpuidle_enter_state+0xa/0x31 [ 184.553044] [81326e41] ? cpuidle_idle_call+0x99/0xb9 [ 184.553050] [81009059] ? cpu_idle+0x99/0xe0 [ 184.553056] [8145e3a4] ? start_secondary+0x1d6/0x1dc [ 184.553059] ---[ end trace 54db26a54b22f673 ]--- [ 184.587487] r8169 :02:00.0 eth0: link up It's a regression, it never happened before 3.8-rc. -- Jörg
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jörg Otte jrg.o...@gmail.com : [...] It's a regression, it never happend before 3.8-rc. Please check that 'dmesg | grep XID' exhibits a 8168evl. I'll showe and dig it. It's epidemic. -- Ueimor -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
2013/1/5 Francois Romieu rom...@fr.zoreil.com: Jörg Otte jrg.o...@gmail.com : [...] It's a regression, it never happend before 3.8-rc. Please check that 'dmesg | grep XID' exhibits a 8168evl. jojo@ahorn:~$ dmesg | grep XID [1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at 0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42 jojo@ahorn:~$ I'll showe and dig it. It's epidemic. Thanks, Jörg -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jörg Otte jrg.o...@gmail.com :
[...]
jojo@ahorn:~$ dmesg | grep XID
[1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at 0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42

Can you check if things improve with v3.8-rc2 after removing :
1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda
   r8169: workaround for missing extended GigaMAC registers
2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7
   r8169: enable internal ASPM and clock request settings
3. e0c075577965d1c01b30038d38bf637b027a1df3
   r8169: enable ALDPS for power saving

(you can directly try v3.7 r8169.c with v3.8-rc2 if it worked for you so far)

If the regression is still there, please apply the patch below to both
v3.8-rc2 unpatched and a known working version, then send me their dmesg
after you 'ip link set dev eth0 up'.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index ed96f30..3d2d2446 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -90,10 +90,28 @@ static const int multicast_filter_limit = 32;
 #define RTL8169_TX_TIMEOUT	(6*HZ)
 #define RTL8169_PHY_TIMEOUT	(10*HZ)
 
+static void rw8(void __iomem *ioaddr, u8 b)
+{
+	printk(KERN_DEBUG PFX "w %p %02x\n", ioaddr, b);
+	writeb(b, ioaddr);
+}
+
+static void rw16(void __iomem *ioaddr, u16 w)
+{
+	printk(KERN_DEBUG PFX "w %p %04x\n", ioaddr, w);
+	writew(w, ioaddr);
+}
+
+static void rw32(void __iomem *ioaddr, u32 d)
+{
+	printk(KERN_DEBUG PFX "w %p %08x\n", ioaddr, d);
+	writel(d, ioaddr);
+}
+
 /* write/read MMIO register */
-#define RTL_W8(reg, val8)	writeb ((val8), ioaddr + (reg))
-#define RTL_W16(reg, val16)	writew ((val16), ioaddr + (reg))
-#define RTL_W32(reg, val32)	writel ((val32), ioaddr + (reg))
+#define RTL_W8(reg, val8)	rw8(ioaddr + (reg), (val8))
+#define RTL_W16(reg, val16)	rw16(ioaddr + (reg), (val16))
+#define RTL_W32(reg, val32)	rw32(ioaddr + (reg), (val32))
 #define RTL_R8(reg)		readb (ioaddr + (reg))
 #define RTL_R16(reg)		readw (ioaddr + (reg))
 #define RTL_R32(reg)		readl (ioaddr + (reg))
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
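The debugging patch above routes every register write through a small function that logs the address and value before performing the write, so two dmesg traces (working and broken kernel) can be compared. The following is only a user-space sketch of that idea, not driver code: `fake_regs`, `write_count` and `demo_init` are hypothetical names, the register offsets are arbitrary, and a plain byte array stands in for the MMIO window.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the device's MMIO window (user space only). */
static uint8_t fake_regs[256];
static unsigned int write_count;

/* Log the write first, then perform it, in the same order as rw8() in the patch. */
static void rw8(uint8_t *addr, uint8_t b)
{
	fprintf(stderr, "w %p %02x\n", (void *)addr, (unsigned int)b);
	*addr = b;
	write_count++;
}

/* 16-bit variant; the value is stored little-endian in the fake window. */
static void rw16(uint8_t *addr, uint16_t w)
{
	fprintf(stderr, "w %p %04x\n", (void *)addr, (unsigned int)w);
	addr[0] = (uint8_t)(w & 0xff);
	addr[1] = (uint8_t)(w >> 8);
	write_count++;
}

/* Same macro shape as in the patch: callers keep using RTL_W8/RTL_W16. */
#define RTL_W8(reg, val8)	rw8(ioaddr + (reg), (val8))
#define RTL_W16(reg, val16)	rw16(ioaddr + (reg), (val16))

/* Hypothetical init sequence; returns how many writes were traced. */
unsigned int demo_init(void)
{
	uint8_t *ioaddr = fake_regs;

	assert(ioaddr != NULL);
	write_count = 0;
	RTL_W8(0x37, 0x0c);	/* arbitrary offset and value */
	RTL_W16(0x3c, 0xffff);	/* arbitrary offset and value */
	return write_count;
}
```

Because only the macros change, the call sites stay untouched, and the trace on stderr lists the writes in the order they were issued.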
Re: NETDEV WATCHDOG (3Com 3c905 adapter)
Steffen Klassert wrote:
> On Fri, Nov 23, 2007 at 04:52:39PM +0100, BERTRAND Joël wrote:
> > BERTRAND Joël wrote:
> > >Hello,
> > >
> > >Since I have installed a 2.6.23.1 linux kernel on my U60, I can see
> > >several NETDEV WATCHDOG. This trouble never occurs with 2.6.23-rc4.
> > >This bug occurs after a random uptime.
> >
> > I have made the same observation this evening on an amd64/up with two
> > 3C905 and a 2.6.21.3 linux kernel... This evening, I have rebooted my
> > U60 with a 2.6.23.8 kernel and the same 3C905 runs fine. Wait and see
> > (but I suspect a kernel bug...).
> >
> > For main linux kernel list:
> >
> > >End of dmesg :
> > >
> > >NETDEV WATCHDOG: eth2: transmit timed out
> > >eth2: transmit timed out, tx_status 00 status 8601.
> > >  diagnostics: net 0ccc media 8880 dma 003a fifo
> > >eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
>
> This looks like a problem that several network drivers had recently.
> The problem was genirq related and should be fixed in 2.6.23, so I'm
> surprised that this happens with 2.6.23.1 but not with 2.6.23-rc4.

I can confirm that my U60 worked fine for sixty days with 2.6.23-rc4
without any trouble. With 2.6.23.1, the 3com NIC hangs after three or
four days...

Regards,

JKB
Re: NETDEV WATCHDOG (3Com 3c905 adapter)
On Fri, Nov 23, 2007 at 04:52:39PM +0100, BERTRAND Joël wrote:
> BERTRAND Joël wrote:
> >Hello,
> >
> >Since I have installed a 2.6.23.1 linux kernel on my U60, I can see
> >several NETDEV WATCHDOG. This trouble never occurs with 2.6.23-rc4.
> >This bug occurs after a random uptime.
>
> I have made the same observation this evening on an amd64/up with two
> 3C905 and a 2.6.21.3 linux kernel... This evening, I have rebooted my
> U60 with a 2.6.23.8 kernel and the same 3C905 runs fine. Wait and see
> (but I suspect a kernel bug...).
>
> For main linux kernel list:
>
> >End of dmesg :
> >
> >NETDEV WATCHDOG: eth2: transmit timed out
> >eth2: transmit timed out, tx_status 00 status 8601.
> >  diagnostics: net 0ccc media 8880 dma 003a fifo
> >eth2: Interrupt posted but not delivered -- IRQ blocked by another device?

This looks like a problem that several network drivers had recently.
The problem was genirq related and should be fixed in 2.6.23, so I'm
surprised that this happens with 2.6.23.1 but not with 2.6.23-rc4.

Steffen
Re: NETDEV WATCHDOG (3Com 3c905 adapter)
BERTRAND Joël wrote:

Hello,

Since I have installed a 2.6.23.1 linux kernel on my U60, I can see
several NETDEV WATCHDOG. This trouble never occurs with 2.6.23-rc4.
This bug occurs after a random uptime.

I have made the same observation this evening on an amd64/up with two
3C905 and a 2.6.21.3 linux kernel... This evening, I have rebooted my
U60 with a 2.6.23.8 kernel and the same 3C905 runs fine. Wait and see
(but I suspect a kernel bug...).

For main linux kernel list:

End of dmesg :

NETDEV WATCHDOG: eth2: transmit timed out
eth2: transmit timed out, tx_status 00 status 8601.
  diagnostics: net 0ccc media 8880 dma 003a fifo
eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 144(0) current 144(0)
  Transmit list vs. f800a06b0200.
  0: @f800a06b0200 length 804a status 0001004a
  1: @f800a06b0260 length 8062 status 00010062
  2: @f800a06b02c0 length 8062 status 00010062
  3: @f800a06b0320 length 8062 status 00010062
  4: @f800a06b0380 length 8062 status 00010062
  5: @f800a06b03e0 length 8062 status 00010062
  6: @f800a06b0440 length 8062 status 00010062
  7: @f800a06b04a0 length 8062 status 00010062
  8: @f800a06b0500 length 8062 status 00010062
  9: @f800a06b0560 length 8062 status 00010062
  10: @f800a06b05c0 length 8062 status 00010062
  11: @f800a06b0620 length 8062 status 00010062
  12: @f800a06b0680 length 8062 status 00010062
  13: @f800a06b06e0 length 804a status 0001004a
  14: @f800a06b0740 length 8062 status 80010062
  15: @f800a06b07a0 length 8062 status 80010062
eth2: Resetting the Tx ring pointer.
eth2: setting full-duplex.

Root rayleigh:[/proc] > cat interrupts
      CPU0 CPU2
  0: 318071624 318071584 timer
  1: 0 0 sun4u PSYCHO_PCIERR
  2: 0 0 sun4u PSYCHO_UE
  3: 0 0 sun4u PSYCHO_CE
  8: 559916 0 sun4u su(kbd)
  9: 04170845 sun4u su(mouse)
 10: 0 0 sun4u parport0
 11: 2 0 sun4u floppy
 12: 0 0 sun4u cs4231(capture)
 13: 789858 0 sun4u cs4231(play)
 14: 1329 17123911 sun4u eth0
 15: 0 15905664 sun4u sym53c8xx
 16: 30 0 sun4u sym53c8xx
 17: 364465 223693 sun4u eth2
 18: 11043701 0 sun4u aic7xxx
 19: 1 50585 sun4u ohci_hcd:usb2
 20: 0 0 sun4u ohci_hcd:usb3
 21: 1 1 sun4u ehci_hcd:usb1
 22: 0 0 sun4u PSYCHO_PCIERR
 24: 17346674332 sun4u eth1

Root rayleigh:[/proc] > lspci
:00:00.0 Host bridge: Sun Microsystems Computer Corp. Psycho PCI Bus Module
:00:01.0 Bridge: Sun Microsystems Computer Corp. EBUS (rev 01)
:00:01.1 Ethernet controller: Sun Microsystems Computer Corp. Happy Meal (rev 01)
:00:02.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
:00:04.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
:00:05.0 USB Controller: NEC Corporation USB (rev 43)
:00:05.1 USB Controller: NEC Corporation USB (rev 43)
:00:05.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
0001:00:00.0 Host bridge: Sun Microsystems Computer Corp. Psycho PCI Bus Module
0001:80:01.0 Bridge: Sun Microsystems Computer Corp. EBUS (rev 01)
0001:80:01.1 Ethernet controller: Sun Microsystems Computer Corp. Happy Meal (rev 01)

Root rayleigh:[/proc] > cat irq/17/smp_affinity
f

Root rayleigh:[/proc/net] > cat dev
Inter-| Receive | Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
 eth1:9421385785 9787261000 0 0 0 2418854910 7465439000 0 0 0
 eth0:3977175515 8320824000 0 0 0 8252404266 9204851000 0 0 0
 eth2:47410723 34451500 51 0 0 0 31875309 278022 87200 0 0 0
   lo:240810303 1749211000 0 0 0 240810303 1749211000 0 0 0

Regards,

JKB
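The `smp_affinity` value shown above is a hexadecimal CPU bitmask: bit N set means the IRQ may be delivered to CPU N. As a rough illustration only, the sketch below decodes such a mask into a list of CPU numbers. The function name `affinity_to_cpus` is made up for this example, and it assumes the mask fits in 64 bits (comma group separators are skipped).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode a hexadecimal affinity mask such as "f" into CPU numbers.
 * Stores at most max_cpus entries in cpus[] and returns how many were
 * stored, or -1 if the string is empty, malformed, or longer than
 * 16 hex digits (this sketch only handles 64-bit masks).
 */
int affinity_to_cpus(const char *mask, int *cpus, int max_cpus)
{
	unsigned long long bits = 0;
	int digits = 0;
	int n = 0;

	assert(mask != NULL && cpus != NULL);

	for (const char *p = mask; *p != '\0' && *p != '\n'; p++) {
		int v;

		if (*p == ',')		/* separator between 32-bit groups */
			continue;
		if (*p >= '0' && *p <= '9')
			v = *p - '0';
		else if (*p >= 'a' && *p <= 'f')
			v = *p - 'a' + 10;
		else if (*p >= 'A' && *p <= 'F')
			v = *p - 'A' + 10;
		else
			return -1;
		if (++digits > 16)
			return -1;
		bits = (bits << 4) | (unsigned long long)v;
	}
	if (digits == 0)
		return -1;

	/* Walk the mask from bit 0 upward and record each set bit. */
	for (int cpu = 0; cpu < 64 && n < max_cpus; cpu++)
		if (bits & (1ULL << cpu))
			cpus[n++] = cpu;
	return n;
}
```

With the mask `f` from the report, this yields CPUs 0 through 3, whereas the `/proc/interrupts` output in the same message lists only CPU0 and CPU2.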
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
Hi,

after reading about issues with the NICs on Kontron boards I did a BIOS
upgrade, but this did not change anything. However, yesterday the
(onboard) NIC I used died: no link at all. After switching to the next
onboard NIC I got a NETDEV transmit timeout with that one on kernel
2.6.22-r2. It seems the whole thing is a hardware issue. I will try to
figure it out with Kontron.
Sorry :(

Karl

2007/9/12, Francois Romieu <[EMAIL PROTECTED]>:
> Karl Meyer <[EMAIL PROTECTED]> :
> [...]
> > I have been looking into this issue for some time now, but there were no
> > errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
> > officially), I also ran git-bisect (for more information see the older
> > messages in this thread).
>
> 2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
> 0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
> Thus it is not surprising that it works.
>
> Any update regarding the patchkit that I sent on 2007/08/16 ?
>
> It would help to narrow down the culprit.
>
> --
> Ueimor
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
Hi Francois,

this is what I found and sent:
The error exists from patch 2 on. I did some network testing with
patch 1 and currently use it and have no errors so far.
From my experience up to now, patch 1 should be error free.

Do you need additional info?

2007/9/12, Francois Romieu <[EMAIL PROTECTED]>:
> Karl Meyer <[EMAIL PROTECTED]> :
> [...]
> > I have been looking into this issue for some time now, but there were no
> > errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
> > officially), I also ran git-bisect (for more information see the older
> > messages in this thread).
>
> 2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
> 0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
> Thus it is not surprising that it works.
>
> Any update regarding the patchkit that I sent on 2007/08/16 ?
>
> It would help to narrow down the culprit.
>
> --
> Ueimor
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
Karl Meyer <[EMAIL PROTECTED]> :
[...]
> I have been looking into this issue for some time now, but there were no
> errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
> officially), I also ran git-bisect (for more information see the older
> messages in this thread).

2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
Thus it is not surprising that it works.

Any update regarding the patchkit that I sent on 2007/08/16 ?

It would help to narrow down the culprit.

--
Ueimor
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
Hi,

I have been looking into this issue for some time now, but there were no
errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
officially). I also ran git-bisect (for more information see the older
messages in this thread).

2007/9/3, Michal Piotrowski <[EMAIL PROTECTED]>:
> Hi,
>
> On 01/09/07, Karl Meyer <[EMAIL PROTECTED]> wrote:
> > This is what happened today:
> >
> > Sep 1 21:08:01 frege NETDEV WATCHDOG: eth0: transmit timed out
> > frege ~ # uname -r
> > 2.6.22.5-cfs-v20.5
>
> Can you reproduce this on 2.6.22 (not 2.6.22.x - it might be a -stable
> regression)?
>
> Regards,
> Michal
>
> --
> LOG
> http://www.stardust.webpages.pl/log/
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
Hi,

On 01/09/07, Karl Meyer <[EMAIL PROTECTED]> wrote:
> This is what happened today:
>
> Sep 1 21:08:01 frege NETDEV WATCHDOG: eth0: transmit timed out
> frege ~ # uname -r
> 2.6.22.5-cfs-v20.5

Can you reproduce this on 2.6.22 (not 2.6.22.x - it might be a -stable
regression)?

Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
This is what happened today:

Sep 1 21:08:01 frege NETDEV WATCHDOG: eth0: transmit timed out
frege ~ # uname -r
2.6.22.5-cfs-v20.5

2007/8/16, Francois Romieu <[EMAIL PROTECTED]>:
> (please do not remove the netdev Cc:)
>
> Francois Romieu <[EMAIL PROTECTED]> :
> [...]
> > If it does not work I'll dissect 0e4851502f846b13b29b7f88f1250c980d57e944
> > tomorrow.
>
> You will find a tgz archive in attachment which contains a series of patches
> (0001-... to 0005-...) to walk from 6dccd16b7c2703e8bbf8bca62b5cf248332afbe2
> to 0e4851502f846b13b29b7f88f1250c980d57e944 in smaller steps.
>
> Please apply 0001 on top of 6dccd16b7c2703e8bbf8bca62b5cf248332afbe2. If it
> still works, apply 0002 on top of 0001, etc.
>
> --
> Ueimor
Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"
On 21-08-2007 12:56, Karl Meyer wrote:
> fyi:
> I do not know whether it is related to the problem, but since using
> the version you told me there are these entries in my log:
> frege Hangcheck: hangcheck value past margin!
...

BTW, I don't know whether it's related either, but I think you should
first try to get rid of these errors:

> Freeing unused kernel memory: 220k freed
> usb_id[1320]: segfault at eip b7e25db2 esp bfd1d734 error 4
> usb_id[1329]: segfault at eip b7e1bdb2 esp bf9c9224 error 4
> usb_id[1322]: segfault at eip b7df3db2 esp bfcb66c4 error 4
> usb_id[1321]: segfault at eip b7e11db2 esp bf8f4b04 error 4

Regards,
Jarek P.