Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
On Mon, Jul 02, 2007 at 09:52:26AM +0200, Jarek Poplawski wrote: From my recent patch: #1 Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work() required a work function should always (unconditionally) rearm with delay 0 - otherwise it would endlessly loop. This patch replaces this function with cancel_delayed_work(). Later kernel versions don't require this, so here it's only for uniformity. But Oleg Nesterov [EMAIL PROTECTED] found: But 2.6.22 doesn't need this change, why it was merged? In fact, I suspect this change adds a race, ... His description was right (thanks), so this patch reverts #1. Signed-off-by: Jarek Poplawski [EMAIL PROTECTED] Oleg, I think maybe you could ack these 2 netconsole patches... They were done on your request but it looks like Andrew is waiting on something... Thanks, Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
From: Jarek Poplawski [EMAIL PROTECTED] Date: Wed, 4 Jul 2007 08:41:59 +0200 On Mon, Jul 02, 2007 at 09:52:26AM +0200, Jarek Poplawski wrote: From my recent patch: #1 Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work() required a work function should always (unconditionally) rearm with delay 0 - otherwise it would endlessly loop. This patch replaces this function with cancel_delayed_work(). Later kernel versions don't require this, so here it's only for uniformity. But Oleg Nesterov [EMAIL PROTECTED] found: But 2.6.22 doesn't need this change, why it was merged? In fact, I suspect this change adds a race, ... His description was right (thanks), so this patch reverts #1. Signed-off-by: Jarek Poplawski [EMAIL PROTECTED] Oleg, I think maybe you could ack these 2 netconsole patches... They were done on your request but it looks like Andrew is waiting on something... I plan to apply this patch, don't worry about it :) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
On Tue, Jul 03, 2007 at 11:47:18PM -0700, David Miller wrote: ... I plan to apply this patch, don't worry about it :) Now I'm really worried! Don't you evere sleep? Good night, Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
On Wed, Jul 04, 2007 at 08:41:59AM +0200, Jarek Poplawski wrote: ... They were done on your request but it looks like Andrew is waiting on something... Andrew, This time I'm not sorry for my English because I've just found I could speak Chiefly Midland and Southern U.S.. Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETPOLL] netconsole: fix soft lockup when removing module
On Sun, Jul 01, 2007 at 09:35:58PM +0400, Oleg Nesterov wrote: Jarek Poplawski wrote: #1 Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work() required a work function should always (unconditionally) rearm with delay 0 - otherwise it would endlessly loop. This patch replaces this function with cancel_delayed_work(). Later kernel versions don't require this, so here it's only for uniformity. But 2.6.22 doesn't need this change, why it was merged? One bad reason is given above. Should I look for another one? In fact, I suspect this change adds a race, You are right! --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work) netif_tx_unlock(dev); local_irq_restore(flags); - schedule_delayed_work(npinfo-tx_work, HZ/10); + if (atomic_read(npinfo-refcnt)) + schedule_delayed_work(npinfo-tx_work, HZ/10); return; } netif_tx_unlock(dev); @@ -785,9 +786,15 @@ void netpoll_cleanup(struct netpoll *np) if (atomic_dec_and_test(npinfo-refcnt)) { skb_queue_purge(npinfo-arp_tx); skb_queue_purge(npinfo-txq); - cancel_rearming_delayed_work(npinfo-tx_work); + cancel_delayed_work(npinfo-tx_work); flush_scheduled_work(); Suppose that -refcnt == 1, and queue_process() was preempted just after atomic_read(npinfo-refcnt). netpoll_cleanup() comes, cancel_delayed_work() does nothing, flush_scheduled_work() sleeps. queue_process() gets CPU, re-schedules -tx_work, and returns. flush_scheduled_work() completes, netpoll_cleanup() frees npinfo and returns while -tx_work is pending. No? No no. (Yes!) I had some doubts about this, and you found very good reason for this. I'll soon send a patch to restore cancel_rearming_delayed_work in 2.6.22. So, 2.6.21 needs something better (maybe you've found it btw.?), but they weren't too interested, anyway. Thanks very much regards, Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
From my recent patch: #1 Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work() required a work function should always (unconditionally) rearm with delay 0 - otherwise it would endlessly loop. This patch replaces this function with cancel_delayed_work(). Later kernel versions don't require this, so here it's only for uniformity. But Oleg Nesterov [EMAIL PROTECTED] found: But 2.6.22 doesn't need this change, why it was merged? In fact, I suspect this change adds a race, ... His description was right (thanks), so this patch reverts #1. Signed-off-by: Jarek Poplawski [EMAIL PROTECTED] --- diff -Nurp 2.6.22-rc7-/net/core/netpoll.c 2.6.22-rc7/net/core/netpoll.c --- 2.6.22-rc7-/net/core/netpoll.c 2007-07-02 09:03:27.0 +0200 +++ 2.6.22-rc7/net/core/netpoll.c 2007-07-02 09:32:34.0 +0200 @@ -72,8 +72,7 @@ static void queue_process(struct work_st netif_tx_unlock(dev); local_irq_restore(flags); - if (atomic_read(npinfo-refcnt)) - schedule_delayed_work(npinfo-tx_work, HZ/10); + schedule_delayed_work(npinfo-tx_work, HZ/10); return; } netif_tx_unlock(dev); @@ -786,7 +785,7 @@ void netpoll_cleanup(struct netpoll *np) if (atomic_dec_and_test(npinfo-refcnt)) { skb_queue_purge(npinfo-arp_tx); skb_queue_purge(npinfo-txq); - cancel_delayed_work(npinfo-tx_work); + cancel_rearming_delayed_work(npinfo-tx_work); flush_scheduled_work(); /* clean after last, unfinished work */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Re: [NETPOLL] netconsole: fix soft lockup when removing module
On 07/02, Jarek Poplawski wrote: diff -Nurp 2.6.22-rc7-/net/core/netpoll.c 2.6.22-rc7/net/core/netpoll.c --- 2.6.22-rc7-/net/core/netpoll.c2007-07-02 09:03:27.0 +0200 +++ 2.6.22-rc7/net/core/netpoll.c 2007-07-02 09:32:34.0 +0200 @@ -72,8 +72,7 @@ static void queue_process(struct work_st netif_tx_unlock(dev); local_irq_restore(flags); - if (atomic_read(npinfo-refcnt)) - schedule_delayed_work(npinfo-tx_work, HZ/10); + schedule_delayed_work(npinfo-tx_work, HZ/10); return; } netif_tx_unlock(dev); @@ -786,7 +785,7 @@ void netpoll_cleanup(struct netpoll *np) if (atomic_dec_and_test(npinfo-refcnt)) { skb_queue_purge(npinfo-arp_tx); skb_queue_purge(npinfo-txq); - cancel_delayed_work(npinfo-tx_work); + cancel_rearming_delayed_work(npinfo-tx_work); flush_scheduled_work(); While you are here, could you also delete this flush_scheduled_work() ? It is not needed any longer. Oleg. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETPOLL] netconsole: fix soft lockup when removing module
On 07/02, Jarek Poplawski wrote: --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work) netif_tx_unlock(dev); local_irq_restore(flags); - schedule_delayed_work(npinfo-tx_work, HZ/10); + if (atomic_read(npinfo-refcnt)) + schedule_delayed_work(npinfo-tx_work, HZ/10); return; } [...snip...] So, 2.6.21 needs something better (maybe you've found it btw.?), but they weren't too interested, anyway. We can do a double flush trick. If queue_process() checks -refcnt before schedule_delayed_work() like above, netpoll_cleanup() can do flush_scheduled_work(); // the next invocation of queue_process() // must see -refcnt == 0 if (!cancel_delayed_work(npinfo-tx_work)) { /* may be queued, wait for completion */ flush_scheduled_work(); } Jarek, I don't understand net/, a silly question. Why do we need the #2 chunk? Isn't it better to move skb_queue_purge(npinfo-txq) after cancel_..._work() instead? Oleg. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETPOLL] netconsole: fix soft lockup when removing module
On Mon, Jul 02, 2007 at 01:24:08PM +0400, Oleg Nesterov wrote: On 07/02, Jarek Poplawski wrote: --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -72,7 +72,8 @@ static void queue_process(struct work_struct *work) netif_tx_unlock(dev); local_irq_restore(flags); - schedule_delayed_work(npinfo-tx_work, HZ/10); + if (atomic_read(npinfo-refcnt)) + schedule_delayed_work(npinfo-tx_work, HZ/10); return; } [...snip...] So, 2.6.21 needs something better (maybe you've found it btw.?), but they weren't too interested, anyway. We can do a double flush trick. If queue_process() checks -refcnt before schedule_delayed_work() like above, netpoll_cleanup() can do flush_scheduled_work(); // the next invocation of queue_process() // must see -refcnt == 0 if (!cancel_delayed_work(npinfo-tx_work)) { /* may be queued, wait for completion */ flush_scheduled_work(); } I'll try to think about it later, but I don't plan to do next patch, so feel free to send this. I didn't plan to fix netpoll at all (I never didn't use nor studied this before...). But couldn't stand this stupid lockup stays in 2.6.21. Now, I see it probably doesn't annoy more than 2 or 3 people around... Jarek, I don't understand net/, a silly question. Why do we need the #2 chunk? Isn't it better to move skb_queue_purge(npinfo-txq) after cancel_..._work() instead? I've thought about this too, but because I don't know netpoll/netconsole enough I didn't want to change more than minimum needed. skb_queue_purge() uses heavy locking (irqsave) and I don't remember now if I've found the reason or simply believed somebody had to have a reason to do this there, anyway, if moved after cancel_ it could be done without this locking, and something like while () instead of my if () should be enough. But there was not much interest about this patch, and I'm not currently interested to be the main netconsole expert too, so maybe you would like to try... Cheers, Jarek P. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html