Re: [PATCH] netpoll can lock up on low memory.
On Sat, Aug 06, 2005 at 05:57:20AM -0400, Steven Rostedt wrote: > On Sat, 2005-08-06 at 02:46 -0700, David S. Miller wrote: > > Can you guys stop peeing your pants over this, put aside > > your differences, and work on a mutually acceptable fix > > for these bugs? > > > > Much appreciated, thanks :-) > > In my last email, I stated that this discussion seems to have > demonstrated that the e1000 driver's netpoll is indeed broken, and needs > to be fixed. I submitted eariler a patch for this, but it's untested > and someone who owns an e1000 needs to try it. I've got your e1000 change in my queue and I'll try to test it tomorrow (realized I've got e1000 in my laptop). > As for all the netpoll issues, I'm satisfied with whatever you guys > decide. But I've seen lots of problems posted over the netpoll and > e1000, where people send in patches that do everything but fix the > e1000, and that's where I chimed in. Andi's patch looks like it fixes a related but slightly different problem. I'm working on a variant. And I'll try to make the skb allocation code eventually give up too. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Sat, Aug 06, 2005 at 09:58:27AM +0200, Ingo Molnar wrote: > > btw., the current NR_SKBS 32 in netpoll.c seems quite low, especially > e1000 can have a whole lot more skbs queued at once. Might be more > robust to increase it to 128 or 256? Not sure that the card's queueing really makes a difference. It either eventually releases the queued SKBs or it doesn't. What's more important is that we be able to survive bursts like the output of sysrq-t. This seems to work already. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
Steven Rostedt wrote: In my last email, I stated that this discussion seems to have demonstrated that the e1000 driver's netpoll is indeed broken, and needs to be fixed. I submitted eariler a patch for this, but it's untested and someone who owns an e1000 needs to try it. I can test this, but not right now: Im trying, again, to find my hard lockup issue, and so I will try to run this machine until it locks up. It lasted 9 days at one time, so it could potentially take some time, I'm afraid. --- John Bäckstrand - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Sat, Aug 06, 2005 at 09:45:03AM +0200, Ingo Molnar wrote: > > * Andi Kleen <[EMAIL PROTECTED]> wrote: > > > On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote: > > > The netpoll philosophy is to assume that its traffic is an absolute > > > priority - it is better to potentially hang trying to deliver a panic > > > message than to give up and crash silently. > > > > That would be ok if netpoll was only used to deliver panics. But it is > > not. It delivers all messages, and you cannot hang the kernel during > > that. Actually even for panics it is wrong, because often it is more > > important to reboot in a panic than (with a panic timeout) to actually > > deliver the panic. That's needed e.g. in a failover cluster. > > without going into the merits of this discussion, reliable failover > clusters must include (and do include) an external ability to cut power. > No amount of in-kernel logic will prevent the kernel from hanging, given > a bad enough kernel bug. Ok, true, but we should do a best effort. > > So the right question is not 'can we prevent the kernel from hanging, > ever' (we cannot), but 'which change makes it less likely for the kernel > to hang'. (and, obviously: assuming all other kernel components are > functioning per specification, netpoll itself most not hang :-) > > even a plain printk to VGA can hang in certain kernel crashes. Netpoll > is more complex and thus has more exposure to hangs. E.g. netpoll relies > on the network driver to correctly recycle skbs within a bound amount of > time. If the network driver leaks skbs, it's game over for netpoll. I don't think we even need to think about such rare cases, until the easy cases ("everything hangs when the cable is pulled") are not fixed. > [ i'd prefer a hang over nondeterministic behavior, and e.g. losing > console messages is sure nondeterministic behavior. What if the > console message is "WARNING: the box has just been broken into"? ] That just makes netconsole useless in production. If it causes frequenet hangs people will not use it. > > we could do one thing (see the patch below): i think it would be useful > to fill up the netlogging skb queue straight at initialization time. > Especially if netpoll is used for dumping alone, the system might not be > in a situation to fill up the queue at the point of crash, so better be > a bit more prepared and keep the pipeline filled. You're solving a completely different issue here? -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Sat, 2005-08-06 at 02:46 -0700, David S. Miller wrote: > Can you guys stop peeing your pants over this, put aside > your differences, and work on a mutually acceptable fix > for these bugs? > > Much appreciated, thanks :-) In my last email, I stated that this discussion seems to have demonstrated that the e1000 driver's netpoll is indeed broken, and needs to be fixed. I submitted eariler a patch for this, but it's untested and someone who owns an e1000 needs to try it. As for all the netpoll issues, I'm satisfied with whatever you guys decide. But I've seen lots of problems posted over the netpoll and e1000, where people send in patches that do everything but fix the e1000, and that's where I chimed in. Thank you, my pants are dry now :-) -- Steve - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
Can you guys stop peeing your pants over this, put aside your differences, and work on a mutually acceptable fix for these bugs? Much appreciated, thanks :-) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
btw., the current NR_SKBS 32 in netpoll.c seems quite low, especially e1000 can have a whole lot more skbs queued at once. Might be more robust to increase it to 128 or 256? Ingo - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
* Andi Kleen <[EMAIL PROTECTED]> wrote: > On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote: > > The netpoll philosophy is to assume that its traffic is an absolute > > priority - it is better to potentially hang trying to deliver a panic > > message than to give up and crash silently. > > That would be ok if netpoll was only used to deliver panics. But it is > not. It delivers all messages, and you cannot hang the kernel during > that. Actually even for panics it is wrong, because often it is more > important to reboot in a panic than (with a panic timeout) to actually > deliver the panic. That's needed e.g. in a failover cluster. without going into the merits of this discussion, reliable failover clusters must include (and do include) an external ability to cut power. No amount of in-kernel logic will prevent the kernel from hanging, given a bad enough kernel bug. So the right question is not 'can we prevent the kernel from hanging, ever' (we cannot), but 'which change makes it less likely for the kernel to hang'. (and, obviously: assuming all other kernel components are functioning per specification, netpoll itself most not hang :-) even a plain printk to VGA can hang in certain kernel crashes. Netpoll is more complex and thus has more exposure to hangs. E.g. netpoll relies on the network driver to correctly recycle skbs within a bound amount of time. If the network driver leaks skbs, it's game over for netpoll. [ i'd prefer a hang over nondeterministic behavior, and e.g. losing console messages is sure nondeterministic behavior. What if the console message is "WARNING: the box has just been broken into"? ] we could do one thing (see the patch below): i think it would be useful to fill up the netlogging skb queue straight at initialization time. Especially if netpoll is used for dumping alone, the system might not be in a situation to fill up the queue at the point of crash, so better be a bit more prepared and keep the pipeline filled. Ingo Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> --- net/core/netpoll.c.orig +++ net/core/netpoll.c @@ -720,6 +720,8 @@ int netpoll_setup(struct netpoll *np) } /* last thing to do is link it to the net device structure */ ndev->npinfo = npinfo; + /* fill up the skb queue */ + refill_skbs(); return 0; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Saturday 06 August 2005 12:32, Steven Rostedt wrote: > > > If you need to really get the data out, then the design should be > > > changed. Have some return value showing the failure, check for > > > oops_in_progress or whatever, and try again after turning interrupts > > > back on, and getting to a point where the system can free up memory > > > (write to swap, etc). Just a busy loop without ever getting a skb is > > > just bad. > > > > Why, pray tell, do you think there will be a second chance after > > re-enabling interrupts? How does this work when we're panicking or > > oopsing where we most care? How does this work when the netpoll client > > is the kernel debugger and the machine is completely stopped because > > we're tracing it? > > What I meant was to check for an oops and maybe then don't break out. > Otherwise let the system try to reclaim memory. Since this is locked > when the alloc_skb called with GFP_ATOMIC and fails. You might want to take a look at my stupid little __GFP_MEMALLOC hack in the network block IO deadlock thread on netdev. It will let you use the memalloc reserve from atomic context. As long as you can be sure your usage will be bounded and you will eventually give it back, this should be ok. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, 2005-08-05 at 18:53 -0700, Matt Mackall wrote: > On Fri, Aug 05, 2005 at 08:23:55PM -0400, Steven Rostedt wrote: [...] > > If you need to really get the data out, then the design should be > > changed. Have some return value showing the failure, check for > > oops_in_progress or whatever, and try again after turning interrupts > > back on, and getting to a point where the system can free up memory > > (write to swap, etc). Just a busy loop without ever getting a skb is > > just bad. > > Why, pray tell, do you think there will be a second chance after > re-enabling interrupts? How does this work when we're panicking or > oopsing where we most care? How does this work when the netpoll client > is the kernel debugger and the machine is completely stopped because > we're tracing it? What I meant was to check for an oops and maybe then don't break out. Otherwise let the system try to reclaim memory. Since this is locked when the alloc_skb called with GFP_ATOMIC and fails. > > As for busy loops, let me direct you to the "poll" part of the name. > It is in fact the whole point. In the kernel I would think that a poll would probe for an event and let the system continue if the event hasn't arrived. Not block all activities until an event has arrived. > > > > > So even a long timeout would not do? So you don't even get a message to > > > > the console? > > > > > > In general, there's no way to measure time here. And if we're > > > using netconsole, what makes you think there's any other console? > > > > Why assume that there isn't another console? The screen may be used > > with netconsole, you just lose whatever has been scrolled too far. > > Yes, there may be another console, but we should by no means depend on > that being the case. We should in fact assume it's not. > > > > > > > Also, as Andi told me, the printk here would probably not show up > > > > > > anyway if this happens with netconsole. > > > > > > > > > > That's fine. But in fact, it does show up occassionally - I've seen > > > > > it. > > > > > > > > Then maybe what Andi told me is not true ;-) > > > > > > > > Oh, and did your machine crash when you saw it? Have you seen it with > > > > the e1000 driver? > > > > > > No and no. Most of my own testing is done with tg3. > > > > > > > If you saw the message and the system didn't crash, then that's proof > > that if the driver is not working properly, you would have lock up the > > system, and the system was _not_ in a state that it _had_ to get the > > message out. > > Let me be more precise. I've seen it in the middle of an oops dump, > where it complained, then made further progress, and then died. In > other words, the code works. And I've since upped the pool size. OK, this is more clear than what you said previously. When I asked if the system crashed, I should have asked if the system was crashing. I thought that you meant that you saw this in normal activity with no oops. So, if anything, this discussion has pointed out that the e1000 has a problem with its netpoll. I wrote an earlier patch, but since I don't own a e1000, someone will need to test it, or at least check to see if it looks OK. -- Steve - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 08:23:55PM -0400, Steven Rostedt wrote: > On Fri, 2005-08-05 at 14:28 -0700, Matt Mackall wrote: > > > > Netpoll generally must assume it won't get a second chance, as it's > > being called by things like oops() and panic() and used by things like > > kgdb. If netpoll fails, the box is dead anyway. > > But it is also being called by every printk in the kernel. What happens > when the printk that causes this lock up is not a panic but just some > info print. One would use netconsole when they turn on more verbose > printing, to keep the output fast, right? So if the system gets a > little memory tight, but not to the point of failing, this will cause a > lock up and no one would know why. This doesn't happen, or if it does, it happens far less often than genuine crashes. This can only happen when netpoll's burned through its entire pool of SKBs in a single interrupt and the card never releases them, despite repeated prodding. In other words, you'll get most of a dump out of the box, but you're probably screwed no matter what you do. Also note that _any_ printk can be the kernel's dying breath. This is why for both serial and video we do polled I/O to be sure we actually get our message out. Netconsole is no different. > If you need to really get the data out, then the design should be > changed. Have some return value showing the failure, check for > oops_in_progress or whatever, and try again after turning interrupts > back on, and getting to a point where the system can free up memory > (write to swap, etc). Just a busy loop without ever getting a skb is > just bad. Why, pray tell, do you think there will be a second chance after re-enabling interrupts? How does this work when we're panicking or oopsing where we most care? How does this work when the netpoll client is the kernel debugger and the machine is completely stopped because we're tracing it? As for busy loops, let me direct you to the "poll" part of the name. It is in fact the whole point. > > > So even a long timeout would not do? So you don't even get a message to > > > the console? > > > > In general, there's no way to measure time here. And if we're > > using netconsole, what makes you think there's any other console? > > Why assume that there isn't another console? The screen may be used > with netconsole, you just lose whatever has been scrolled too far. Yes, there may be another console, but we should by no means depend on that being the case. We should in fact assume it's not. > > > > > Also, as Andi told me, the printk here would probably not show up > > > > > anyway if this happens with netconsole. > > > > > > > > That's fine. But in fact, it does show up occassionally - I've seen > > > > it. > > > > > > Then maybe what Andi told me is not true ;-) > > > > > > Oh, and did your machine crash when you saw it? Have you seen it with > > > the e1000 driver? > > > > No and no. Most of my own testing is done with tg3. > > > > If you saw the message and the system didn't crash, then that's proof > that if the driver is not working properly, you would have lock up the > system, and the system was _not_ in a state that it _had_ to get the > message out. Let me be more precise. I've seen it in the middle of an oops dump, where it complained, then made further progress, and then died. In other words, the code works. And I've since upped the pool size. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 11:51:18PM +0200, Andi Kleen wrote: > > > If that was the policy it would be a quite dumb one and make netpoll > > > totally unsuitable for production use. I hope it is not. > > > > Suggest you rip __GFP_NOFAIL out of JBD before complaining about this. > > So you're suggesting we should become as bad at handling networking > errors as we are at handling IO errors? No, I'm suggesting that the machine will hang forever sometimes and no amount of patching up netpoll or JBD, etc. will fix that. A hardware watchdog is a requirement for robust failover anyway and if you think otherwise, you're dreaming. And for reference, both are examples of theoretical should-never-happen memory allocation failures. > > > Dave, can you please apply the timeout patch anyways? > > > > Yes, let's go right over the maintainer's objections almost > > immediately after he chimes into the thread. I'll spare the list the > > colorful language this inspires. > > Sure when the maintainer has a unreasonable position on something > I think that's justified. Yours in this case is clearly unreasonable. What's clear is that you didn't like my position from my very first post in this thread and immediately went for the nuclear option without even trying to discuss it. Are you even aware that the patch we're discussing here is for a problem that has yet to be observed and that Steven's initial analysis had missed a couple things? -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, 2005-08-05 at 23:26 +0200, Andi Kleen wrote: > I suspect Steven's patch for the e1000 is needed in addition to > handle different cases too. > I haven't tested it. Someone with a e1000 must see if it works. I submitted the e100 fix that had the same problem, but I would feel better if the patch I sent for the e1000 actually got tested. To test, one would setup a box with the e1000 and netconsole. Run with something doing several printks (possible using sysrq-t or such), and then unplug the cable (without Andi's patch) and replug it back in. If the patch worked, the system would hang while the cable was detached, but come back shortly after the cable was plugged back in. -- Steve - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, 2005-08-05 at 14:28 -0700, Matt Mackall wrote: > > Netpoll generally must assume it won't get a second chance, as it's > being called by things like oops() and panic() and used by things like > kgdb. If netpoll fails, the box is dead anyway. > But it is also being called by every printk in the kernel. What happens when the printk that causes this lock up is not a panic but just some info print. One would use netconsole when they turn on more verbose printing, to keep the output fast, right? So if the system gets a little memory tight, but not to the point of failing, this will cause a lock up and no one would know why. If you need to really get the data out, then the design should be changed. Have some return value showing the failure, check for oops_in_progress or whatever, and try again after turning interrupts back on, and getting to a point where the system can free up memory (write to swap, etc). Just a busy loop without ever getting a skb is just bad. > > > The netpoll philosophy is to assume that its traffic is an absolute > > > priority - it is better to potentially hang trying to deliver a panic > > > message than to give up and crash silently. > > > > So even a long timeout would not do? So you don't even get a message to > > the console? > > In general, there's no way to measure time here. And if we're > using netconsole, what makes you think there's any other console? Why assume that there isn't another console? The screen may be used with netconsole, you just lose whatever has been scrolled too far. > > > > > Also, as Andi told me, the printk here would probably not show up > > > > anyway if this happens with netconsole. > > > > > > That's fine. But in fact, it does show up occassionally - I've seen > > > it. > > > > Then maybe what Andi told me is not true ;-) > > > > Oh, and did your machine crash when you saw it? Have you seen it with > > the e1000 driver? > > No and no. Most of my own testing is done with tg3. > If you saw the message and the system didn't crash, then that's proof that if the driver is not working properly, you would have lock up the system, and the system was _not_ in a state that it _had_ to get the message out. -- Steve - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
> > If that was the policy it would be a quite dumb one and make netpoll > > totally unsuitable for production use. I hope it is not. > > Suggest you rip __GFP_NOFAIL out of JBD before complaining about this. So you're suggesting we should become as bad at handling networking errors as we are at handling IO errors? > > Dave, can you please apply the timeout patch anyways? > > Yes, let's go right over the maintainer's objections almost > immediately after he chimes into the thread. I'll spare the list the > colorful language this inspires. Sure when the maintainer has a unreasonable position on something I think that's justified. Yours in this case is clearly unreasonable. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 11:26:10PM +0200, Andi Kleen wrote: > On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote: > > The netpoll philosophy is to assume that its traffic is an absolute > > priority - it is better to potentially hang trying to deliver a panic > > message than to give up and crash silently. > > That would be ok if netpoll was only used to deliver panics. But > it is not. It delivers all messages, and you cannot hang the > kernel during that. Actually even for panics it is wrong, because often > it is more important to reboot in a panic than (with a panic timeout) > to actually deliver the panic. That's needed e.g. in a failover cluster. > > If that was the policy it would be a quite dumb one and make netpoll > totally unsuitable for production use. I hope it is not. Suggest you rip __GFP_NOFAIL out of JBD before complaining about this. > Dave, can you please apply the timeout patch anyways? Yes, let's go right over the maintainer's objections almost immediately after he chimes into the thread. I'll spare the list the colorful language this inspires. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote: > The netpoll philosophy is to assume that its traffic is an absolute > priority - it is better to potentially hang trying to deliver a panic > message than to give up and crash silently. That would be ok if netpoll was only used to deliver panics. But it is not. It delivers all messages, and you cannot hang the kernel during that. Actually even for panics it is wrong, because often it is more important to reboot in a panic than (with a panic timeout) to actually deliver the panic. That's needed e.g. in a failover cluster. If that was the policy it would be a quite dumb one and make netpoll totally unsuitable for production use. I hope it is not. > > > Also, as Andi told me, the printk here would probably not show up > > anyway if this happens with netconsole. > > That's fine. But in fact, it does show up occassionally - I've seen > it. > > NAK'ed. Too bad. This would mean that all serious non toy users of netpoll would have to carry this patch on their own. But that wouldn't be good. Dave, can you please apply the timeout patch anyways? I suspect Steven's patch for the e1000 is needed in addition to handle different cases too. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 04:57:00PM -0400, Steven Rostedt wrote: > On Fri, 2005-08-05 at 13:01 -0700, Matt Mackall wrote: > > On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote: > > > Looking at the netpoll routines, I noticed that the find_skb could > > > lockup if the memory is low. This is because the allocations are > > > called with GFP_ATOMIC (since this is in interrupt context) and if > > > it fails, it will continue to fail. This is just by observing the > > > code, I didn't have this actually happen. So if this is not the > > > case, please let me know how it can get out. Otherwise, please > > > accept this patch. > > > > By netpoll_poll() tickling the driver enough to free the currently > > queued outgoing SKBs. > > I believe that the e1000 wont free up any outgoing packets since the > netpoll call doesn't seem to get to the e1000_clean_tx part of the > e1000_intr, otherwise the system wouldn't lock under the > netpoll_send_skb when one disconnects the wire and puts it back in. The > disconnect would lock it up anyway (with Andi's patch it now doesn't) > but since it won't come back up after the link is back up, there seems > to be something wrong with the e1000 netpoll driver. This is because > the e1000_netpoll doesn't seem to be cleaning up the tx buffer and start > the queue back up. That does seem like a driver problem. > > Also note that by the time we're in this loop, we're ready to take > > desperate measures. We've already exhausted our private queue of SKBs > > so we have no alternative but to keep kicking the driver until > > something happens. > > OK, the system is under heavy memory load and starts eating up the > netpoll packets. When the last packet is gone, and you have something > like the e1000 that doesn't clean up its packets with netpoll, then you > just locked up the system. > > The scary part of this loop is that if the netpoll doesn't come up with > the goods, its game over. Say we are at desperate measures but it could > be a case where we need to output more information and lockup here > before we can go out and free some memory. Realistically, we were probably going to crash anyway at this point as we're apparently failing to recycle SKBs. Netpoll generally must assume it won't get a second chance, as it's being called by things like oops() and panic() and used by things like kgdb. If netpoll fails, the box is dead anyway. > > The netpoll philosophy is to assume that its traffic is an absolute > > priority - it is better to potentially hang trying to deliver a panic > > message than to give up and crash silently. > > So even a long timeout would not do? So you don't even get a message to > the console? In general, there's no way to measure time here. And if we're using netconsole, what makes you think there's any other console? > > > Also, as Andi told me, the printk here would probably not show up > > > anyway if this happens with netconsole. > > > > That's fine. But in fact, it does show up occassionally - I've seen > > it. > > Then maybe what Andi told me is not true ;-) > > Oh, and did your machine crash when you saw it? Have you seen it with > the e1000 driver? No and no. Most of my own testing is done with tg3. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, 2005-08-05 at 13:01 -0700, Matt Mackall wrote: > On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote: > > Looking at the netpoll routines, I noticed that the find_skb could > > lockup if the memory is low. This is because the allocations are > > called with GFP_ATOMIC (since this is in interrupt context) and if > > it fails, it will continue to fail. This is just by observing the > > code, I didn't have this actually happen. So if this is not the > > case, please let me know how it can get out. Otherwise, please > > accept this patch. > > By netpoll_poll() tickling the driver enough to free the currently > queued outgoing SKBs. I believe that the e1000 wont free up any outgoing packets since the netpoll call doesn't seem to get to the e1000_clean_tx part of the e1000_intr, otherwise the system wouldn't lock under the netpoll_send_skb when one disconnects the wire and puts it back in. The disconnect would lock it up anyway (with Andi's patch it now doesn't) but since it won't come back up after the link is back up, there seems to be something wrong with the e1000 netpoll driver. This is because the e1000_netpoll doesn't seem to be cleaning up the tx buffer and start the queue back up. > > Also note that by the time we're in this loop, we're ready to take > desperate measures. We've already exhausted our private queue of SKBs > so we have no alternative but to keep kicking the driver until > something happens. OK, the system is under heavy memory load and starts eating up the netpoll packets. When the last packet is gone, and you have something like the e1000 that doesn't clean up its packets with netpoll, then you just locked up the system. The scary part of this loop is that if the netpoll doesn't come up with the goods, its game over. Say we are at desperate measures but it could be a case where we need to output more information and lockup here before we can go out and free some memory. > > The netpoll philosophy is to assume that its traffic is an absolute > priority - it is better to potentially hang trying to deliver a panic > message than to give up and crash silently. So even a long timeout would not do? So you don't even get a message to the console? > > > Also, as Andi told me, the printk here would probably not show up > > anyway if this happens with netconsole. > > That's fine. But in fact, it does show up occassionally - I've seen > it. Then maybe what Andi told me is not true ;-) Oh, and did your machine crash when you saw it? Have you seen it with the e1000 driver? > > NAK'ed. > (ouch!) OK, since my argument is currently only theory, and I don't have a e1000 card to test this on, I'll take out my fix to the e100 (where it cleaned up it's tx drivers in netpoll) and see if I can get the machine to lockup here just by putting it under extreme memory loads. -- Steve - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netpoll can lock up on low memory.
On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote: > Looking at the netpoll routines, I noticed that the find_skb could > lockup if the memory is low. This is because the allocations are > called with GFP_ATOMIC (since this is in interrupt context) and if > it fails, it will continue to fail. This is just by observing the > code, I didn't have this actually happen. So if this is not the > case, please let me know how it can get out. Otherwise, please > accept this patch. By netpoll_poll() tickling the driver enough to free the currently queued outgoing SKBs. Also note that by the time we're in this loop, we're ready to take desperate measures. We've already exhausted our private queue of SKBs so we have no alternative but to keep kicking the driver until something happens. The netpoll philosophy is to assume that its traffic is an absolute priority - it is better to potentially hang trying to deliver a panic message than to give up and crash silently. > Also, as Andi told me, the printk here would probably not show up > anyway if this happens with netconsole. That's fine. But in fact, it does show up occassionally - I've seen it. NAK'ed. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] netpoll can lock up on low memory.
Looking at the netpoll routines, I noticed that the find_skb could lockup if the memory is low. This is because the allocations are called with GFP_ATOMIC (since this is in interrupt context) and if it fails, it will continue to fail. This is just by observing the code, I didn't have this actually happen. So if this is not the case, please let me know how it can get out. Otherwise, please accept this patch. Also, as Andi told me, the printk here would probably not show up anyway if this happens with netconsole. Here I changed it to break out instead of just looping. -- Steve Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- linux-2.6.13-rc3/net/core/netpoll.c.orig2005-08-05 09:37:00.0 -0400 +++ linux-2.6.13-rc3/net/core/netpoll.c 2005-08-05 10:29:32.0 -0400 @@ -229,8 +229,9 @@ repeat: } if(!skb) { - count++; - if (once && (count == 100)) { + if (count++ == 10) + return NULL; + if (once) printk("out of netpoll skbs!\n"); once = 0; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html