Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Matt Mackall
On Sat, Aug 06, 2005 at 05:57:20AM -0400, Steven Rostedt wrote:
> On Sat, 2005-08-06 at 02:46 -0700, David S. Miller wrote:
> > Can you guys stop peeing your pants over this, put aside
> > your differences, and work on a mutually acceptable fix
> > for these bugs?
> > 
> > Much appreciated, thanks :-)
> 
> In my last email, I stated that this discussion seems to have
> demonstrated that the e1000 driver's netpoll is indeed broken, and needs
> to be fixed.  I submitted eariler a patch for this, but it's untested
> and someone who owns an e1000 needs to try it.

I've got your e1000 change in my queue and I'll try to test it
tomorrow (realized I've got e1000 in my laptop).
 
> As for all the netpoll issues, I'm satisfied with whatever you guys
> decide.  But I've seen lots of problems posted over the netpoll and
> e1000, where people send in patches that do everything but fix the
> e1000, and that's where I chimed in.

Andi's patch looks like it fixes a related but slightly different
problem. I'm working on a variant. And I'll try to make the skb
allocation code eventually give up too.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Matt Mackall
On Sat, Aug 06, 2005 at 09:58:27AM +0200, Ingo Molnar wrote:
> 
> btw., the current NR_SKBS 32 in netpoll.c seems quite low, especially 
> e1000 can have a whole lot more skbs queued at once. Might be more 
> robust to increase it to 128 or 256?

Not sure that the card's queueing really makes a difference. It either
eventually releases the queued SKBs or it doesn't. What's more
important is that we be able to survive bursts like the output of
sysrq-t. This seems to work already.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread John Bäckstrand

Steven Rostedt wrote:


In my last email, I stated that this discussion seems to have
demonstrated that the e1000 driver's netpoll is indeed broken, and needs
to be fixed.  I submitted eariler a patch for this, but it's untested
and someone who owns an e1000 needs to try it.


I can test this, but not right now: Im trying, again, to find my hard 
lockup issue, and so I will try to run this machine until it locks up. 
It lasted 9 days at one time, so it could potentially take some time, 
I'm afraid.


---
John Bäckstrand
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Andi Kleen
On Sat, Aug 06, 2005 at 09:45:03AM +0200, Ingo Molnar wrote:
> 
> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote:
> > > The netpoll philosophy is to assume that its traffic is an absolute
> > > priority - it is better to potentially hang trying to deliver a panic
> > > message than to give up and crash silently.
> > 
> > That would be ok if netpoll was only used to deliver panics. But it is 
> > not. It delivers all messages, and you cannot hang the kernel during 
> > that. Actually even for panics it is wrong, because often it is more 
> > important to reboot in a panic than (with a panic timeout) to actually 
> > deliver the panic. That's needed e.g. in a failover cluster.
> 
> without going into the merits of this discussion, reliable failover 
> clusters must include (and do include) an external ability to cut power.  
> No amount of in-kernel logic will prevent the kernel from hanging, given 
> a bad enough kernel bug.

Ok, true, but we should do a best effort.

> 
> So the right question is not 'can we prevent the kernel from hanging, 
> ever' (we cannot), but 'which change makes it less likely for the kernel 
> to hang'. (and, obviously: assuming all other kernel components are 
> functioning per specification, netpoll itself most not hang :-)
> 
> even a plain printk to VGA can hang in certain kernel crashes. Netpoll 
> is more complex and thus has more exposure to hangs. E.g. netpoll relies 
> on the network driver to correctly recycle skbs within a bound amount of 
> time. If the network driver leaks skbs, it's game over for netpoll.

I don't think we even need to think about such rare cases,
until the easy cases ("everything hangs when the cable is pulled") 
are not fixed.

> [ i'd prefer a hang over nondeterministic behavior, and e.g. losing 
>   console messages is sure nondeterministic behavior. What if the 
>   console message is "WARNING: the box has just been broken into"? ]

That just makes netconsole useless in production. If it causes frequenet
hangs people will not use it.


> 
> we could do one thing (see the patch below): i think it would be useful 
> to fill up the netlogging skb queue straight at initialization time.  
> Especially if netpoll is used for dumping alone, the system might not be 
> in a situation to fill up the queue at the point of crash, so better be 
> a bit more prepared and keep the pipeline filled.

You're solving a completely different issue here?

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Steven Rostedt
On Sat, 2005-08-06 at 02:46 -0700, David S. Miller wrote:
> Can you guys stop peeing your pants over this, put aside
> your differences, and work on a mutually acceptable fix
> for these bugs?
> 
> Much appreciated, thanks :-)

In my last email, I stated that this discussion seems to have
demonstrated that the e1000 driver's netpoll is indeed broken, and needs
to be fixed.  I submitted eariler a patch for this, but it's untested
and someone who owns an e1000 needs to try it.

As for all the netpoll issues, I'm satisfied with whatever you guys
decide.  But I've seen lots of problems posted over the netpoll and
e1000, where people send in patches that do everything but fix the
e1000, and that's where I chimed in.

Thank you, my pants are dry now :-)

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread David S. Miller

Can you guys stop peeing your pants over this, put aside
your differences, and work on a mutually acceptable fix
for these bugs?

Much appreciated, thanks :-)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Ingo Molnar

btw., the current NR_SKBS 32 in netpoll.c seems quite low, especially 
e1000 can have a whole lot more skbs queued at once. Might be more 
robust to increase it to 128 or 256?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote:
> > The netpoll philosophy is to assume that its traffic is an absolute
> > priority - it is better to potentially hang trying to deliver a panic
> > message than to give up and crash silently.
> 
> That would be ok if netpoll was only used to deliver panics. But it is 
> not. It delivers all messages, and you cannot hang the kernel during 
> that. Actually even for panics it is wrong, because often it is more 
> important to reboot in a panic than (with a panic timeout) to actually 
> deliver the panic. That's needed e.g. in a failover cluster.

without going into the merits of this discussion, reliable failover 
clusters must include (and do include) an external ability to cut power.  
No amount of in-kernel logic will prevent the kernel from hanging, given 
a bad enough kernel bug.

So the right question is not 'can we prevent the kernel from hanging, 
ever' (we cannot), but 'which change makes it less likely for the kernel 
to hang'. (and, obviously: assuming all other kernel components are 
functioning per specification, netpoll itself most not hang :-)

even a plain printk to VGA can hang in certain kernel crashes. Netpoll 
is more complex and thus has more exposure to hangs. E.g. netpoll relies 
on the network driver to correctly recycle skbs within a bound amount of 
time. If the network driver leaks skbs, it's game over for netpoll.

[ i'd prefer a hang over nondeterministic behavior, and e.g. losing 
  console messages is sure nondeterministic behavior. What if the 
  console message is "WARNING: the box has just been broken into"? ]

we could do one thing (see the patch below): i think it would be useful 
to fill up the netlogging skb queue straight at initialization time.  
Especially if netpoll is used for dumping alone, the system might not be 
in a situation to fill up the queue at the point of crash, so better be 
a bit more prepared and keep the pipeline filled.

Ingo

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

--- net/core/netpoll.c.orig
+++ net/core/netpoll.c
@@ -720,6 +720,8 @@ int netpoll_setup(struct netpoll *np)
}
/* last thing to do is link it to the net device structure */
ndev->npinfo = npinfo;
+   /* fill up the skb queue */
+   refill_skbs();
 
return 0;
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Daniel Phillips
On Saturday 06 August 2005 12:32, Steven Rostedt wrote:
> > > If you need to really get the data out, then the design should be
> > > changed.  Have some return value showing the failure, check for
> > > oops_in_progress or whatever, and try again after turning interrupts
> > > back on, and getting to a point where the system can free up memory
> > > (write to swap, etc).  Just a busy loop without ever getting a skb is
> > > just bad.
> >
> > Why, pray tell, do you think there will be a second chance after
> > re-enabling interrupts? How does this work when we're panicking or
> > oopsing where we most care? How does this work when the netpoll client
> > is the kernel debugger and the machine is completely stopped because
> > we're tracing it?
>
> What I meant was to check for an oops and maybe then don't break out.
> Otherwise let the system try to reclaim memory. Since this is locked
> when the alloc_skb called with GFP_ATOMIC and fails.

You might want to take a look at my stupid little __GFP_MEMALLOC hack in the 
network block IO deadlock thread on netdev.  It will let you use the memalloc 
reserve from atomic context.  As long as you can be sure your usage will be 
bounded and you will eventually give it back, this should be ok.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Steven Rostedt
On Fri, 2005-08-05 at 18:53 -0700, Matt Mackall wrote:
> On Fri, Aug 05, 2005 at 08:23:55PM -0400, Steven Rostedt wrote:
[...]
> > If you need to really get the data out, then the design should be
> > changed.  Have some return value showing the failure, check for
> > oops_in_progress or whatever, and try again after turning interrupts
> > back on, and getting to a point where the system can free up memory
> > (write to swap, etc).  Just a busy loop without ever getting a skb is
> > just bad.
> 
> Why, pray tell, do you think there will be a second chance after
> re-enabling interrupts? How does this work when we're panicking or
> oopsing where we most care? How does this work when the netpoll client
> is the kernel debugger and the machine is completely stopped because
> we're tracing it?

What I meant was to check for an oops and maybe then don't break out.
Otherwise let the system try to reclaim memory. Since this is locked
when the alloc_skb called with GFP_ATOMIC and fails.

> 
> As for busy loops, let me direct you to the "poll" part of the name.
> It is in fact the whole point.

In the kernel I would think that a poll would probe for an event and let
the system continue if the event hasn't arrived.  Not block all
activities until an event has arrived.

> 
> > > > So even a long timeout would not do?  So you don't even get a message to
> > > > the console?
> > > 
> > > In general, there's no way to measure time here. And if we're
> > > using netconsole, what makes you think there's any other console?
> > 
> > Why assume that there isn't another console?  The screen may be used
> > with netconsole, you just lose whatever has been scrolled too far.
> 
> Yes, there may be another console, but we should by no means depend on
> that being the case. We should in fact assume it's not.
> 
> > > > > > Also, as Andi told me, the printk here would probably not show up
> > > > > > anyway if this happens with netconsole.
> > > > > 
> > > > > That's fine. But in fact, it does show up occassionally - I've seen
> > > > > it.
> > > > 
> > > > Then maybe what Andi told me is not true ;-)
> > > > 
> > > > Oh, and did your machine crash when you saw it?  Have you seen it with
> > > > the e1000 driver?
> > > 
> > > No and no. Most of my own testing is done with tg3.
> > > 
> > 
> > If you saw the message and the system didn't crash, then that's proof
> > that if the driver is not working properly, you would have lock up the
> > system, and the system was _not_ in a state that it _had_ to get the
> > message out.
> 
> Let me be more precise. I've seen it in the middle of an oops dump,
> where it complained, then made further progress, and then died. In
> other words, the code works. And I've since upped the pool size.

OK, this is more clear than what you said previously.  When I asked if
the system crashed, I should have asked if the system was crashing.  I
thought that you meant that you saw this in normal activity with no
oops.

So, if anything, this discussion has pointed out that the e1000 has a
problem with its netpoll.  I wrote an earlier patch, but since I don't
own a e1000, someone will need to test it, or at least check to see if
it looks OK.

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Matt Mackall
On Fri, Aug 05, 2005 at 08:23:55PM -0400, Steven Rostedt wrote:
> On Fri, 2005-08-05 at 14:28 -0700, Matt Mackall wrote:
> > 
> > Netpoll generally must assume it won't get a second chance, as it's
> > being called by things like oops() and panic() and used by things like
> > kgdb. If netpoll fails, the box is dead anyway.
> 
> But it is also being called by every printk in the kernel. What happens
> when the printk that causes this lock up is not a panic but just some
> info print.  One would use netconsole when they turn on more verbose
> printing, to keep the output fast, right?  So if the system gets a
> little memory tight, but not to the point of failing, this will cause a
> lock up and no one would know why. 

This doesn't happen, or if it does, it happens far less often than
genuine crashes. This can only happen when netpoll's burned through
its entire pool of SKBs in a single interrupt and the card never
releases them, despite repeated prodding. In other words, you'll get
most of a dump out of the box, but you're probably screwed no matter
what you do.

Also note that _any_ printk can be the kernel's dying breath. This is
why for both serial and video we do polled I/O to be sure we actually
get our message out. Netconsole is no different.

> If you need to really get the data out, then the design should be
> changed.  Have some return value showing the failure, check for
> oops_in_progress or whatever, and try again after turning interrupts
> back on, and getting to a point where the system can free up memory
> (write to swap, etc).  Just a busy loop without ever getting a skb is
> just bad.

Why, pray tell, do you think there will be a second chance after
re-enabling interrupts? How does this work when we're panicking or
oopsing where we most care? How does this work when the netpoll client
is the kernel debugger and the machine is completely stopped because
we're tracing it?

As for busy loops, let me direct you to the "poll" part of the name.
It is in fact the whole point.

> > > So even a long timeout would not do?  So you don't even get a message to
> > > the console?
> > 
> > In general, there's no way to measure time here. And if we're
> > using netconsole, what makes you think there's any other console?
> 
> Why assume that there isn't another console?  The screen may be used
> with netconsole, you just lose whatever has been scrolled too far.

Yes, there may be another console, but we should by no means depend on
that being the case. We should in fact assume it's not.

> > > > > Also, as Andi told me, the printk here would probably not show up
> > > > > anyway if this happens with netconsole.
> > > > 
> > > > That's fine. But in fact, it does show up occassionally - I've seen
> > > > it.
> > > 
> > > Then maybe what Andi told me is not true ;-)
> > > 
> > > Oh, and did your machine crash when you saw it?  Have you seen it with
> > > the e1000 driver?
> > 
> > No and no. Most of my own testing is done with tg3.
> > 
> 
> If you saw the message and the system didn't crash, then that's proof
> that if the driver is not working properly, you would have lock up the
> system, and the system was _not_ in a state that it _had_ to get the
> message out.

Let me be more precise. I've seen it in the middle of an oops dump,
where it complained, then made further progress, and then died. In
other words, the code works. And I've since upped the pool size.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Matt Mackall
On Fri, Aug 05, 2005 at 11:51:18PM +0200, Andi Kleen wrote:
> > > If that was the policy it would be a quite dumb one and make netpoll
> > > totally unsuitable for production use. I hope it is not.
> > 
> > Suggest you rip __GFP_NOFAIL out of JBD before complaining about this.
> 
> So you're suggesting we should become as bad at handling networking
> errors as we are at handling IO errors?  

No, I'm suggesting that the machine will hang forever sometimes and no
amount of patching up netpoll or JBD, etc. will fix that. A hardware
watchdog is a requirement for robust failover anyway and if you think
otherwise, you're dreaming.

And for reference, both are examples of theoretical
should-never-happen memory allocation failures.
 
> > > Dave, can you please apply the timeout patch anyways?
> > 
> > Yes, let's go right over the maintainer's objections almost
> > immediately after he chimes into the thread. I'll spare the list the
> > colorful language this inspires.
> 
> Sure when the maintainer has a unreasonable position on something
> I think that's justified. Yours in this case is clearly unreasonable.

What's clear is that you didn't like my position from my very first
post in this thread and immediately went for the nuclear option
without even trying to discuss it.

Are you even aware that the patch we're discussing here is for a problem
that has yet to be observed and that Steven's initial analysis had
missed a couple things?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Steven Rostedt
On Fri, 2005-08-05 at 23:26 +0200, Andi Kleen wrote:

> I suspect Steven's patch for the e1000 is needed in addition to
> handle different cases too.
> 

I haven't tested it. Someone with a e1000 must see if it works. I
submitted the e100 fix that had the same problem, but I would feel
better if the patch I sent for the e1000 actually got tested.

To test, one would setup a box with the e1000 and netconsole. Run with
something doing several printks (possible using sysrq-t or such), and
then unplug the cable (without Andi's patch) and replug it back in. If
the patch worked, the system would hang while the cable was detached,
but come back shortly after the cable was plugged back in.

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Steven Rostedt
On Fri, 2005-08-05 at 14:28 -0700, Matt Mackall wrote:
> 
> Netpoll generally must assume it won't get a second chance, as it's
> being called by things like oops() and panic() and used by things like
> kgdb. If netpoll fails, the box is dead anyway.
> 

But it is also being called by every printk in the kernel. What happens
when the printk that causes this lock up is not a panic but just some
info print.  One would use netconsole when they turn on more verbose
printing, to keep the output fast, right?  So if the system gets a
little memory tight, but not to the point of failing, this will cause a
lock up and no one would know why. 

If you need to really get the data out, then the design should be
changed.  Have some return value showing the failure, check for
oops_in_progress or whatever, and try again after turning interrupts
back on, and getting to a point where the system can free up memory
(write to swap, etc).  Just a busy loop without ever getting a skb is
just bad.

> > > The netpoll philosophy is to assume that its traffic is an absolute
> > > priority - it is better to potentially hang trying to deliver a panic
> > > message than to give up and crash silently.
> > 
> > So even a long timeout would not do?  So you don't even get a message to
> > the console?
> 
> In general, there's no way to measure time here. And if we're
> using netconsole, what makes you think there's any other console?

Why assume that there isn't another console?  The screen may be used
with netconsole, you just lose whatever has been scrolled too far.

> 
> > > > Also, as Andi told me, the printk here would probably not show up
> > > > anyway if this happens with netconsole.
> > > 
> > > That's fine. But in fact, it does show up occassionally - I've seen
> > > it.
> > 
> > Then maybe what Andi told me is not true ;-)
> > 
> > Oh, and did your machine crash when you saw it?  Have you seen it with
> > the e1000 driver?
> 
> No and no. Most of my own testing is done with tg3.
> 

If you saw the message and the system didn't crash, then that's proof
that if the driver is not working properly, you would have lock up the
system, and the system was _not_ in a state that it _had_ to get the
message out.

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Andi Kleen
> > If that was the policy it would be a quite dumb one and make netpoll
> > totally unsuitable for production use. I hope it is not.
> 
> Suggest you rip __GFP_NOFAIL out of JBD before complaining about this.

So you're suggesting we should become as bad at handling networking
errors as we are at handling IO errors?  

> > Dave, can you please apply the timeout patch anyways?
> 
> Yes, let's go right over the maintainer's objections almost
> immediately after he chimes into the thread. I'll spare the list the
> colorful language this inspires.

Sure when the maintainer has a unreasonable position on something
I think that's justified. Yours in this case is clearly unreasonable.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Matt Mackall
On Fri, Aug 05, 2005 at 11:26:10PM +0200, Andi Kleen wrote:
> On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote:
> > The netpoll philosophy is to assume that its traffic is an absolute
> > priority - it is better to potentially hang trying to deliver a panic
> > message than to give up and crash silently.
> 
> That would be ok if netpoll was only used to deliver panics. But 
> it is not. It delivers all messages, and you cannot hang the 
> kernel during that. Actually even for panics it is wrong, because often
> it is more important to reboot in a panic than (with a panic timeout) 
> to actually deliver the panic. That's needed e.g. in a failover cluster.
> 
> If that was the policy it would be a quite dumb one and make netpoll
> totally unsuitable for production use. I hope it is not.

Suggest you rip __GFP_NOFAIL out of JBD before complaining about this.

> Dave, can you please apply the timeout patch anyways?

Yes, let's go right over the maintainer's objections almost
immediately after he chimes into the thread. I'll spare the list the
colorful language this inspires.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Andi Kleen
On Fri, Aug 05, 2005 at 01:01:57PM -0700, Matt Mackall wrote:
> The netpoll philosophy is to assume that its traffic is an absolute
> priority - it is better to potentially hang trying to deliver a panic
> message than to give up and crash silently.

That would be ok if netpoll was only used to deliver panics. But 
it is not. It delivers all messages, and you cannot hang the 
kernel during that. Actually even for panics it is wrong, because often
it is more important to reboot in a panic than (with a panic timeout) 
to actually deliver the panic. That's needed e.g. in a failover cluster.

If that was the policy it would be a quite dumb one and make netpoll
totally unsuitable for production use. I hope it is not.


> 
> > Also, as Andi told me, the printk here would probably not show up
> > anyway if this happens with netconsole.
> 
> That's fine. But in fact, it does show up occassionally - I've seen
> it.
> 
> NAK'ed.

Too bad. This would mean that all serious non toy users of netpoll
would have to carry this patch on their own. But that wouldn't be good.

Dave, can you please apply the timeout patch anyways?

I suspect Steven's patch for the e1000 is needed in addition to
handle different cases too.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Matt Mackall
On Fri, Aug 05, 2005 at 04:57:00PM -0400, Steven Rostedt wrote:
> On Fri, 2005-08-05 at 13:01 -0700, Matt Mackall wrote:
> > On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote:
> > > Looking at the netpoll routines, I noticed that the find_skb could
> > > lockup if the memory is low. This is because the allocations are
> > > called with GFP_ATOMIC (since this is in interrupt context) and if
> > > it fails, it will continue to fail. This is just by observing the
> > > code, I didn't have this actually happen. So if this is not the
> > > case, please let me know how it can get out. Otherwise, please
> > > accept this patch.
> > 
> > By netpoll_poll() tickling the driver enough to free the currently
> > queued outgoing SKBs.
> 
> I believe that the e1000 wont free up any outgoing packets since the
> netpoll call doesn't seem to get to the e1000_clean_tx part of the
> e1000_intr, otherwise the system wouldn't lock under the
> netpoll_send_skb when one disconnects the wire and puts it back in.  The
> disconnect would lock it up anyway (with Andi's patch it now doesn't)
> but since it won't come back up after the link is back up, there seems
> to be something wrong with the e1000 netpoll driver.  This is because
> the e1000_netpoll doesn't seem to be cleaning up the tx buffer and start
> the queue back up.

That does seem like a driver problem.

> > Also note that by the time we're in this loop, we're ready to take
> > desperate measures. We've already exhausted our private queue of SKBs
> > so we have no alternative but to keep kicking the driver until
> > something happens.
> 
> OK, the system is under heavy memory load and starts eating up the
> netpoll packets.  When the last packet is gone, and you have something
> like the e1000 that doesn't clean up its packets with netpoll, then you
> just locked up the system.
> 
> The scary part of this loop is that if the netpoll doesn't come up with
> the goods, its game over.  Say we are at desperate measures but it could
> be a case where we need to output more information and lockup here
> before we can go out and free some memory. 

Realistically, we were probably going to crash anyway at this point as
we're apparently failing to recycle SKBs.

Netpoll generally must assume it won't get a second chance, as it's
being called by things like oops() and panic() and used by things like
kgdb. If netpoll fails, the box is dead anyway.

> > The netpoll philosophy is to assume that its traffic is an absolute
> > priority - it is better to potentially hang trying to deliver a panic
> > message than to give up and crash silently.
> 
> So even a long timeout would not do?  So you don't even get a message to
> the console?

In general, there's no way to measure time here. And if we're
using netconsole, what makes you think there's any other console?

> > > Also, as Andi told me, the printk here would probably not show up
> > > anyway if this happens with netconsole.
> > 
> > That's fine. But in fact, it does show up occassionally - I've seen
> > it.
> 
> Then maybe what Andi told me is not true ;-)
> 
> Oh, and did your machine crash when you saw it?  Have you seen it with
> the e1000 driver?

No and no. Most of my own testing is done with tg3.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Steven Rostedt
On Fri, 2005-08-05 at 13:01 -0700, Matt Mackall wrote:
> On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote:
> > Looking at the netpoll routines, I noticed that the find_skb could
> > lockup if the memory is low. This is because the allocations are
> > called with GFP_ATOMIC (since this is in interrupt context) and if
> > it fails, it will continue to fail. This is just by observing the
> > code, I didn't have this actually happen. So if this is not the
> > case, please let me know how it can get out. Otherwise, please
> > accept this patch.
> 
> By netpoll_poll() tickling the driver enough to free the currently
> queued outgoing SKBs.

I believe that the e1000 wont free up any outgoing packets since the
netpoll call doesn't seem to get to the e1000_clean_tx part of the
e1000_intr, otherwise the system wouldn't lock under the
netpoll_send_skb when one disconnects the wire and puts it back in.  The
disconnect would lock it up anyway (with Andi's patch it now doesn't)
but since it won't come back up after the link is back up, there seems
to be something wrong with the e1000 netpoll driver.  This is because
the e1000_netpoll doesn't seem to be cleaning up the tx buffer and start
the queue back up.

> 
> Also note that by the time we're in this loop, we're ready to take
> desperate measures. We've already exhausted our private queue of SKBs
> so we have no alternative but to keep kicking the driver until
> something happens.

OK, the system is under heavy memory load and starts eating up the
netpoll packets.  When the last packet is gone, and you have something
like the e1000 that doesn't clean up its packets with netpoll, then you
just locked up the system.

The scary part of this loop is that if the netpoll doesn't come up with
the goods, its game over.  Say we are at desperate measures but it could
be a case where we need to output more information and lockup here
before we can go out and free some memory. 

> 
> The netpoll philosophy is to assume that its traffic is an absolute
> priority - it is better to potentially hang trying to deliver a panic
> message than to give up and crash silently.

So even a long timeout would not do?  So you don't even get a message to
the console?

> 
> > Also, as Andi told me, the printk here would probably not show up
> > anyway if this happens with netconsole.
> 
> That's fine. But in fact, it does show up occassionally - I've seen
> it.

Then maybe what Andi told me is not true ;-)

Oh, and did your machine crash when you saw it?  Have you seen it with
the e1000 driver?

> 
> NAK'ed.
> 
(ouch!)

OK, since my argument is currently only theory, and I don't have a e1000
card to test this on, I'll take out my fix to the e100 (where it cleaned
up it's tx drivers in netpoll) and see if I can get the machine to
lockup here just by putting it under extreme memory loads.

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Matt Mackall
On Fri, Aug 05, 2005 at 10:36:31AM -0400, Steven Rostedt wrote:
> Looking at the netpoll routines, I noticed that the find_skb could
> lockup if the memory is low. This is because the allocations are
> called with GFP_ATOMIC (since this is in interrupt context) and if
> it fails, it will continue to fail. This is just by observing the
> code, I didn't have this actually happen. So if this is not the
> case, please let me know how it can get out. Otherwise, please
> accept this patch.

By netpoll_poll() tickling the driver enough to free the currently
queued outgoing SKBs.

Also note that by the time we're in this loop, we're ready to take
desperate measures. We've already exhausted our private queue of SKBs
so we have no alternative but to keep kicking the driver until
something happens.

The netpoll philosophy is to assume that its traffic is an absolute
priority - it is better to potentially hang trying to deliver a panic
message than to give up and crash silently.

> Also, as Andi told me, the printk here would probably not show up
> anyway if this happens with netconsole.

That's fine. But in fact, it does show up occassionally - I've seen
it.

NAK'ed.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] netpoll can lock up on low memory.

2005-08-05 Thread Steven Rostedt
Looking at the netpoll routines, I noticed that the find_skb could
lockup if the memory is low.  This is because the allocations are called
with GFP_ATOMIC (since this is in interrupt context) and if it fails, it
will continue to fail. This is just by observing the code, I didn't have
this actually happen. So if this is not the case, please let me know how
it can get out. Otherwise, please accept this patch.  Also, as Andi told
me, the printk here would probably not show up anyway if this happens
with netconsole.

Here I changed it to break out instead of just looping.

-- Steve


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

--- linux-2.6.13-rc3/net/core/netpoll.c.orig2005-08-05 09:37:00.0 
-0400
+++ linux-2.6.13-rc3/net/core/netpoll.c 2005-08-05 10:29:32.0 -0400
@@ -229,8 +229,9 @@ repeat:
}
 
if(!skb) {
-   count++;
-   if (once && (count == 100)) {
+   if (count++ == 10)
+   return NULL;
+   if (once)
printk("out of netpoll skbs!\n");
once = 0;
}


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html