Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-22 Thread Grant Grundler
[ sorry, for the record I'm not trimming Val's comments - mine are much below]

On Wed, Jun 21, 2006 at 05:43:40PM -0700, Valerie Henson wrote:
...
> Hi folks,
> 
> The quick summary of my thoughts on this patch is that it isn't the
> ideal patch, but it works and it's well-tested.  Doing my preferred
> fix (adding a shutdown flag) would be invasive and take many weeks to
> reach the level of stability of Grant's patch.  So I'm taking this
> patch but adding a comment describing my preferred fix.
> 
> Here's the long version.  The obvious ordering of bringing up the card
> is:
> 
> request_irq()
> setup DMA resources
> enable interrupts
> start DMA engine

The problem here is request_irq() enables interrupts.
SO the third step happens as part of the first step.

And if an IRQ happens to be pending, it will blow up.
But since we reset the chip at init time, that should
never happen.

> 
> And the obvious ordering of shutting it down is:
> 
> stop DMA engine
> disable interrupts
> remove DMA resources
> free_irq()

This attempts to be symetrical with the init sequence and 
I like that approach where possible. It's not symetrical
in practice because the interrupt code restarts DMA.
(which is what you've noted below)

> The problem with the above shutdown order is that we can receive an
> interrupt in between stopping DMA and disabling interrupts, and the
> tulip irq handler will reenable DMA =><= Boom!  The solution I prefer
> is to make the irq handler aware of whether we are shutting down or
> not, and not reenable DMA in that case.

While I believe this will work too, I don't advocate this approach
because I don't like to add code to interrupt handlers unless
it's the _only_ way to fix a problem. 

>   However, it is a non-trivial
> patch to get right, and I would rather have the bug fixed short-term
> with the ordering Grant uses:
> 
> disable interrupts
> free_irq()
> stop rxtx
> remove DMA resources
> 
> Make sense to everyone?  I'll redo the patch with the comment and get
> Grant's sign-off.

Yes. I'm ok with anything similar to what you posted above:

Signed-off-by: Grant Grundler <[EMAIL PROTECTED]>


And Val, thanks for accepting the patch and taking the time to
explain where you want to go next with the code in this case.
I'll be happy to setup remote access to my test environment when
you are ready to test that next step.

thanks,
grant

> 
> -VAL
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-21 Thread Valerie Henson
On Tue, Jun 13, 2006 at 05:55:31PM -0600, Grant Grundler wrote:
> On Thu, Jun 08, 2006 at 11:01:20AM -0600, Grant Grundler wrote:
> > Here is a new patch that moves free_irq() into tulip_down().
> > The resulting code is structured the same as cp_close().
> 
> Val,
> Two details are wrong in version 2 and are fixed in v3 (appended below):
> 
> o we don't need synchronize_irq() before calling free_irq().
>   (It should be removed from cp_close() too)
>   Thanks to willy for pointing me at kernel/irq/manage.c.
> 
> o tulip_stop_rxtx() has to be called _after_ free_irq().
>   ie. v2 patch didn't fix the original race condition
>   and when under test, dies about as fast as the original code.
> 
> Tested on rx4640 (HP IA64) for several hours.
> Please apply.

Hi folks,

The quick summary of my thoughts on this patch is that it isn't the
ideal patch, but it works and it's well-tested.  Doing my preferred
fix (adding a shutdown flag) would be invasive and take many weeks to
reach the level of stability of Grant's patch.  So I'm taking this
patch but adding a comment describing my preferred fix.

Here's the long version.  The obvious ordering of bringing up the card
is:

request_irq()
setup DMA resources
enable interrupts
start DMA engine

And the obvious ordering of shutting it down is:

stop DMA engine
disable interrupts
remove DMA resources
free_irq()

The problem with the above shutdown order is that we can receive an
interrupt in between stopping DMA and disabling interrupts, and the
tulip irq handler will reenable DMA =><= Boom!  The solution I prefer
is to make the irq handler aware of whether we are shutting down or
not, and not reenable DMA in that case.  However, it is a non-trivial
patch to get right, and I would rather have the bug fixed short-term
with the ordering Grant uses:

disable interrupts
free_irq()
stop rxtx
remove DMA resources

Make sense to everyone?  I'll redo the patch with the comment and get
Grant's sign-off.

-VAL
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-16 Thread Jeff Garzik

Grant Grundler wrote:

[ Jeff, apologies. I hit "reply" instead of "group reply" on previous email.
  I've added everyone back on the cc list.]

On Fri, Jun 16, 2006 at 11:30:32AM -0400, Jeff Garzik wrote:
...

Are you saying this sequence won't mask interrupts on tulip?

   /* Disable interrupts by clearing the interrupt mask. */
   iowrite32 (0x, ioaddr + CSR7);
   ioread32 (ioaddr + CSR7);   /* flush posted write */

It does not stop the generation of interrupt events.


This use of "interrupt events" is misleading.
The CPU does not sees these "interrupt events" once we mask interrupts.


The DMA engine is still running, packets are still being received.
The above code sequence does not change that.


I agree. And I'm asking why does anyone care?
We clean that up after IRQs are stopped from being delivered to the CPU.

...

Secondly, since you have ignored the two previous times I've asked,
I'll presume you agree that if firmware leaves it in this state
(pending, masked interrupts), that the driver has to (and does)
handle it.

There is no firmware involved here, at any level, after boot.


I agree.  What about at boot time?


We reset the chip each time we do an interface-up.


The needed task in the driver has been the same since this thread 
started:  (1) stop generating new work [stop DMA engine], and (2) 
quiesce the hardware.  And it must happen in that order.


No it doesn't. I've proven it works in the order I've proposed
on pretty damn anal HW.

Setting the interrupt mask register to zero doesn't stop new work from 
appearing.


I agree. It stops the "screaming interrupt" problem. The fact that we
are in "close" or "down" routine means the user already decided
they don't care if new packets do or do not arrive.

Unless you can point to a real user who is affected by
my proposed patch, I ask again patch v3 be accepted.


All users are affected.  There is still a race window due to calling 
free_irq() before stopping the DMA engine, you've simply minimized it.


I don't see why it's so difficult to see

(a) you are introducing a non-standard ordering, different from other 
net drivers
(b) you are introducing an ordering that is counter to how the hardware 
is designed
(c) if you follow the natural ordering, you are GUARANTEED not to have 
screaming interrupts, DMA still going across the PCI bus, and dozens of 
other details.


Rather than running through all of these details, and tweaking your 
patch for each of them (as you have done with V1->V2 and V2->V3), simply 
_guarantee_ that you will not have problems.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-16 Thread Grant Grundler

[ Jeff, apologies. I hit "reply" instead of "group reply" on previous email.
  I've added everyone back on the cc list.]

On Fri, Jun 16, 2006 at 11:30:32AM -0400, Jeff Garzik wrote:
...
> >Are you saying this sequence won't mask interrupts on tulip?
> >
> >/* Disable interrupts by clearing the interrupt mask. */
> >iowrite32 (0x, ioaddr + CSR7);
> >ioread32 (ioaddr + CSR7);   /* flush posted write */
> 
> It does not stop the generation of interrupt events.

This use of "interrupt events" is misleading.
The CPU does not sees these "interrupt events" once we mask interrupts.

> The DMA engine is still running, packets are still being received.
> The above code sequence does not change that.

I agree. And I'm asking why does anyone care?
We clean that up after IRQs are stopped from being delivered to the CPU.

...
> >Secondly, since you have ignored the two previous times I've asked,
> >I'll presume you agree that if firmware leaves it in this state
> >(pending, masked interrupts), that the driver has to (and does)
> >handle it.
> 
> There is no firmware involved here, at any level, after boot.

I agree.  What about at boot time?

> The needed task in the driver has been the same since this thread 
> started:  (1) stop generating new work [stop DMA engine], and (2) 
> quiesce the hardware.  And it must happen in that order.

No it doesn't. I've proven it works in the order I've proposed
on pretty damn anal HW.

> Setting the interrupt mask register to zero doesn't stop new work from 
> appearing.

I agree. It stops the "screaming interrupt" problem. The fact that we
are in "close" or "down" routine means the user already decided
they don't care if new packets do or do not arrive.

Unless you can point to a real user who is affected by
my proposed patch, I ask again patch v3 be accepted.

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-16 Thread Grant Grundler
On Fri, Jun 16, 2006 at 03:32:56AM -0400, Jeff Garzik wrote:
> >Correct. Before calling free_irq(), patch V3 masks interrupts on tulip
> >and guarantees the tulip will not generate new interrupts before that call.
> 
> Incorrect -- it does not guarantee that tulip will not generate new 
> interrupt events.

Are you saying the following sequence doesn't mask tulip interrupts?

/* Disable interrupts by clearing the interrupt mask. */
iowrite32 (0x, ioaddr + CSR7);
ioread32 (ioaddr + CSR7);   /* flush posted write */

Secondly, since you've ignored the two previous times I've asked,
I'll assume you agree tulip driver has to (and does) handle
the case of pending, masked interrupts at initialization because
firmware might leave it in that state.

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-16 Thread Jeff Garzik

Grant Grundler wrote:

On Thu, Jun 15, 2006 at 10:30:17PM +0200, Francois Romieu wrote:

Afaik free_irq() on a shared irq does not touch the hardware and
irqs are anything but synchronous. If free_irq() is issued before
the device is idle and its irq are not acked, it's wrong.


Correct. Before calling free_irq(), patch V3 masks interrupts on tulip
and guarantees the tulip will not generate new interrupts before that call.


Incorrect -- it does not guarantee that tulip will not generate new 
interrupt events.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-15 Thread Grant Grundler
On Thu, Jun 15, 2006 at 10:30:17PM +0200, Francois Romieu wrote:
> Afaik free_irq() on a shared irq does not touch the hardware and
> irqs are anything but synchronous. If free_irq() is issued before
> the device is idle and its irq are not acked, it's wrong.

Correct. Before calling free_irq(), patch V3 masks interrupts on tulip
and guarantees the tulip will not generate new interrupts before that call.

grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-15 Thread Francois Romieu
Grant Grundler <[EMAIL PROTECTED]> :
[shared irq]
> I thought we've worked through that already:
>   http://www.spinics.net/lists/netdev/msg05902.html
> 
> Patch v3 takes care of that problem.
> The first step in the sequence is to mask IRQs on the tulip.
> The "neighbor" device sharing the IRQ will not see any interrupts from
> the tulip after that.

Afaik free_irq() on a shared irq does not touch the hardware and
irqs are anything but synchronous. If free_irq() is issued before
the device is idle and its irq are not acked, it's wrong.

-- 
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Grant Grundler
On Wed, Jun 14, 2006 at 10:47:20PM +0200, Francois Romieu wrote:
> Grant Grundler <[EMAIL PROTECTED]> :
> [...]
> > I'm not keen on adding more code to tulip_interrupt() routine
> > for something that rarely happens (compared to IRQs) and is handled
> > outside the interrupt routine.  I'm pretty sure stopping interrupts
> > before stopping DMA is sufficient.
> > Can you show an example where it doesn't work?
> 
> Shared irq. 
> 
> The device has not quiesced, the kernel stop listening to it and the
> neighbor device receives a late interruption from the network device.

I thought we've worked through that already:
http://www.spinics.net/lists/netdev/msg05902.html

Patch v3 takes care of that problem.
The first step in the sequence is to mask IRQs on the tulip.
The "neighbor" device sharing the IRQ will not see any interrupts from
the tulip after that.

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Grant Grundler
On Wed, Jun 14, 2006 at 03:51:37PM -0400, Jeff Garzik wrote:
> You need to turn off the thing that generates work (DMA engine), before 
> turning off the thing that reaps work (irq handler).
...
> It should be completely obvious that the chip is still generating 
> work...

Yes, I agree it still generates work. ie we can still RX packets.
But those will get discarded anyway.
In other words, If work is generated and I don't know it and
don't care, was it really work? :)

>   You don't want to leave the hardware in a position where it has 
> unacknowledged events.

Ok. Let me repeat two questions I asked a while ago:
| Are you worried about a masked, pending interrupt causing
| problems when the driver is re-opened or resumed?

The answer to "Yes" it seems.
And we will disagree on that since I've proven it's not a problem.
And it can't be a problem anyone else has seen since it's been
this way for 5+ years.

| If firmware left the device in that state at boot time wouldn't
| the driver be required to handle it?

Can you comment on this please?

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Francois Romieu
Grant Grundler <[EMAIL PROTECTED]> :
[...]
> I'm not keen on adding more code to tulip_interrupt() routine
> for something that rarely happens (compared to IRQs) and is handled
> outside the interrupt routine.  I'm pretty sure stopping interrupts
> before stopping DMA is sufficient.
> Can you show an example where it doesn't work?

Shared irq. 

The device has not quiesced, the kernel stop listening to it and the
neighbor device receives a late interruption from the network device.

-- 
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Jeff Garzik

Grant Grundler wrote:

On Wed, Jun 14, 2006 at 11:03:48AM -0400, Jeff Garzik wrote:

Grant Grundler wrote:

Switching the order to be:
   tulip_stop_rxtx(tp);/* Stop DMA */
   free_irq (dev->irq, dev);   /* no more races after this */

still leaves us open to IRQs being delivered _after_ we've stopped DMA.

Correct.  And that is the preferred, natural, logical, obvious order:

1) Turn things off.
2) Wait for activity to cease.


Patch v3 does this in two stages:
1) turn off tulip interrupts
2) free_irq() calls syncronize_irq() to handle pending IRQs

then calls tulip_stop_rxtx() which:
1) tells tulip to stop DMA
2) poll until DMA completes

After this we can free remaining resources.


You need to turn off the thing that generates work (DMA engine), before 
turning off the thing that reaps work (irq handler).




That in turn allows the interrupt handler to re-enable DMA again.
Then that would be a problem to solve...  Some interrupt handlers will 
test netif_running() or a driver-specific shutting-down flag, 
specifically to avoid such behaviors.


I'm not keen on adding more code to tulip_interrupt() routine
for something that rarely happens (compared to IRQs) and is handled
outside the interrupt routine.  I'm pretty sure stopping interrupts
before stopping DMA is sufficient.
Can you show an example where it doesn't work?


It should be completely obvious that the chip is still generating 
work...  You don't want to leave the hardware in a position where it has 
unacknowledged events.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Grant Grundler
On Wed, Jun 14, 2006 at 11:03:48AM -0400, Jeff Garzik wrote:
> Grant Grundler wrote:
> >Switching the order to be:
> >tulip_stop_rxtx(tp);/* Stop DMA */
> >free_irq (dev->irq, dev);   /* no more races after this */
> >
> >still leaves us open to IRQs being delivered _after_ we've stopped DMA.
> 
> Correct.  And that is the preferred, natural, logical, obvious order:
> 
> 1) Turn things off.
> 2) Wait for activity to cease.

Patch v3 does this in two stages:
1) turn off tulip interrupts
2) free_irq() calls syncronize_irq() to handle pending IRQs

then calls tulip_stop_rxtx() which:
1) tells tulip to stop DMA
2) poll until DMA completes

After this we can free remaining resources.

> >That in turn allows the interrupt handler to re-enable DMA again.
> 
> Then that would be a problem to solve...  Some interrupt handlers will 
> test netif_running() or a driver-specific shutting-down flag, 
> specifically to avoid such behaviors.

I'm not keen on adding more code to tulip_interrupt() routine
for something that rarely happens (compared to IRQs) and is handled
outside the interrupt routine.  I'm pretty sure stopping interrupts
before stopping DMA is sufficient.
Can you show an example where it doesn't work?

This is important since I'm going to propose a new Documentation/pci.txt
based on this experience.

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Jeff Garzik

Grant Grundler wrote:

On Tue, Jun 13, 2006 at 08:33:22PM -0400, Jeff Garzik wrote:

Grant Grundler wrote:

o tulip_stop_rxtx() has to be called _after_ free_irq().
 ie. v2 patch didn't fix the original race condition
 and when under test, dies about as fast as the original code.
You made the race window smaller, but it's still there.  The chip's DMA 
engines should be stopped before you unregister the interrupt handler.


Switching the order to be:
tulip_stop_rxtx(tp);/* Stop DMA */
free_irq (dev->irq, dev);   /* no more races after this */

still leaves us open to IRQs being delivered _after_ we've stopped DMA.


Correct.  And that is the preferred, natural, logical, obvious order:

1) Turn things off.
2) Wait for activity to cease.



That in turn allows the interrupt handler to re-enable DMA again.


Then that would be a problem to solve...  Some interrupt handlers will 
test netif_running() or a driver-specific shutting-down flag, 
specifically to avoid such behaviors.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Grant Grundler
On Wed, Jun 14, 2006 at 09:05:06AM -0400, Kyle McMartin wrote:
> I think the correct sequence would be:
> 
>   reset tulip interrupt mask
>   flush posted write
> 
>   synchronize irq /* make sure we got 'em all */

>   tulip_stop_rxtx /* turn off dma */
>   free irq/* bye bye */
> 
> The synchronize irq guarantees we shouldn't see another irq
> generated by the card because it was held up somewhere.

Kyle,
syncronize_irq() only guarantees currently executing interrupt handler
completes before handing control back to the caller.
It does not guarantee IRQ signals still inflight are "flushed".
Remember that IRQ lines are a "sideband" signal and not subject
to PCI data ordering rules.

thanks,
grant
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-14 Thread Kyle McMartin
On Tue, Jun 13, 2006 at 10:44:12PM -0600, Grant Grundler wrote:
> On Tue, Jun 13, 2006 at 08:33:22PM -0400, Jeff Garzik wrote:
> > Grant Grundler wrote:
> > >o tulip_stop_rxtx() has to be called _after_ free_irq().
> > >  ie. v2 patch didn't fix the original race condition
> > >  and when under test, dies about as fast as the original code.
> > 
> > You made the race window smaller, but it's still there.  The chip's DMA 
> > engines should be stopped before you unregister the interrupt handler.
> 
> Switching the order to be:
> tulip_stop_rxtx(tp);/* Stop DMA */
> free_irq (dev->irq, dev);   /* no more races after this */
> 

I think the correct sequence would be:

reset tulip interrupt mask
flush posted write

synchronize irq /* make sure we got 'em all */
tulip_stop_rxtx /* turn off dma */
free irq/* bye bye */

The synchronize irq guarantees we shouldn't see another irq
generated by the card because it was held up somewhere.

Cheers,
Kyle M.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-13 Thread Grant Grundler
On Tue, Jun 13, 2006 at 08:33:22PM -0400, Jeff Garzik wrote:
> Grant Grundler wrote:
> >o tulip_stop_rxtx() has to be called _after_ free_irq().
> >  ie. v2 patch didn't fix the original race condition
> >  and when under test, dies about as fast as the original code.
> 
> You made the race window smaller, but it's still there.  The chip's DMA 
> engines should be stopped before you unregister the interrupt handler.

Switching the order to be:
tulip_stop_rxtx(tp);/* Stop DMA */
free_irq (dev->irq, dev);   /* no more races after this */

still leaves us open to IRQs being delivered _after_ we've stopped DMA.
That in turn allows the interrupt handler to re-enable DMA again.

Or are you worried about a masked, pending interrupt causing
problems when the driver is re-opened or resumed?
If firmware left the device in a that state at boot time wouldn't
the driver be required to handle it?

thanks,
grant



> 
>   Jeff
> 
> 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-13 Thread Jeff Garzik

Grant Grundler wrote:

o tulip_stop_rxtx() has to be called _after_ free_irq().
  ie. v2 patch didn't fix the original race condition
  and when under test, dies about as fast as the original code.


You made the race window smaller, but it's still there.  The chip's DMA 
engines should be stopped before you unregister the interrupt handler.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-13 Thread Valerie Henson
On Tue, Jun 13, 2006 at 05:55:31PM -0600, Grant Grundler wrote:
> On Thu, Jun 08, 2006 at 11:01:20AM -0600, Grant Grundler wrote:
> > Here is a new patch that moves free_irq() into tulip_down().
> > The resulting code is structured the same as cp_close().
> 
> Val,
> Two details are wrong in version 2 and are fixed in v3 (appended below):
> 
> o we don't need synchronize_irq() before calling free_irq().
>   (It should be removed from cp_close() too)
>   Thanks to willy for pointing me at kernel/irq/manage.c.
> 
> o tulip_stop_rxtx() has to be called _after_ free_irq().
>   ie. v2 patch didn't fix the original race condition
>   and when under test, dies about as fast as the original code.
> 
> Tested on rx4640 (HP IA64) for several hours.
> Please apply.

Thanks, I'll take a look next week (after the workshop and moving).

-VAL
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCHv3 2.6.17-rc5 tulip free_irq() called too late

2006-06-13 Thread Grant Grundler
On Thu, Jun 08, 2006 at 11:01:20AM -0600, Grant Grundler wrote:
> Here is a new patch that moves free_irq() into tulip_down().
> The resulting code is structured the same as cp_close().

Val,
Two details are wrong in version 2 and are fixed in v3 (appended below):

o we don't need synchronize_irq() before calling free_irq().
  (It should be removed from cp_close() too)
  Thanks to willy for pointing me at kernel/irq/manage.c.

o tulip_stop_rxtx() has to be called _after_ free_irq().
  ie. v2 patch didn't fix the original race condition
  and when under test, dies about as fast as the original code.

Tested on rx4640 (HP IA64) for several hours.
Please apply.

thanks,
grant

Change Log:
IRQs are racing with tulip_down(). DMA can be restarted _after_
we call tulip_stop_rxtx() and the DMA buffers are unmapped.
The result is an MCA (hard crash on ia64) because of an
IO TLB miss.

Signed-off-by: Grant Grundler <[EMAIL PROTECTED]>

--- a/drivers/net/tulip/tulip_core.c
+++ b/drivers/net/tulip/tulip_core.c
@@ -18,11 +18,11 @@
 
 #define DRV_NAME   "tulip"
 #ifdef CONFIG_TULIP_NAPI
-#define DRV_VERSION"1.1.13-NAPI" /* Keep at least for test */
+#define DRV_VERSION"1.1.14-NAPI" /* Keep at least for test */
 #else
-#define DRV_VERSION"1.1.13"
+#define DRV_VERSION"1.1.14"
 #endif
-#define DRV_RELDATE"December 15, 2004"
+#define DRV_RELDATE"May 6, 2006"
 
 
 #include 
@@ -741,21 +741,20 @@ static void tulip_down (struct net_devic
 
/* Disable interrupts by clearing the interrupt mask. */
iowrite32 (0x, ioaddr + CSR7);
+   ioread32 (ioaddr + CSR7);   /* flush posted write */
 
-   /* Stop the Tx and Rx processes. */
-   tulip_stop_rxtx(tp);
+   spin_unlock_irqrestore (&tp->lock, flags);
 
-   /* prepare receive buffers */
-   tulip_refill_rx(dev);
+   free_irq (dev->irq, dev);   /* no more races after this */
+   tulip_stop_rxtx(tp);/* Stop DMA */
 
-   /* release any unconsumed transmit buffers */
-   tulip_clean_tx_ring(tp);
+   /* Put driver back into the state we start with */
+   tulip_refill_rx(dev);   /* prepare RX buffers */
+   tulip_clean_tx_ring(tp);/* clean up unsent TX buffers */
 
if (ioread32 (ioaddr + CSR6) != 0x)
tp->stats.rx_missed_errors += ioread32 (ioaddr + CSR8) & 0x;
 
-   spin_unlock_irqrestore (&tp->lock, flags);
-
init_timer(&tp->timer);
tp->timer.data = (unsigned long)dev;
tp->timer.function = tulip_tbl[tp->chip_id].media_timer;
@@ -774,7 +773,6 @@ static int tulip_close (struct net_devic
printk (KERN_DEBUG "%s: Shutting down ethercard, status was 
%2.2x.\n",
dev->name, ioread32 (ioaddr + CSR5));
 
-   free_irq (dev->irq, dev);
 
/* Free all the skbuffs in the Rx queue. */
for (i = 0; i < RX_RING_SIZE; i++) {
@@ -1752,7 +1752,6 @@ static int tulip_suspend (struct pci_dev
tulip_down(dev);
 
netif_device_detach(dev);
-   free_irq(dev->irq, dev);
 
pci_save_state(pdev);
pci_disable_device(pdev);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html