Re: [PATCH] synchronize_irq needs a barrier
.../... This patch (mostly written by Linus) fixes this by using spin locks instead of memory barries on the synchronize_irq() path. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Good for me. Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] Hrm... not on yet. Herbert, care to resend, looks like it fell down the wrong hole in Linus mailbox :-) Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Sat, 2007-10-20 at 08:06 +0200, Maxim Levitsky wrote: /* Disable interrupts, DMA, and rest of the chip*/ saa_writel(SAA7134_IRQ1, 0); saa_writel(SAA7134_IRQ2, 0); saa_writel(SAA7134_MAIN_CTRL, 0); dev-insuspend = 1; synchronize_irq(pci_dev-irq); /* ACK pending interrupts just in case*/ saa_writel(SAA7134_IRQ_REPORT,saa_readl(SAA7134_IRQ_REPORT)); .. This should be bullet-proof. Hopefully :-) Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thursday 18 October 2007 03:25:42 Benjamin Herrenschmidt wrote: synchronize_irq needs at the very least a compiler barrier and a read barrier on SMP, but there are enough cases around where a write barrier is also needed and it's not a hot path so I prefer using a full smp_mb() here. It will degrade to a compiler barrier on !SMP. Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Index: linux-work/kernel/irq/manage.c === --- linux-work.orig/kernel/irq/manage.c 2007-10-18 11:22:16.0 +1000 +++ linux-work/kernel/irq/manage.c2007-10-18 11:22:20.0 +1000 @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq) if (irq = NR_IRQS) return; + smp_mb(); while (desc-status IRQ_INPROGRESS) cpu_relax(); } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ Hi, I have read this thread and I concluded few things: 1) It is impossible to know that the card won't send more interrupts: Even if I do a read from the device, the IRQ can be pending in the bus/APIC It is even possible (and likely) that the IRQ line will be shared, thus the handler can be called by non-relevant device. 2) the synchronize_irq(); in .suspend is useless: an IRQ can happen immediately after this synchronize_irq(); and interrupt even the .suspend() (At least theoretically) Thus I now understand that .suspend() should do: saa_writel(SAA7134_IRQ1, 0); saa_writel(SAA7134_IRQ2, 0); saa_writel(SAA7134_MAIN_CTRL, 0); dev-insuspend = 1; smp_wmb(); /* at that point the _request to disable card's IRQs was issued, we don't know that there will be no irqs anymore. the smp_mb(); guaranties that the IRQ handler will bail out in that case. */ /* ...*/ pci_save_state(pci_dev); pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); return 0; and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Am I right? Best regards, Maxim Levitsky ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Saturday 20 October 2007 04:25:34 Linus Torvalds wrote: On Sat, 20 Oct 2007, Maxim Levitsky wrote: and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Something like that can work, yes. However, you need to make sure that: - even when you ignore the interrupt (because the driver doesn't care, it's suspending), you need to make sure the hardware gets shut up by reading (or writing) the proper interrupt status register. I agree, but while device is powered off, its registers can't be accessed Thus, if I ack the IRQ every time the handler is called, I will access the powered off device (this is probably won't hurt a lot, but a bit incorrectly) Otherwise, with a level interrupt, the interrupt will continue to be held active (screaming) and the CPU will get interrupted over and over again, until the irq subsystem will just turn the irq off entirely. - when you resume, make sure that you get the engine going again, with the understanding that some interrupts may have gotten ignored. Here it isn't a problem, this is a video capture card, and I suspend it by just stopping dma on all active buffers even if in the middle of capture, and I send those buffers to card again to fill them from the beginning during the resume. Also, in general, these kinds of things shouldn't always even be neicessary. If you use the suspend_late()/resume_early() hooks, those will be called with interrupts off, and while they can be harder to debug (and may be worth avoiding for non-critical drivers), they do allow for simpler models partly exactly because they don't need to worry about interrupts etc. Exactly, I am aware of suspend_late , but I don't want to abuse it. Linus Best regards, Maxim Levitsky ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Sat, Oct 20, 2007 at 02:02:42AM +, Maxim Levitsky wrote: Thus I now understand that .suspend() should do: saa_writel(SAA7134_IRQ1, 0); saa_writel(SAA7134_IRQ2, 0); saa_writel(SAA7134_MAIN_CTRL, 0); dev-insuspend = 1; smp_wmb(); If we patch synchronize_irq() as discussed here then you can use it in place of smp_wmb() and remove the smp_rmb from the interrupt handler (the latter is the path that matters). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Sat, 20 Oct 2007, Maxim Levitsky wrote: and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Something like that can work, yes. However, you need to make sure that: - even when you ignore the interrupt (because the driver doesn't care, it's suspending), you need to make sure the hardware gets shut up by reading (or writing) the proper interrupt status register. Otherwise, with a level interrupt, the interrupt will continue to be held active (screaming) and the CPU will get interrupted over and over again, until the irq subsystem will just turn the irq off entirely. - when you resume, make sure that you get the engine going again, with the understanding that some interrupts may have gotten ignored. Also, in general, these kinds of things shouldn't always even be neicessary. If you use the suspend_late()/resume_early() hooks, those will be called with interrupts off, and while they can be harder to debug (and may be worth avoiding for non-critical drivers), they do allow for simpler models partly exactly because they don't need to worry about interrupts etc. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
- even when you ignore the interrupt (because the driver doesn't care, it's suspending), you need to make sure the hardware gets shut up by reading (or writing) the proper interrupt status register. Otherwise, with a level interrupt, the interrupt will continue to be held active (screaming) and the CPU will get interrupted over and over again, until the irq subsystem will just turn the irq off entirely. His suspend routine wrote to the IRQ mask (or equivalent) register in his code example, thus the HW should shut up eventually, thus that isn't strictly necessary, the IRQ in that case is just a short interrupt (noticed by the PIC and delivered but possibly not asserted anymore at this stage or about to go down). In fact, he -must not- ack it. Because is the HW is really down (in D3), got knows what accessing the ACK register will do. I can give you ideas... from nothing on most x86 desktops to machine checks on most powerpc machines, though I could imagine some cards bad enough to lock your bus up if you try to access a register while they are in D3 (I've seen some of those critters and it seems not all bridges timeout on cards that just keep sending retries). Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
- even when you ignore the interrupt (because the driver doesn't care, it's suspending), you need to make sure the hardware gets shut up by reading (or writing) the proper interrupt status register. I agree, but while device is powered off, its registers can't be accessed Thus, if I ack the IRQ every time the handler is called, I will access the powered off device (this is probably won't hurt a lot, but a bit incorrectly) It will actually crash your machine on some platforms. So no, best is to -not- ack. The masking is enough, the IRQ will go down eventually. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 2007-10-19 at 19:25 -0700, Linus Torvalds wrote: On Sat, 20 Oct 2007, Maxim Levitsky wrote: and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Something like that can work, yes. However, you need to make sure that: - even when you ignore the interrupt (because the driver doesn't care, it's suspending), you need to make sure the hardware gets shut up by reading (or writing) the proper interrupt status register. Otherwise, with a level interrupt, the interrupt will continue to be held active (screaming) and the CPU will get interrupted over and over again, until the irq subsystem will just turn the irq off entirely. His suspend routine wrote to the IRQ mask (or equivalent) register in his code example, thus the HW should shut up eventually, thus that isn't strictly necessary, the IRQ in that case is just a short interrupt (noticed by the PIC and delivered but possibly not asserted anymore at this stage or about to go down). We have many scenario generating such short interrupts (pretty much any time we use interrupt disable/enable on the chip, for example it's common with NAPI). This is one of the reasons why we used to have a timebomb causing an IRQ to erroneously get disabled by the IRQ core assuming the IRQ was stuck, just because we accumulated too many short interrupts on that line. A recent patch by Alan fixed that to use some more sensible algorithm checking the time since the last spurrious interrupt, though I suspect he's being too optimistic on his timeout value and some scenario, such as NAPI with just the wrong packet rate, can still trigger the bug. I need to find a piece of HW slow enough in de-asserting the IRQ to try to make a repro-case one of these days. - when you resume, make sure that you get the engine going again, with the understanding that some interrupts may have gotten ignored. And you forgot that he still needs synchronize_irq(), or his IRQ handler might well be in the middle of doing something on another CPU, past the point where it tested insuspend, which will do no good when a simultaneous pci_set_power_state() puts the chip into D3. On some machines, this will machine check. Either that or he needs a spinlock in his IRQ handler that he also takes around the code that sets insuspend. Also, in general, these kinds of things shouldn't always even be neicessary. If you use the suspend_late()/resume_early() hooks, those will be called with interrupts off, and while they can be harder to debug (and may be worth avoiding for non-critical drivers), they do allow for simpler models partly exactly because they don't need to worry about interrupts etc. Yes, this is a easier way to deal with devices that are on a directly mapped bus (path to them isn't gone by the time you hit suspend_late) and don't need to do fancy things involing waiting for a queue to drain using interrupts etc... (at one stage of HW complexity, having to re-implement polled versions of everything is just not worth the benefit). In general, suspend_late should cover the need of most PCI devices I suppose. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
I have read this thread and I concluded few things: 1) It is impossible to know that the card won't send more interrupts: Even if I do a read from the device, the IRQ can be pending in the bus/APIC It is even possible (and likely) that the IRQ line will be shared, thus the handler can be called by non-relevant device. 2) the synchronize_irq(); in .suspend is useless: an IRQ can happen immediately after this synchronize_irq(); and interrupt even the .suspend() (At least theoretically) It's not totally useless not. It guarantees that by the time your are out of your suspend(), a simultaneous instance of the IRQ handler is either finished, or it (or any subsequent occurence) have seen your insuspend flag set to 1. Thus I now understand that .suspend() should do: saa_writel(SAA7134_IRQ1, 0); saa_writel(SAA7134_IRQ2, 0); saa_writel(SAA7134_MAIN_CTRL, 0); dev-insuspend = 1; smp_wmb(); /* at that point the _request to disable card's IRQs was issued, we don't know that there will be no irqs anymore. the smp_mb(); guaranties that the IRQ handler will bail out in that case. */ /* ...*/ pci_save_state(pci_dev); pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); return 0; The above doesn't handle the case where the IRQ handle just passed the if (insuspend) test. You may end up calling pci_set_power_state() while in the middle of some further HW accesses in the handler. You still need synchronize_irq() for that reason. Or use a spinlock. and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Am I right? Not quite :-) Also not that the work we are doing with synchronize_irq, if it gets merged, renders it unnecessary for you to add those two memory barriers, synchronize_irq will pretty much do the ordering for you. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Saturday 20 October 2007 05:56:01 Benjamin Herrenschmidt wrote: I have read this thread and I concluded few things: 1) It is impossible to know that the card won't send more interrupts: Even if I do a read from the device, the IRQ can be pending in the bus/APIC It is even possible (and likely) that the IRQ line will be shared, thus the handler can be called by non-relevant device. 2) the synchronize_irq(); in .suspend is useless: an IRQ can happen immediately after this synchronize_irq(); and interrupt even the .suspend() (At least theoretically) It's not totally useless not. It guarantees that by the time your are out of your suspend(), a simultaneous instance of the IRQ handler is either finished, or it (or any subsequent occurence) have seen your insuspend flag set to 1. Thus I now understand that .suspend() should do: saa_writel(SAA7134_IRQ1, 0); saa_writel(SAA7134_IRQ2, 0); saa_writel(SAA7134_MAIN_CTRL, 0); dev-insuspend = 1; smp_wmb(); /* at that point the _request to disable card's IRQs was issued, we don't know that there will be no irqs anymore. the smp_mb(); guaranties that the IRQ handler will bail out in that case. */ /* ...*/ pci_save_state(pci_dev); pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); return 0; The above doesn't handle the case where the IRQ handle just passed the if (insuspend) test. You may end up calling pci_set_power_state() while in the middle of some further HW accesses in the handler. You still need synchronize_irq() for that reason. Or use a spinlock. and the interrupt handler: smp_rmb(); if (dev-insuspend) goto out; Am I right? Not quite :-) Also not that the work we are doing with synchronize_irq, if it gets merged, renders it unnecessary for you to add those two memory barriers, synchronize_irq will pretty much do the ordering for you. Cheers, Ben. Fine, I Agree, and this is why I used it in first place, I just forgot. To wait for already running IRQ handler, that tested the dev-insuspend. (And I assumed that interrupts are disabled on the cpu that runs .suspend, but I know now that this isn't true.) Besides that can I ask few more .suspend/resume questions: 1) some drivers use pci_disable_device(), and pci_enable_device(). should I use it too? 2) I accidentally did this: pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); pci_save_state(pci_dev); I somehow thought that this is correct, that I should save the pci config state after the power-down, but now I know that it isn't correct. I now need to send a patch for dmfe.c network driver that has the same commands written by me. (but it works perfectly anyway) Is it possible to access pci configuration space in D3? And lastly speaking of network drivers, one issue came to my mind: most network drivars has a packet queue and in case of dmfe it is located in main memory, and card does dma from it. in .suspend I ignore that some packets may be in that queue, and I want to ask, whenever there are better ways to do that. this is my dmfe .suspend routine. /* Disable upper layer interface */ netif_device_detach(dev); /* Disable Tx/Rx */ db-cr6_data = ~(CR6_RXSC | CR6_TXSC); update_cr6(db-cr6_data, dev-base_addr); /* Disable Interrupt */ outl(0, dev-base_addr + DCR7); outl(inl (dev-base_addr + DCR5), dev-base_addr + DCR5); /* Fre RX buffers */ dmfe_free_rxbuffer(db); /* Enable WOL */ pci_read_config_dword(pci_dev, 0x40, tmp); tmp = ~(DMFE_WOL_LINKCHANGE|DMFE_WOL_MAGICPACKET); if (db-wol_mode WAKE_PHY) tmp |= DMFE_WOL_LINKCHANGE; if (db-wol_mode WAKE_MAGIC) tmp |= DMFE_WOL_MAGICPACKET; pci_write_config_dword(pci_dev, 0x40, tmp); pci_enable_wake(pci_dev, PCI_D3hot, 1); pci_enable_wake(pci_dev, PCI_D3cold, 1); /* Power down device*/ pci_set_power_state(pci_dev, pci_choose_state (pci_dev,state)); pci_save_state(pci_dev); I guess, everybody makes mistakes... :-) Other network drivers has a bit more complicated .suspend/.resume routines, but I didn't see a driver waiting for output queue to finish Best regards, Maxim Levitsky ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
1) some drivers use pci_disable_device(), and pci_enable_device(). should I use it too? I generally don't do the former, and I would expect the late to be done by pci_restore_state() for you. pci_disable_device(), last I looked, only cleared the bus master bit though, which might be a good idea to do. People in ACPI/x86 land, are there more good reasons to do one or the other ? That reminds me that I volunteered to write a documentation on how drivers should do all that stuff at KS and didn't get to actually doing it yet. shame ... I'll try to start something asap. 2) I accidentally did this: pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); pci_save_state(pci_dev); I somehow thought that this is correct, that I should save the pci config state after the power-down, but now I know that it isn't correct. Right, you need to do the save_state before the power down. You need to avoid pretty much any access to the device after the save state other than the pending set_power_state on resume. I now need to send a patch for dmfe.c network driver that has the same commands written by me. (but it works perfectly anyway) On x86 desktop... might have surprises on others. Is it possible to access pci configuration space in D3? It's only really safe to access the PM register itself, though I suppose you should be able to walk the capability chain to do that. But I wouldnt recommend doing anything else. And lastly speaking of network drivers, one issue came to my mind: most network drivars has a packet queue and in case of dmfe it is located in main memory, and card does dma from it. Note that the network stack nowadays does a fair bit of cleaning up for you before your suspend routine is called in .suspend I ignore that some packets may be in that queue, and I want to ask, whenever there are better ways to do that. this is my dmfe .suspend routine. /* Disable upper layer interface */ netif_device_detach(dev); The above -might- not be needed any more, I have to check. I used to do it too on my drivers. /* Disable Tx/Rx */ db-cr6_data = ~(CR6_RXSC | CR6_TXSC); update_cr6(db-cr6_data, dev-base_addr); /* Disable Interrupt */ outl(0, dev-base_addr + DCR7); outl(inl (dev-base_addr + DCR5), dev-base_addr + DCR5); /* Fre RX buffers */ dmfe_free_rxbuffer(db); /* Enable WOL */ pci_read_config_dword(pci_dev, 0x40, tmp); tmp = ~(DMFE_WOL_LINKCHANGE|DMFE_WOL_MAGICPACKET); if (db-wol_mode WAKE_PHY) tmp |= DMFE_WOL_LINKCHANGE; if (db-wol_mode WAKE_MAGIC) tmp |= DMFE_WOL_MAGICPACKET; pci_write_config_dword(pci_dev, 0x40, tmp); pci_enable_wake(pci_dev, PCI_D3hot, 1); pci_enable_wake(pci_dev, PCI_D3cold, 1); /* Power down device*/ pci_set_power_state(pci_dev, pci_choose_state (pci_dev,state)); pci_save_state(pci_dev); Looks allright on a quick glance appart from the bits we already discussed. I guess, everybody makes mistakes... :-) Other network drivers has a bit more complicated .suspend/.resume routines, but I didn't see a driver waiting for output queue to finish I think the network stack does that nowadays but we'll have to double check, that's based on what DaveM told me at KS. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later, I think I am overestimating this race, since most drivers don't do dev-insuspend checks in IRQ handler. Maybe even just use free_irq() after all Most drivers are probably underestimating the race :-) free_irq() would work provided that you did the masking on chip before (and unmask only after request_irq on the way back in). But it's a bit like using a 10 tons truck to crush an ant... Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Saturday 20 October 2007 07:04:35 Benjamin Herrenschmidt wrote: 1) some drivers use pci_disable_device(), and pci_enable_device(). should I use it too? I generally don't do the former, and I would expect the late to be done by pci_restore_state() for you. pci_disable_device(), last I looked, only cleared the bus master bit though, which might be a good idea to do. People in ACPI/x86 land, are there more good reasons to do one or the other ? That reminds me that I volunteered to write a documentation on how drivers should do all that stuff at KS and didn't get to actually doing it yet. shame ... I'll try to start something asap. 2) I accidentally did this: pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state)); pci_save_state(pci_dev); I somehow thought that this is correct, that I should save the pci config state after the power-down, but now I know that it isn't correct. Right, you need to do the save_state before the power down. You need to avoid pretty much any access to the device after the save state other than the pending set_power_state on resume. I now need to send a patch for dmfe.c network driver that has the same commands written by me. (but it works perfectly anyway) On x86 desktop... might have surprises on others. Is it possible to access pci configuration space in D3? It's only really safe to access the PM register itself, though I suppose you should be able to walk the capability chain to do that. But I wouldnt recommend doing anything else. And lastly speaking of network drivers, one issue came to my mind: most network drivars has a packet queue and in case of dmfe it is located in main memory, and card does dma from it. Note that the network stack nowadays does a fair bit of cleaning up for you before your suspend routine is called in .suspend I ignore that some packets may be in that queue, and I want to ask, whenever there are better ways to do that. this is my dmfe .suspend routine. /* Disable upper layer interface */ netif_device_detach(dev); Looks allright on a quick glance appart from the bits we already discussed. I guess, everybody makes mistakes... :-) Other network drivers has a bit more complicated .suspend/.resume routines, but I didn't see a driver waiting for output queue to finish I think the network stack does that nowadays but we'll have to double check, that's based on what DaveM told me at KS. Ben. Hi, Thanks a lot. I fix the order of calls in dmfe.c and in saa7134-core.c. I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later, I think I am overestimating this race, since most drivers don't do dev-insuspend checks in IRQ handler. Maybe even just use free_irq() after all Best regards, Maxim Levitsky ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Note that some kind of read barrier or compiler barrier should be needed regardless, or we are just not sync'ing with anything at all (we may have loaded the value ages ago and thus operate on a totally stale value). I prefer a full barrier to also ensure all previous stores are pushed out. We already have a compiler barrier there in the form of cpu_relax. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Take a real life example: drivers/message/fusion/mptbase.c /* Disable interrupts! */ CHIPREG_WRITE32(ioc-chip-IntMask, 0x); ioc-active = 0; synchronize_irq(pdev-irq); And we aren't in a spinlock here. That's just a random example grepped I think I see a few more. Then, some drivers like tg3 actually do an smp_mb() before calling synchronize_irq(). But then, some don't. I really don't see what the point of the barrier would be here. Barriers are generally useless unless you have a counter-part on the other side. The counterpart here is presumably the interrupt handler, which should be terminated by the IO write above regardless of the memory barrier. Of course I might have missed your point. If so please give a description like this for the race that you see: CPU1CPU2 disable IRQ whatever the race is synchronize_irq I think trying to have all drivers be correct here is asking for trouble, we'd rather have synchronize_irq() be uber-safe. It's not like it was used in hot path anyway. While in general I'd agree with you about give latitude to drivers, memory barriers I think is something that we can't afford to. The reason is that memory barries tend to come in pairs, e.g., CPU1CPU2 write A wmb write B read B rmb read A Taking away either barrier would render the other useless. So whenever we add only one barrier for the benefit of driver writers who don't bother to think about barriers we may not be helping them at all. In any case, such writers should use easier tools like spin locks rather than trying to go lockless. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 2007-10-18 at 22:35 +0800, Herbert Xu wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Note that some kind of read barrier or compiler barrier should be needed regardless, or we are just not sync'ing with anything at all (we may have loaded the value ages ago and thus operate on a totally stale value). I prefer a full barrier to also ensure all previous stores are pushed out. We already have a compiler barrier there in the form of cpu_relax. Isn't it too late ? The barrier should be before the test_bit, to prevent it from moving up. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 2007-10-18 at 22:56 +0800, Herbert Xu wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Take a real life example: drivers/message/fusion/mptbase.c /* Disable interrupts! */ CHIPREG_WRITE32(ioc-chip-IntMask, 0x); ioc-active = 0; synchronize_irq(pdev-irq); And we aren't in a spinlock here. That's just a random example grepped I think I see a few more. Then, some drivers like tg3 actually do an smp_mb() before calling synchronize_irq(). But then, some don't. I really don't see what the point of the barrier would be here. Barriers are generally useless unless you have a counter-part on the other side. The barrier would guarantee that ioc-active (and in fact the write to the chip too above) are globally visible before the read of the irq status. There are two different issues not to mixup here. Any previous store vs. a read (general case) and MMIO store of IRQ mask which has issues of it's own. The counterpart here is presumably the interrupt handler, which should be terminated by the IO write above regardless of the memory barrier. Of course I might have missed your point. If so please give a description like this for the race that you see: CPU1 CPU2 disable IRQ whatever the race is synchronize_irq Let's ignore for a moment the generic problem of loads crossing previous stores and focus on the pure issue of using MMIO writes to mask interrupts. Writing to a chip to disable an IRQ is generally never going to provide any synchronisation guarantee. The MMIO write itself is asynchronous and not ordered vs. a subsequent read from main storage (ie memory). So unless you do an MMIO read to flush it, you basically haven't yet disabled your IRQ don't know when it will be. That's one thing. Another one is that even if you do an MMIO read to flush, your IRQ may already have been latched by your PIC, or may be an MSI already issued and still travelling along busses, and thus might well occur in a few instructions. In that later case, note that ignoring it is perfectly fine since the IRQ line will eventually go down (for a level IRQ that is, for an edge IRQ, ignoring is always fine). That's what we cause short interrupts, they can be common, though linux has historical issues with them (unrelated to this discussion). But your interrupt handler _will_ be called, and thus should be aware that your intend is to ignore it. So for both of the reasons above, the MMIO write doesn't mean you IRQ won't happen and in fact, synchronize_irq() here is totally useless since it won't prevent the IRQ from actually still happening just after the test_bit of IRQ_INPROGRESS. Now, the way to actually do that properly is to _also_ have a flag to indicate your handler you don't want to deal with that interrupt. That could be something along the lines of: writel(irq mask); wmb(); chip-masked = 1; smp_mb(); synchronize_irq(); Now, the IRQ handler can just do if (chip-masked == 1) return whatever; Another way is to use spinlocks of course, but we are talking about high perf stuff that tries to avoid spinlocks on every IRQs, which is fair enough, thus putting the synchronization burden on the slow path. In the above example, you need that smp_mb() to make sure that the effect of chip-masked = 1 is globally visible before the read in synchronize_irq() is performed. Without that, that read could cross the previous write, in fact, it may even travel all the way to before the MMIO in some cases, and thus cause synchronize_irq() to operate on a completely stale version of irq_desc-status, thus possibly bailing out early while the IRQ is still happening and chip-masked not yet visible to the other CPUs. In general, synchronize_irq() alone is always bogus, because that read can travel up. That's why I believe it would simply make things simpler and avoid problems to put the smp_mb(); inside synchronize_irq() (and same goes with napi_synchronize() for the exact same reasons). While in general I'd agree with you about give latitude to drivers, memory barriers I think is something that we can't afford to. The reason is that memory barries tend to come in pairs, e.g., CPU1 CPU2 write A wmb write B read B rmb read A Taking away either barrier would render the other useless. So whenever we add only one barrier for the benefit of driver writers who don't bother to think about barriers we may not be helping them at all. In the case of synchronize_irq and my above example, yes, indeed, there should be a pending barrier between setting IRQ_INPROGRESS and the reading of chip-masked by the driver. This is a very good point because I incorrectly assumed that the spin_unlock inside handle_*_irq before calling the handler would do it, but
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 2007-10-19 at 12:48 +0800, Herbert Xu wrote: [IRQ]: Fix synchronize_irq races with IRQ handler As it is some callers of synchronize_irq rely on memory barriers to provide synchronisation against the IRQ handlers. For example, the tg3 driver does tp-irq_sync = 1; smp_mb(); synchronize_irq(); and then in the IRQ handler: if (!tp-irq_sync) netif_rx_schedule(dev, tp-napi); Unfortunately memory barriers only work well when they come in pairs. Because we don't actually have memory barriers on the IRQ path, the memory barrier before the synchronize_irq() doesn't actually protect us. In particular, synchronize_irq() may return followed by the result of netif_rx_schedule being made visible. This patch (mostly written by Linus) fixes this by using spin locks instead of memory barries on the synchronize_irq() path. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Good for me. Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
What may happen is that action can either float upwards to give spin_lock action set IRQ_INPROGRESS spin_unlock spin_lock clear IRQ_INPROGRESS spin_unlock or it can float downwards to give spin_lock set IRQ_INPROGRESS spin_unlock spin_lock clear IRQ_INPROGRESS action spin_unlock Well, we are generally safer here. That is, unless action is a store, it will not pass spin_lock, at least not on powerpc afaik. In fact, if action, as it is in our case, is something like if (foo) return; We cant execute the store inside spin_lock() without having loaded foo, there is an implicit dependency here. But anyway, Linus patch fixes that too if it was a problem. Now if we grep for while (test_bit and while (!test_bit I'm sure we'll find other similar surprises. That's also one of the reasons why I _love_ nick patches that add a proper lock/unlock API on bits, because at least those who are doing those hand-made pseudo-locks with bitops to save space will be getting a proper lock/unlock API, no more excuse. The network stack is doing more fancy things so it's harder (though I wonder sometimes if it's really worth the risks taken for not using spinlocks... maybe). Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 2007-10-19 at 12:20 +0800, Herbert Xu wrote: That's why I think this patch is in fact the only one that solves all the races in this thread. The case that it solves which the lock/unlock patch does not is the one where action flows downwards past the clearing of IRQ_INPROGRESS. I missed this case earlier. In fact this bug exists elsewhere too. For example, the network stack does this in net/sched/sch_generic.c: /* Wait for outstanding qdisc_run calls. */ while (test_bit(__LINK_STATE_QDISC_RUNNING, dev-state)) yield(); This has the same problem as the current synchronize_irq code. The network stack always made be nervous with it's bitops use as locks. Any loop of test_bit() alone as a synchronisation method, without locks or barriers is, imho, inherently buggy. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
The whole lock/set IRQ_INPROGRESS/unlock path can then only happen before the locked section above, in which case we see and wait nicely and all is good, or after, in which case the store to foo will be visible to the IRQ handler as it will be ordered with the unlock in the code above. Note that napi_synchronize needs a slightly different treatement. Here, the situation boils down to: one CPU does: foo = 1; while(test_bit(bar)) barrier(); and the other: if (!test_and_set_bit(bar)) { read use foo smp_mb__before_clear_bit(); clear_bit(bar); } The good thing here is that read use foo is part of the critical section (I hate hand-made locks ...) defined by bar which makes things somewhat easier than the synchronize_irq() case. I think a simple smp_mb(); here after foo = 1; is enough, which means basically just having an smp_mp(); inside napi_synchronize(), before the test_bit(). Or do I miss something ? Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, Oct 18, 2007 at 08:26:45PM -0700, Linus Torvalds wrote: On Thu, 18 Oct 2007, Linus Torvalds wrote: I *think* it should work with something like for (;;) { smp_rmb(); if (!spin_is_locked(desc-lock)) { smp_rmb(); if (!(desc-status IRQ_INPROGRESS) break; } cpu_relax(); } I'm starting to doubt this. One of the issues is that we still need the smp_mb() in front of the loop (because we want to serialize the loop with any writes in the caller). Right. I think if we accept the definition of a spin lock/unlock that Nick outlined earlier, then we can see that on the IRQ path there simply isn't a memory barrier between the changing of the status field and the execution of the action: spin_lock set IRQ_INPROGRESS spin_unlock action spin_lock clear IRQ_INPROGRESS spin_unlock What may happen is that action can either float upwards to give spin_lock action set IRQ_INPROGRESS spin_unlock spin_lock clear IRQ_INPROGRESS spin_unlock or it can float downwards to give spin_lock set IRQ_INPROGRESS spin_unlock spin_lock clear IRQ_INPROGRESS action spin_unlock As a result we can add as many barriers as we want on the slow (synchronize_irq) path and it just won't do any good (not unless we add some barriers on the IRQ path == fast path). What we do have on the right though is the fact in some ways action behaves as if it's inside the spin lock even though it's not. In particular, action cannot float up past the first spin lock nor can it float down past the last spin unlock. See commit fa490cfd15d7ce0900097cc4e60cfd7a76381138 and ponder. Maybe we should take the same approach here, and do something like repeat: /* Optimistic, no-locking loop */ while (desc-status IRQ_INPROGRESS) cpu_relax(); /* Ok, that indicated we're done: double-check carefully */ spin_lock_irqsave(desc-lock, flags); status = desc-status; spin_unlock_irqrestore(desc-lock, flags); /* Oops, that failed? */ if (status IRQ_INPROGRESS) goto repeat; That's why I think this patch is in fact the only one that solves all the races in this thread. The case that it solves which the lock/unlock patch does not is the one where action flows downwards past the clearing of IRQ_INPROGRESS. I missed this case earlier. In fact this bug exists elsewhere too. For example, the network stack does this in net/sched/sch_generic.c: /* Wait for outstanding qdisc_run calls. */ while (test_bit(__LINK_STATE_QDISC_RUNNING, dev-state)) yield(); This has the same problem as the current synchronize_irq code. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
repeat: /* Optimistic, no-locking loop */ while (desc-status IRQ_INPROGRESS) cpu_relax(); /* Ok, that indicated we're done: double-check carefully */ spin_lock_irqsave(desc-lock, flags); status = desc-status; spin_unlock_irqrestore(desc-lock, flags); /* Oops, that failed? */ if (status IRQ_INPROGRESS) goto repeat; Hmm? Paulus and I convinced ourselves that this would work. If we call our variable that gets set before synchronize_irq and read in the IRQ handler foo, we get to: - writing foo can travel down at most to just before the unlock in the code above - reading foo can travel up out of the IRQ handler at most just after the lock in the code that sets IRQ_INPROGRESS. The whole lock/set IRQ_INPROGRESS/unlock path can then only happen before the locked section above, in which case we see and wait nicely and all is good, or after, in which case the store to foo will be visible to the IRQ handler as it will be ordered with the unlock in the code above. Pfiew ! So Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] Thanks ! Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
Nick Piggin [EMAIL PROTECTED] wrote: First of all let's agree on some basic assumptions: * A pair of spin lock/unlock subsumes the effect of a full mb. Not unless you mean a pair of spin lock/unlock as in 2 spin lock/unlock pairs (4 operations). *X = 10; spin_lock(lock); /* *Y speculatively loaded here */ /* store to *X leaves CPU store queue here */ spin_unlock(lock); y = *Y; Good point. Although in this case we're still safe because in the worst cases: CPU0CPU1 irq_sync = 1 synchronize_irq spin lock load IRQ_INPROGRESS irq_sync sync is visible spin unlock spin lock load irq_sync while (IRQ_INPROGRESS) wait return set IRQ_INPROGRESS spin unlock tg3_msi ack IRQ if (irq_sync) return spin lock clear IRQ_INPROGRESS spin unlock CPU0CPU1 spin lock load irq_sync irq_sync = 1 synchronize_irq set IRQ_INPROGRESS spin unlock spin lock load IRQ_INPROGRESS irq_sync sync is visible spin unlock while (IRQ_INPROGRESS) wait tg3_msi ack IRQ if (irq_sync) return do work spin lock clear IRQ_INPROGRESS spin unlock return So because we're using the same lock on both sides, it does do the right thing even without the memory barrier. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 18 Oct 2007, Linus Torvalds wrote: I *think* it should work with something like for (;;) { smp_rmb(); if (!spin_is_locked(desc-lock)) { smp_rmb(); if (!(desc-status IRQ_INPROGRESS) break; } cpu_relax(); } I'm starting to doubt this. One of the issues is that we still need the smp_mb() in front of the loop (because we want to serialize the loop with any writes in the caller). The other issue is that I don't think it's enough that we saw the descriptor lock unlocked, and then the IRQ_INPROGRESS bit clear. It might have been unlocked *while* the IRQ was in progress, but the interrupt handler is now in its last throes, and re-takes the spinlock and clears the IRQ_INPROGRESS thing. But we're not actually happy until we've seen the IRQ_INPROGRESS bit clear and the spinlock has been released *again*. So those two tests should actually be the other way around: we want to see the IRQ_INPROGRESS bit clear first. It's all just too damn subtle and clever. Something like this should not need to be that subtle. Maybe the rigth thing to do is to not rely on *any* ordering what-so-ever, and just make the rule be: if you look at the IRQ_INPROGRESS bit, you'd better hold the descriptor spinlock, and not have any subtle ordering issues at all. But that makes us have a loop with getting/releasing the lock all the time, and then we get back to horrid issues with cacheline bouncing and unfairness of cache accesses across cores (ie look at the issues we had with the runqueue starvation in wait_task_inactive()). Those were fixed by starting out with the non-locked and totally unsafe versions, but then having one last check with lock held, and repeat only if that says things went south. See commit fa490cfd15d7ce0900097cc4e60cfd7a76381138 and ponder. Maybe we should take the same approach here, and do something like repeat: /* Optimistic, no-locking loop */ while (desc-status IRQ_INPROGRESS) cpu_relax(); /* Ok, that indicated we're done: double-check carefully */ spin_lock_irqsave(desc-lock, flags); status = desc-status; spin_unlock_irqrestore(desc-lock, flags); /* Oops, that failed? */ if (status IRQ_INPROGRESS) goto repeat; Hmm? Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 19 Oct 2007, Herbert Xu wrote: In other words I think this patch is great :) Hey, I appreciate it, but I do think that the spinlock only to immediately unlock it again is pretty horrid. I'm convinced that there should be some way to do this without actually taking the lock. I *think* it should work with something like for (;;) { smp_rmb(); if (!spin_is_locked(desc-lock)) { smp_rmb(); if (!(desc-status IRQ_INPROGRESS) break; } cpu_relax(); } instead. Which basically requires that we see the descriptor lock being not held, before we see the IRQ_INPROGRESS bit being clear. Put another way: it loops until it sees the lock having been released, and the IRQ_INPROGRESS bit being clear after that. The above requires no serializing instructions on x86, which is a good goal (now that smp_rmb() is just a compiler barrier). And it doesn't actually have to bounce any cachelines. And it doesn't have that ugly get lock only to release it, which just makes me go Eww!. But it's a bit subtler. It basically depends on the fact that spin_unlock() obviously has to make sure that there is a release barrier in the unlock, so any writes done (to the IRQ_INPROGRESS bit) within the locked region *must* be visible before the spinlock itself has been released. So somebody should: - use another pair of eyes and brains to back me up on this. - write up some coherent changelog entry, using the emails that have passed back and forth. - actually turn the above into a tested patch with a comment. And I'm pushing for that somebody being somebody else than me ;) Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, Oct 18, 2007 at 04:39:59PM -0700, Linus Torvalds wrote: Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is the descriptor lock. And we don't have to (or even want to!) hold it while waiting for the thing, but we want to *have*held*it* in between whatever we're synchronizing with. Thinking about this again and checking code I have come to the conclusion that 1) There is a race (in some drivers) even with the full mb. 2) Linus's patch fixes it. 3) It's difficult to fix it elegantly in the driver. In other words I think this patch is great :) First of all let's agree on some basic assumptions: * A pair of spin lock/unlock subsumes the effect of a full mb. * A spin lock in general only equates to (SS/SL/LL). * A spin unlock in general only equates to (SS/LS). In particular, a load after a spin unlock may pass before the spin unlock. Here is the race (with tg3 as the only example that I know of). The driver attempts to quiesce interrupts such that after the call to synchronize_irq it is intended that no further IRQ handler calls for that device will do any work besides acking the IRQ. Here is how it does it: CPU0CPU1 spin lock load irq_sync irq_sync = 1 smp_mb synchronize_irq() while (IRQ_INPROGRESS) wait return set IRQ_INPROGRESS spin unlock tg3_msi ack IRQ if (irq_sync) return do work The problem here is that load of irq_sync in the handler has passed above the setting of IRQ_INPROGRESS. Linus's patch fixes it because this becomes: CPU0CPU1 spin lock load irq_sync irq_sync = 1 smp_mb synchronize_irq set IRQ_INPROGRESS spin unlock spin lock spin unlock tg3_msi ack IRQ if (irq_sync) return do work while (IRQ_INPROGRESS) wait spin lock clear IRQ_INPROGRESS spin unlock return Even though we still do the work on the right we will now notice the INPROGRESS flag on the left and wait. It's hard to fix this in the drivers because they'd either have to access the desc lock or add a full mb to the fast path on the right. Once this goes in we can also remove the smp_mb from tg3.c. BTW, a lot of drivers (including the fusion example Ben quoted) call synchronize_irq before free_irq. This is unnecessary because the latter already calls it anyway. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
So how about something like this (untested! not necessarily very well thought through either!) Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is the descriptor lock. And we don't have to (or even want to!) hold it while waiting for the thing, but we want to *have*held*it* in between whatever we're synchronizing with. The internal irq handler functions already held the lock when they did whatever they need to serialize - and they are possibly performance critical too - so they use the internal function that doesn't get the lock unnecessarily again. That may do the trick as the read can't cross the spin_lock (it can cross spin_unlock but not lock). Advantage over adding a barrier to handle_IRQ_event() is that it keeps the overhead to the slow path (synchronize_irq). Note that I didn't actually experience a problem here. I just came upon that by accident while thinking about a similar issue I have with napi_synchronize(). Looks good to me on a first glance (unfortunately a bit ugly but heh). Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote: I agree and you can see that in fact, we don't have enough barrier on the other side since spin_unlock doesn't prevent subsequent loads from crossing a previous store... I wonder if that's worth trying to address, adding a barrier in handle_IRQ_event for example, or we can continue ignoring the barrier and let some drivers do their own fixes in fancy ways. So how about something like this (untested! not necessarily very well thought through either!) Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is the descriptor lock. And we don't have to (or even want to!) hold it while waiting for the thing, but we want to *have*held*it* in between whatever we're synchronizing with. The internal irq handler functions already held the lock when they did whatever they need to serialize - and they are possibly performance critical too - so they use the internal function that doesn't get the lock unnecessarily again. Hmm? Linus --- kernel/irq/manage.c | 22 ++ 1 files changed, 18 insertions(+), 4 deletions(-) diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 80eab7a..f3e9575 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -14,6 +14,18 @@ #include internals.h +/* + * Internally wait for IRQ_INPROGRESS to go away on other CPU's, + * after having serialized with the descriptor lock. + */ +static inline void do_synchronize_irq(struct irq_desc *desc) +{ +#ifdef CONFIG_SMP + while (desc-status IRQ_INPROGRESS) + cpu_relax(); +#endif +} + #ifdef CONFIG_SMP /** @@ -28,13 +40,15 @@ */ void synchronize_irq(unsigned int irq) { + unsigned long flags; struct irq_desc *desc = irq_desc + irq; if (irq = NR_IRQS) return; - while (desc-status IRQ_INPROGRESS) - cpu_relax(); + spin_lock_irqsave(desc-lock, flags); + spin_unlock_irqrestore(desc-lock, flags); + do_synchronize_irq(desc); } EXPORT_SYMBOL(synchronize_irq); @@ -129,7 +143,7 @@ void disable_irq(unsigned int irq) disable_irq_nosync(irq); if (desc-action) - synchronize_irq(irq); + do_synchronize_irq(desc); } EXPORT_SYMBOL(disable_irq); @@ -443,7 +457,7 @@ void free_irq(unsigned int irq, void *dev_id) unregister_handler_proc(irq, action); /* Make sure it's not being used on another CPU */ - synchronize_irq(irq); + do_synchronize_irq(desc); #ifdef CONFIG_DEBUG_SHIRQ /* * It's a shared IRQ -- the driver ought to be ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 2007-10-18 at 15:52 -0700, Linus Torvalds wrote: On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote: The barrier would guarantee that ioc-active (and in fact the write to the chip too above) are globally visible No, it doesn't really guarantee that. The thing is, there is no such thing as globally visible. There is a ordering of visibility wrt CPU's, but it's not global, it's quite potentially per-CPU. So a barrier on one CPU doesn't guarantee anything at all without a barrier on the *other* CPU. That said, the interrupt handling itself contains various barriers on the CPU's that receive interrupts, thanks to the spinlocking. But I do agree with Herbert that adding a smb_mb() is certainly in no way obviously correct, because it doesn't talk about what the other side does wrt barriers and that word in memory. I agree and you can see that in fact, we don't have enough barrier on the other side since spin_unlock doesn't prevent subsequent loads from crossing a previous store... I wonder if that's worth trying to address, adding a barrier in handle_IRQ_event for example, or we can continue ignoring the barrier and let some drivers do their own fixes in fancy ways. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote: The barrier would guarantee that ioc-active (and in fact the write to the chip too above) are globally visible No, it doesn't really guarantee that. The thing is, there is no such thing as globally visible. There is a ordering of visibility wrt CPU's, but it's not global, it's quite potentially per-CPU. So a barrier on one CPU doesn't guarantee anything at all without a barrier on the *other* CPU. That said, the interrupt handling itself contains various barriers on the CPU's that receive interrupts, thanks to the spinlocking. But I do agree with Herbert that adding a smb_mb() is certainly in no way obviously correct, because it doesn't talk about what the other side does wrt barriers and that word in memory. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Friday 19 October 2007 12:32, Herbert Xu wrote: First of all let's agree on some basic assumptions: * A pair of spin lock/unlock subsumes the effect of a full mb. Not unless you mean a pair of spin lock/unlock as in 2 spin lock/unlock pairs (4 operations). *X = 10; spin_lock(lock); /* *Y speculatively loaded here */ /* store to *X leaves CPU store queue here */ spin_unlock(lock); y = *Y; * A spin lock in general only equates to (SS/SL/LL). * A spin unlock in general only equates to (SS/LS). I don't use the sparc barriers, so they don't come naturally to me ;) I think both loads and stores can pass into the critical section by having the spin_lock pass earlier ops, or by having spin_unlock be passed by later ones. In particular, a load after a spin unlock may pass before the spin unlock. Here is the race (with tg3 as the only example that I know of). The driver attempts to quiesce interrupts such that after the call to synchronize_irq it is intended that no further IRQ handler calls for that device will do any work besides acking the IRQ. Here is how it does it: CPU0 CPU1 spin lock load irq_sync irq_sync = 1 smp_mb synchronize_irq() while (IRQ_INPROGRESS) wait return set IRQ_INPROGRESS spin unlock tg3_msi ack IRQ if (irq_sync) return do work The problem here is that load of irq_sync in the handler has passed above the setting of IRQ_INPROGRESS. Linus's patch fixes it because this becomes: CPU0 CPU1 spin lock load irq_sync irq_sync = 1 smp_mb synchronize_irq set IRQ_INPROGRESS spin unlock spin lock spin unlock tg3_msi ack IRQ if (irq_sync) return do work while (IRQ_INPROGRESS) wait spin lock clear IRQ_INPROGRESS spin unlock return Even though we still do the work on the right we will now notice the INPROGRESS flag on the left and wait. It's hard to fix this in the drivers because they'd either have to access the desc lock or add a full mb to the fast path on the right. Once this goes in we can also remove the smp_mb from tg3.c. BTW, a lot of drivers (including the fusion example Ben quoted) call synchronize_irq before free_irq. This is unnecessary because the latter already calls it anyway. Cheers, ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Friday 19 October 2007 13:28, Herbert Xu wrote: Nick Piggin [EMAIL PROTECTED] wrote: First of all let's agree on some basic assumptions: * A pair of spin lock/unlock subsumes the effect of a full mb. Not unless you mean a pair of spin lock/unlock as in 2 spin lock/unlock pairs (4 operations). *X = 10; spin_lock(lock); /* *Y speculatively loaded here */ /* store to *X leaves CPU store queue here */ spin_unlock(lock); y = *Y; Good point. Although in this case we're still safe because in the worst cases: CPU0 CPU1 irq_sync = 1 synchronize_irq spin lock load IRQ_INPROGRESS irq_sync sync is visible spin unlock spin lock load irq_sync while (IRQ_INPROGRESS) wait return set IRQ_INPROGRESS spin unlock tg3_msi ack IRQ if (irq_sync) return spin lock clear IRQ_INPROGRESS spin unlock CPU0 CPU1 spin lock load irq_sync irq_sync = 1 synchronize_irq set IRQ_INPROGRESS spin unlock spin lock load IRQ_INPROGRESS irq_sync sync is visible spin unlock while (IRQ_INPROGRESS) wait tg3_msi ack IRQ if (irq_sync) return do work spin lock clear IRQ_INPROGRESS spin unlock return So because we're using the same lock on both sides, it does do the right thing even without the memory barrier. Yeah, if you've got the lock on both sides there, then I agree it will be correct. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] synchronize_irq needs a barrier
synchronize_irq needs at the very least a compiler barrier and a read barrier on SMP, but there are enough cases around where a write barrier is also needed and it's not a hot path so I prefer using a full smp_mb() here. It will degrade to a compiler barrier on !SMP. Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Index: linux-work/kernel/irq/manage.c === --- linux-work.orig/kernel/irq/manage.c 2007-10-18 11:22:16.0 +1000 +++ linux-work/kernel/irq/manage.c 2007-10-18 11:22:20.0 +1000 @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq) if (irq = NR_IRQS) return; + smp_mb(); while (desc-status IRQ_INPROGRESS) cpu_relax(); } ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 18 Oct 2007 11:25:42 +1000 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: synchronize_irq needs at the very least a compiler barrier and a read barrier on SMP, Why? but there are enough cases around where a write barrier is also needed and it's not a hot path so I prefer using a full smp_mb() here. It will degrade to a compiler barrier on !SMP. Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Index: linux-work/kernel/irq/manage.c === --- linux-work.orig/kernel/irq/manage.c 2007-10-18 11:22:16.0 +1000 +++ linux-work/kernel/irq/manage.c2007-10-18 11:22:20.0 +1000 @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq) if (irq = NR_IRQS) return; + smp_mb(); while (desc-status IRQ_INPROGRESS) cpu_relax(); } Anyone reading this code is going to ask wtf is that for. It needs a comment telling them. mb() is the new lock_kernel(). Sigh. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
Index: linux-work/kernel/irq/manage.c === --- linux-work.orig/kernel/irq/manage.c 2007-10-18 11:22:16.0 +1000 +++ linux-work/kernel/irq/manage.c 2007-10-18 11:22:20.0 +1000 @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq) if (irq = NR_IRQS) return; + smp_mb(); while (desc-status IRQ_INPROGRESS) cpu_relax(); } Anyone reading this code is going to ask wtf is that for. It needs a comment telling them. mb() is the new lock_kernel(). Sigh. Ugh ? That sounds fairly obvious to me :-) we are reading a value, that is totally unordered, nothing to do about lock kernel or whatever, if we want the above statement to make any sense in any kind of usage scenario, it needs to be ordered vs. what happens before. For example, take a construct like: device-my_hw_is_off = 1; synchronize_irq(); turn_off_hardware(); That basically makes sure the irq either sees device-my_hw_is_off being set to 1, or if an irq handler is already in progress and hasn't seen it, we wait for it to complete. (You can replace hw_is_off with anything that we want to set and make sure the IRQ handler sees it before proceeding. It could be clearing a pointer to something and make sure the irq sees it before freeing the data, etc...). I think pretty much any use of synchronize_irq() I can imagine needs such kind of ordering... or it simply doesn't synchronize anything :-) Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Thu, 18 Oct 2007, Benjamin Herrenschmidt wrote: + smp_mb(); while (desc-status IRQ_INPROGRESS) cpu_relax(); So, what exactly does it protect against? At a minimum, this needs a comment in the changelog, and probably preferably in the source code too. The thing is, synchronize_irq() can only protect against interrupts that are *already*running* on another CPU, and the caller must have made sure that no new interrupts are coming in (or at least that whatever new interrupts that come in will not pick up a certain piece of data). So I can imagine that the smb_mb() is there so that the caller - who has cleared some list or done something like that - will have any preceding writes that it did be serialized with actually checking the old state of desc-status. Fair enough - I can see that a smp_mb() is useful. But I think it needs documenting as such, and preferably with an example of how this actually happened in the first place (do you have one?) The synchronize_irq() users that are really fundamental (ie the irq datastructures themselves) all should use the irq descriptor spinlock, and should *not* be needing any of this, since they would have serialized with whoever actually set the IRQ_INPROGRESS bit in the first place. So in *that* sense, I think the memory barrier is useless, and I can't make up my mind whether it's good or bad. Which is why I'd really like to have an example scenario spelled out where it makes a difference.. Ok? Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
On Wed, 2007-10-17 at 19:12 -0700, Linus Torvalds wrote: So, what exactly does it protect against? At a minimum, this needs a comment in the changelog, and probably preferably in the source code too. I replied to Andrew, but I agree, it's worth a comment, I'll add one. The thing is, synchronize_irq() can only protect against interrupts that are *already*running* on another CPU, and the caller must have made sure that no new interrupts are coming in (or at least that whatever new interrupts that come in will not pick up a certain piece of data). So I can imagine that the smb_mb() is there so that the caller - who has cleared some list or done something like that - will have any preceding writes that it did be serialized with actually checking the old state of desc-status. Yes. Fair enough - I can see that a smp_mb() is useful. But I think it needs documenting as such, and preferably with an example of how this actually happened in the first place (do you have one?) One could argue that it's the caller that should do the mb() but that would really be asking for trouble. Since most usage scenario require it, and it's not a hot path, I prefer having it here. Note that some kind of read barrier or compiler barrier should be needed regardless, or we are just not sync'ing with anything at all (we may have loaded the value ages ago and thus operate on a totally stale value). I prefer a full barrier to also ensure all previous stores are pushed out. The synchronize_irq() users that are really fundamental (ie the irq datastructures themselves) all should use the irq descriptor spinlock, and should *not* be needing any of this, since they would have serialized with whoever actually set the IRQ_INPROGRESS bit in the first place. That isn't always the case. You can have very efficient lock-less fast path and use synchronize_irq as a kind of RCU-like way to have the slow path do the right thing. So in *that* sense, I think the memory barrier is useless, and I can't make up my mind whether it's good or bad. Which is why I'd really like to have an example scenario spelled out where it makes a difference.. Well, typically, I can clear a pointer and want to make sure my IRQ handler isn't using it anymore before freeing the memory. Or set a HW is off flag. Having spinlocks all over isn't always the best approach on things that are really hot path, it's fair enough to use lockless approaches like that by moving the burden to the slow path that does synchronize_irq. In general, I tend to think that for this function to make any sense (that is, to synchronize anything at all), it needs a barrier or you are just making a decision based on a totally random value of desc-status since it can have been re-ordered, speculatively loaded, pre-fetched, whatever'ed... :-). Want a respin with a comment ? Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] synchronize_irq needs a barrier
In general, I tend to think that for this function to make any sense (that is, to synchronize anything at all), it needs a barrier or you are just making a decision based on a totally random value of desc-status since it can have been re-ordered, speculatively loaded, pre-fetched, whatever'ed... :-). Take a real life example: drivers/message/fusion/mptbase.c /* Disable interrupts! */ CHIPREG_WRITE32(ioc-chip-IntMask, 0x); ioc-active = 0; synchronize_irq(pdev-irq); And we aren't in a spinlock here. That's just a random example grepped I think I see a few more. Then, some drivers like tg3 actually do an smp_mb() before calling synchronize_irq(). But then, some don't. I think trying to have all drivers be correct here is asking for trouble, we'd rather have synchronize_irq() be uber-safe. It's not like it was used in hot path anyway. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev