Re: RFC on writel and writel_relaxed
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote: >> On 3/28/2018 11:55 AM, David Miller wrote: >>> From: Benjamin Herrenschmidt >>> Date: Thu, 29 Mar 2018 02:13:16 +1100 >>> Let's fix all archs, it's way easier than fixing all drivers. Half of the archs are unused or dead anyway. >>> >>> Agreed. >>> >> >> I pinged most of the maintainers yesterday. >> Which arches do we care about these days? >> I have not been paying attention any other architecture besides arm64. > > Thanks for going through that exercise ! > > Once sparc, s390, microblaze and mips reply, I think we'll have a good > coverage, maybe riscv is to put in that lot too. I posted the following two patches for supporting microblaze and unicore32. [PATCH v2 1/2] io: prevent compiler reordering on the default writeX() implementation [PATCH v2 2/2] io: prevent compiler reordering on the default readX() implementation The rest of the arches except mips and alpha seem OK. I sent a question email on Friday to mips and alpha mailing lists. I'll follow up with an actual patch today. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote: > On 3/28/2018 11:55 AM, David Miller wrote: > > From: Benjamin Herrenschmidt > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > > > I pinged most of the maintainers yesterday. > Which arches do we care about these days? > I have not been paying attention any other architecture besides arm64. Thanks for going through that exercise ! Once sparc, s390, microblaze and mips reply, I think we'll have a good coverage, maybe riscv is to put in that lot too. Cheers, Ben. > > arch status detail > --- > > alpha question sent > arc question sent ys...@users.sourceforge.jp will fix it. > arm no issues > arm64 no issues > blackfin question sent about to be removed > c6x question sent > cris question sent > frv > h8300 question sent > hexagon question sent > ia64 no issues confirmed by Tony Luck > m32r > m68k question sent > metag > microblazequestion sent > mips question sent > mn10300 question sent > nios2 question sent > openrisc no issues sho...@gmail.com says should no issues > pariscno issues grantgrund...@gmail.com says > most probably no problem but still looking > powerpc no issues > riscv question sent > s390 question sent > score question sent > shquestion sent > sparc question sent > tile question sent > unicore32 question sent > x86 no issues > xtensaquestion sent > >
Re: RFC on writel and writel_relaxed
On 3/29/2018 12:29 PM, Arnd Bergmann wrote: > On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya wrote: >> On 3/28/2018 11:55 AM, David Miller wrote: >>> From: Benjamin Herrenschmidt >>> Date: Thu, 29 Mar 2018 02:13:16 +1100 >>> Let's fix all archs, it's way easier than fixing all drivers. Half of the archs are unused or dead anyway. >>> >>> Agreed. >>> >> >> I pinged most of the maintainers yesterday. >> Which arches do we care about these days? >> I have not been paying attention any other architecture besides arm64. >> >> archstatus detail >> -- - >> alpha question sent Thanks for the detailed analysis. > > I'm guessing alpha has problems > > extern inline u32 readl(const volatile void __iomem *addr) > { > u32 ret = __raw_readl(addr); > mb(); > return ret; > } > extern inline void writel(u32 b, volatile void __iomem *addr) > { > __raw_writel(b, addr); > mb(); > } Looks like a problem to me too. I'll start a thread with the alpha people and CC you. > > There is a barrier in writel /after/ the acess but not before. > This is the consolidated list. I also heart back from m68k and corrected contacts for arc and h8300. archstatus detail -- - alpha question sent Arnd: alpha has problems arc question sent vineet.gup...@synopsys.com says he'll get to this in the next few days arm no issues arm64 no issues c6x no issues no PCI h8300 no issues no PCI: ys...@users.sourceforge.jp will fix it. hexagon no issues no PCI ia64no issues confirmed by Tony Luck m68kno issues ge...@linux-m68k.org says no problem metag no issues arnd: removed microblaze question sent arnd: some mips platforms have problems mipsquestion sent arnd: some mips platforms have problems nds32 question sent nios2 no issues no PCI openriscno issues sho...@gmail.com says should no issues parisc no issues grantgrund...@gmail.com says most probably no problem but still looking powerpc no issues riscv no issues arnd: riscv should be fine s390no issues arnd: Pretty sure this is also fine sh question sent sparc no issues da...@davemloft.net says always strongly ordered unicore32 question sent resent to g...@pku.edu.cn x86 no issues x86_64 no issues > > Arnd > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Thu, Mar 29, 2018 at 02:58:34PM +, David Laight wrote: > From: Jason Gunthorpe > > Sent: 29 March 2018 15:45 > ... > > > > When talking about ordering between the devices, the relevant question > > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > > > from the DEVICE_FOO. 'ordered' means that in this case > > > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > > > by BAR. > > > > > > Yes, and that isn't the case for arm because the writes can still be > > > buffered. > > > > The statement is not about buffering, or temporal completion order, or > > the order of acks returning to the CPU. It is about pure transaction > > ordering inside the interconnect. > > > > Can write BAR -> FOO pass write CPU -> FOO? > > Almost certainly. > The first cpu write can almost certainly be 'stalled' at the shared PCIe > bridge. > The second cpu write then completes (to a different target). > That target then issues a peer to peer transfer that reaches the shared > bridge. > I doubt the order of the transactions is guaranteed when it becomes > 'un-stalled'. The PCI spec has very strong wording on ordering that covers this case. Stalled bridges have to follow the ordering rules, and posted writes cannot pass other posted writes. Since in PCI all three transactions: CPU -> FOO CPU -> BAR BAR -> FOO Must traverse a shared bus segment, they must be placed on that bus in the above order, and the bridge(s) toward FOO must preserve this order. ARM's AXI has similar rules, I just can't recall the tiny details right now :) > Of course, these are peer to peer transfers, and strange ones at that. > Normally you'd not be doing peer to peer transfers that access 'memory' > the cpu has just written to. It is the best situation I can think of where order of completion to different devices would matter to a generic Linux driver.. .. And there are patches circulating right now for NVMe that enable exactly this kind of transfer, and rely on these kind of semantics, so it is a relevant detail :) Jason
Re: RFC on writel and writel_relaxed
On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya wrote: > On 3/28/2018 11:55 AM, David Miller wrote: >> From: Benjamin Herrenschmidt >> Date: Thu, 29 Mar 2018 02:13:16 +1100 >> >>> Let's fix all archs, it's way easier than fixing all drivers. Half of >>> the archs are unused or dead anyway. >> >> Agreed. >> > > I pinged most of the maintainers yesterday. > Which arches do we care about these days? > I have not been paying attention any other architecture besides arm64. > > archstatus detail > -- - > alpha question sent I'm guessing alpha has problems extern inline u32 readl(const volatile void __iomem *addr) { u32 ret = __raw_readl(addr); mb(); return ret; } extern inline void writel(u32 b, volatile void __iomem *addr) { __raw_writel(b, addr); mb(); } There is a barrier in writel /after/ the acess but not before. > arc question sent ys...@users.sourceforge.jp will fix > it. > arm no issues > arm64 no issues > blackfinquestion sent about to be removed > c6x question sent no PCI, so it might not matter that much -- all drivers are platform specific in the end. > crisquestion sent > frv cris and frv are getting removed > h8300 question sent no PCI > hexagon question sent no PCI > ia64no issues confirmed by Tony Luck > m32r removed > m68kquestion sent > metag removed > microblaze question sent > mipsquestion sent I'm guessing that some mips platforms have problems, but others don't. > mn10300 question sent removed > nios2 question sent no PCI > openriscno issues sho...@gmail.com says should no issues > parisc no issues grantgrund...@gmail.com says most > probably no problem but still looking > powerpc no issues > riscv question sent riscv should be fine > s390question sent Pretty sure this is also fine > score question sent removed > sh question sent > sparc question sent > tilequestion sent removed > unicore32 question sent Note the maintainer's new email address in linux-next. > x86 no issues > xtensa question sent removed. Arnd
RE: RFC on writel and writel_relaxed
From: Jason Gunthorpe > Sent: 29 March 2018 15:45 ... > > > When talking about ordering between the devices, the relevant question > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > > from the DEVICE_FOO. 'ordered' means that in this case > > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > > by BAR. > > > > Yes, and that isn't the case for arm because the writes can still be > > buffered. > > The statement is not about buffering, or temporal completion order, or > the order of acks returning to the CPU. It is about pure transaction > ordering inside the interconnect. > > Can write BAR -> FOO pass write CPU -> FOO? Almost certainly. The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge. The second cpu write then completes (to a different target). That target then issues a peer to peer transfer that reaches the shared bridge. I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'. Of course, these are peer to peer transfers, and strange ones at that. Normally you'd not be doing peer to peer transfers that access 'memory' the cpu has just written to. Requiring extra barriers in this case, or different functions for WC accesses shouldn't really be an issue. Even requiring a barrier between a write to dma coherent memory and a write that starts dma isn't really onerous. Even if it is a nop on all current architectures it is a good comment in the code. It could even have a 'dev' argument. David
Re: RFC on writel and writel_relaxed
On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote: > > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > > > powerpc and ARM can't quite make them synchronous I think, but at > > > > > > least > > > > > > they should have the same semantics as writel. > > > > > > > > > > One thing that ARM does IIRC is that it only guarantees to order > > > > > writel() within > > > > > one device, and the memory mapped PCI I/O space window almost > > > > > certainly > > > > > counts as a separate device to the CPU. > > > > > > > > That sounds bogus. > > > > > > To elaborate, if you do the following on arm: > > > > > > writel(DEVICE_FOO); > > > writel(DEVICE_BAR); > > > > > > we generally cannot guarantee in which order those accesses will hit the > > > devices even if we add every barrier under the sun. You'd need something > > > in between, specific to DEVICE_FOO (probably a read-back) to really push > > > the first write out. This doesn't sound like it would be that uncommon to > > > me. > > > > The PCI posted write does not require the above to execute 'in order' > > only that any bus segment shared by the two devices have the writes > > issued in CPU order. ie at a shared PCI root port for instance. > > > > If I recall this is very similar to the ordering that ARM's on-chip > > AXI interconnect is supposed to provide.. So I'd be very surprised if > > a modern ARM64 has an meaningful difference from x86 here. > > From the architectural perspective, writes to different "peripherals" are > not ordered with respect to each other. The first writel will complete once > it gets its write acknowledgement, but this may not necessarily come from > the endpoint -- it could come from an intermediate buffer past the point of > serialisation (i.e. the write will then be ordered with respect to other > accesses to that same endpoint). The PCI root port would look like one > peripheral here. That is basically the same as PCI - PCI has no write ACK, so all writes are buffered by the PCI interconnect and complete in some undefined temporal order when multiple end points are involved. This does not seem very different from what happens in x86.. > > When talking about ordering between the devices, the relevant question > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > from the DEVICE_FOO. 'ordered' means that in this case > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > by BAR. > > Yes, and that isn't the case for arm because the writes can still be > buffered. The statement is not about buffering, or temporal completion order, or the order of acks returning to the CPU. It is about pure transaction ordering inside the interconnect. Can write BAR -> FOO pass write CPU -> FOO? Jason
Re: RFC on writel and writel_relaxed
From: Sinan Kaya Date: Thu, 29 Mar 2018 09:56:01 -0400 > sparc question sent Sparc never lets physical memory accesses pass MMIO, and vice versa. They are always strongly ordered amongst eachother. Therefore no explicit barrier instructions are necessary.
Re: RFC on writel and writel_relaxed
On 3/28/2018 11:55 AM, David Miller wrote: > From: Benjamin Herrenschmidt > Date: Thu, 29 Mar 2018 02:13:16 +1100 > >> Let's fix all archs, it's way easier than fixing all drivers. Half of >> the archs are unused or dead anyway. > > Agreed. > I pinged most of the maintainers yesterday. Which arches do we care about these days? I have not been paying attention any other architecture besides arm64. archstatus detail -- - alpha question sent arc question sent ys...@users.sourceforge.jp will fix it. arm no issues arm64 no issues blackfinquestion sent about to be removed c6x question sent crisquestion sent frv h8300 question sent hexagon question sent ia64no issues confirmed by Tony Luck m32r m68kquestion sent metag microblaze question sent mipsquestion sent mn10300 question sent nios2 question sent openriscno issues sho...@gmail.com says should no issues parisc no issues grantgrund...@gmail.com says most probably no problem but still looking powerpc no issues riscv question sent s390question sent score question sent sh question sent sparc question sent tilequestion sent unicore32 question sent x86 no issues xtensa question sent -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > This is a variation on the mandatory write barrier that causes writes to > > weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond > > the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. > > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock, without it, MMIOs from > different CPUs (within the same lock) would still arrive OO. > > If that's indeed the case, I would suggest ia64 switches to a similar > per-cpu flag trick powerpc uses. ... or we could remove ia64. /me runs for cover Will
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote: > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > > powerpc and ARM can't quite make them synchronous I think, but at > > > > > least > > > > > they should have the same semantics as writel. > > > > > > > > One thing that ARM does IIRC is that it only guarantees to order > > > > writel() within > > > > one device, and the memory mapped PCI I/O space window almost certainly > > > > counts as a separate device to the CPU. > > > > > > That sounds bogus. > > > > To elaborate, if you do the following on arm: > > > > writel(DEVICE_FOO); > > writel(DEVICE_BAR); > > > > we generally cannot guarantee in which order those accesses will hit the > > devices even if we add every barrier under the sun. You'd need something > > in between, specific to DEVICE_FOO (probably a read-back) to really push > > the first write out. This doesn't sound like it would be that uncommon to > > me. > > The PCI posted write does not require the above to execute 'in order' > only that any bus segment shared by the two devices have the writes > issued in CPU order. ie at a shared PCI root port for instance. > > If I recall this is very similar to the ordering that ARM's on-chip > AXI interconnect is supposed to provide.. So I'd be very surprised if > a modern ARM64 has an meaningful difference from x86 here. >From the architectural perspective, writes to different "peripherals" are not ordered with respect to each other. The first writel will complete once it gets its write acknowledgement, but this may not necessarily come from the endpoint -- it could come from an intermediate buffer past the point of serialisation (i.e. the write will then be ordered with respect to other accesses to that same endpoint). The PCI root port would look like one peripheral here. > When talking about ordering between the devices, the relevant question > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > from the DEVICE_FOO. 'ordered' means that in this case > writel(DEVICE_FOO) must be presented to FOO before anything generated > by BAR. Yes, and that isn't the case for arm because the writes can still be buffered. Will
Re: RFC on writel and writel_relaxed
On Thu, 29 Mar 2018 08:31:32 +1100 Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > > David Miller wrote: > > > > > From: Benjamin Herrenschmidt > > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > > the archs are unused or dead anyway. > > > > > > Agreed. > > > > While we're making decrees here, can we do something about mmiowb? > > The semantics are basically indecipherable. > > I was going to tackle that next :-) > > > This is a variation on the mandatory write barrier that causes writes to > > weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond > > the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. > > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock, without it, MMIOs from > different CPUs (within the same lock) would still arrive OO. Yes more or less, and I think that until mmiowb was introduced, wmb or writel was sufficient for this. Thanks, Nick
Re: RFC on writel and writel_relaxed
On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > David Miller wrote: > > > From: Benjamin Herrenschmidt > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > While we're making decrees here, can we do something about mmiowb? > The semantics are basically indecipherable. I was going to tackle that next :-) > This is a variation on the mandatory write barrier that causes writes to > weakly > ordered I/O regions to be partially ordered. Its effects may go beyond the > CPU->Hardware interface and actually affect the hardware at some level. > > How can a driver writer possibly get that right? > > IIRC it was added for some big ia64 system that was really expensive > to implement the proper wmb() semantics on. So wmb() semantics were > quietly downgraded, then the subsequently broken drivers they cared > about were fixed by adding the stronger mmiowb(). > > What should have happened was wmb and writel remained correct, sane, and > expensive, and they add an mmio_wmb() to order MMIO stores made by the > writel_relaxed accessors, then use that to speed up the few drivers they > care about. > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > make wmb ordering talk about stores to the device, not to some > intermediate stage of the interconnect where it can be subsequently > reordered wrt the device? Drivers can be converted back to using wmb > or writel gradually. I was under the impression that mmiowb was specifically about ordering writel's with a subsequent spin_unlock, without it, MMIOs from different CPUs (within the same lock) would still arrive OO. If that's indeed the case, I would suggest ia64 switches to a similar per-cpu flag trick powerpc uses. Cheers, Ben. > Thanks, > Nick
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > they should have the same semantics as writel. > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() > > > within > > > one device, and the memory mapped PCI I/O space window almost certainly > > > counts as a separate device to the CPU. > > > > That sounds bogus. > > To elaborate, if you do the following on arm: > > writel(DEVICE_FOO); > writel(DEVICE_BAR); > > we generally cannot guarantee in which order those accesses will hit the > devices even if we add every barrier under the sun. You'd need something > in between, specific to DEVICE_FOO (probably a read-back) to really push > the first write out. This doesn't sound like it would be that uncommon to > me. The PCI posted write does not require the above to execute 'in order' only that any bus segment shared by the two devices have the writes issued in CPU order. ie at a shared PCI root port for instance. If I recall this is very similar to the ordering that ARM's on-chip AXI interconnect is supposed to provide.. So I'd be very surprised if a modern ARM64 has an meaningful difference from x86 here. When talking about ordering between the devices, the relevant question is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA from the DEVICE_FOO. 'ordered' means that in this case writel(DEVICE_FOO) must be presented to FOO before anything generated by BAR. Jason
Re: RFC on writel and writel_relaxed
On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) David Miller wrote: > From: Benjamin Herrenschmidt > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > the archs are unused or dead anyway. > > Agreed. While we're making decrees here, can we do something about mmiowb? The semantics are basically indecipherable. This is a variation on the mandatory write barrier that causes writes to weakly ordered I/O regions to be partially ordered. Its effects may go beyond the CPU->Hardware interface and actually affect the hardware at some level. How can a driver writer possibly get that right? IIRC it was added for some big ia64 system that was really expensive to implement the proper wmb() semantics on. So wmb() semantics were quietly downgraded, then the subsequently broken drivers they cared about were fixed by adding the stronger mmiowb(). What should have happened was wmb and writel remained correct, sane, and expensive, and they add an mmio_wmb() to order MMIO stores made by the writel_relaxed accessors, then use that to speed up the few drivers they care about. Now that ia64 doesn't matter too much, can we deprecate mmiowb and just make wmb ordering talk about stores to the device, not to some intermediate stage of the interconnect where it can be subsequently reordered wrt the device? Drivers can be converted back to using wmb or writel gradually. Thanks, Nick
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt > Sent: 28 March 2018 16:13 ... > > I've always wondered exactly what the twi;isync were for - always seemed > > very heavy handed for most mmio reads. > > Particularly if you are doing mmio reads from a data fifo. > > If you do that you should use the "s" version of the accessors. Those > will only do the above trick at the end of the access series. Also a > FIFO needs special care about endianness anyway, so you should use > those accessors regardless. (Hint: you never endian swap a FIFO even on > BE on a LE device, unless something's been wired very badly in HW). That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface. Reading the high address 'clocked' the fifo. So the first 3 reads could happen in any order, but the 4th had to be last. This is a small ppc and we shovel a lot of data through that fifo. Whether it needed byteswapping depended completely on how our hardware people had built the pcb (not made easy by some docs using the ibm bit numbering). In fact it didn't While that driver only had to run on a very specific small ppc, generic drivers might have similar issues. I suspect that writel() is always (or should always be): barrier_before_writel() writel_relaxed() barrier_after_writel() So if a driver needs to do multiple writes (without strong ordering) it should be able to repeat the writel_relaxed() with only one set of barriers. Similarly for readl(). In addition a lesser barrier is probably enough between a readl_relaxed() and a writel_relaxed() that is conditional on the read value. David
Re: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt Date: Thu, 29 Mar 2018 02:13:16 +1100 > Let's fix all archs, it's way easier than fixing all drivers. Half of > the archs are unused or dead anyway. Agreed.
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 07:41 -0400, ok...@codeaurora.org wrote: > Yes, we want to get there indeed. It is because of some arch not > implementing writel properly. Maintainers want to play safe. > > That is why I asked if IA64 and other well known archs follow the > strongly ordered rule at this moment like PPC and ARM. > > Or should we go and inform every arch about this before yanking wmb()? > > Maintainers are afraid of introducing a regression. Let's fix all archs, it's way easier than fixing all drivers. Half of the archs are unused or dead anyway. > > > > The above code makes no sense, and just looks stupid to me. It also > > generates pointlessly bad code on x86, so it's bad there too. > > > > Linus
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 11:30 +, David Laight wrote: > From: Benjamin Herrenschmidt > > Sent: 28 March 2018 10:56 > > ... > > For example, let's say I have a device with a reset bit and the spec > > says the reset bit needs to be set for at least 10us. > > > > This is wrong: > > > > writel(1, RESET_REG); > > usleep(10); > > writel(0, RESET_REG); > > > > Because of write posting, the first write might arrive to the device > > right before the second one. > > > > The typical "fix" is to turn that into: > > > > writel(1, RESET_REG); > > readl(RESET_REG); /* Flush posted writes */ > > Would a writel(1, RESET_REG) here provide enough synchronsiation? Probably yes. It's one of those things where you try to deal with the fact that 90% of driver writers barely understand the basic stuff and so you need the "default" accessors to be hardened as much as possible. We still need to get a reasonably definition of the semantics of the relaxed ones vs. WC memory but let's get through that exercise first and hopefully for the last time. > > usleep(10); > > writel(0, RESET_REG); > > > > *However* the issue here, at least on power, is that the CPU can issue > > that readl but doesn't necessarily wait for it to complete (ie, the > > data to return), before proceeding to the usleep. Now a usleep contains > > a bunch of loads and stores and is probably fine, but a udelay which > > just loops on the timebase may not be. > > > > Thus we may still violate the timing requirement. > > I've seem that sort of code (with udelay() and no read back) quite often. > How many were in linux I don't know. > > For small delays I usually fix it by repeated writes (of the same value) > to the device register. That can guarantee very short intervals. As long as you know the bus frequency... > The only time I've actually seen buffered writes break timing was > between a 286 and an 8859 interrupt controller. :-) The problem for me is not so much what I've seen, I realize that most of the issues we are talking about are the kind that will hit once in a thousand times or less. But we *can* reason about them in a way that can effectively prevent the problem completely and when your cluster has 1 machine, 1/1000 starts becoming significant. These days the vast majority of IO devices either are 99% DMA driven so that a bit of overhead on MMIOs is irrelevant, or have one fast path (low latency IB etc...) that needs some finer control, and the rest is all setup which can be paranoid at will. So I think we should aim for the "default" accessors most people use to be as hadened as we can think of. I favor correctness over performance in all cases. But then we also define a reasonable semantic for the relaxed ones (well, we sort-of do have one, we might have to make it a bit more precise in some areas) that allows the few MMIO fast path that care to be optimized. > If you wrote to the mask then enabled interrupts the first IACK cycle > could be too close to write and break the cycle recovery time. > That clobbered most of the interrupt controller registers. > That probably affected every 286 board ever built! > Not sure how much software added the required extra bus cycle. > > > What we did inside readl, with the twi;isync sequence (which basically > > means, trap on return value with "trap never" as a condition, followed > > by isync that ensures all excpetion conditions are resolved), is force > > the CPU to "consume" the data from the read before moving on. > > > > This effectively makes readl fully synchronous (we would probably avoid > > that if we were to implement a readl_relaxed). > > I've always wondered exactly what the twi;isync were for - always seemed > very heavy handed for most mmio reads. > Particularly if you are doing mmio reads from a data fifo. If you do that you should use the "s" version of the accessors. Those will only do the above trick at the end of the access series. Also a FIFO needs special care about endianness anyway, so you should use those accessors regardless. (Hint: you never endian swap a FIFO even on BE on a LE device, unless something's been wired very badly in HW). > Perhaps there should be a writel_wait() that is allowed to do a read back > for such code paths? I think what we have is fine, we just define that the standard writel/readl as fairly simple and hardened, and we look at providing a somewhat reasonable set of relaxed variants for optimizing fast path. We pretty much already are there, we just need to be better at defining the semantics. And for the super high perf case, which thankfully is either seldom (server high perf network stuff) or very arch specific (ARM SoC stuff), then arch specific driver hacks will always remain the norm. Cheers, Ben. > David >
Re: RFC on writel and writel_relaxed
On 2018-03-28 02:14, Linus Torvalds wrote: On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya wrote: Basically changing it to dma_buffer->foo = 1;/* WB */ wmb() writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */ mmiowb() Why? Why not just remove the wmb(), and keep the barrier in the writel()? Yes, we want to get there indeed. It is because of some arch not implementing writel properly. Maintainers want to play safe. That is why I asked if IA64 and other well known archs follow the strongly ordered rule at this moment like PPC and ARM. Or should we go and inform every arch about this before yanking wmb()? Maintainers are afraid of introducing a regression. The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt > Sent: 28 March 2018 10:56 ... > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive to the device > right before the second one. > > The typical "fix" is to turn that into: > > writel(1, RESET_REG); > readl(RESET_REG); /* Flush posted writes */ Would a writel(1, RESET_REG) here provide enough synchronsiation? > usleep(10); > writel(0, RESET_REG); > > *However* the issue here, at least on power, is that the CPU can issue > that readl but doesn't necessarily wait for it to complete (ie, the > data to return), before proceeding to the usleep. Now a usleep contains > a bunch of loads and stores and is probably fine, but a udelay which > just loops on the timebase may not be. > > Thus we may still violate the timing requirement. I've seem that sort of code (with udelay() and no read back) quite often. How many were in linux I don't know. For small delays I usually fix it by repeated writes (of the same value) to the device register. That can guarantee very short intervals. The only time I've actually seen buffered writes break timing was between a 286 and an 8859 interrupt controller. If you wrote to the mask then enabled interrupts the first IACK cycle could be too close to write and break the cycle recovery time. That clobbered most of the interrupt controller registers. That probably affected every 286 board ever built! Not sure how much software added the required extra bus cycle. > What we did inside readl, with the twi;isync sequence (which basically > means, trap on return value with "trap never" as a condition, followed > by isync that ensures all excpetion conditions are resolved), is force > the CPU to "consume" the data from the read before moving on. > > This effectively makes readl fully synchronous (we would probably avoid > that if we were to implement a readl_relaxed). I've always wondered exactly what the twi;isync were for - always seemed very heavy handed for most mmio reads. Particularly if you are doing mmio reads from a data fifo. Perhaps there should be a writel_wait() that is allowed to do a read back for such code paths? David
Re: Aw: Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 12:13 +0200, Lino Sanfilippo wrote: > Hi, > > > > > > Yeah so that other trick I'm talking about is also used for timing > > accuracy. > > > > For example, let's say I have a device with a reset bit and the spec > > says the reset bit needs to be set for at least 10us. > > > > This is wrong: > > > > writel(1, RESET_REG); > > usleep(10); > > writel(0, RESET_REG); > > > > Because of write posting, the first write might arrive to the device > > right before the second one. > > > > Does not write posting only concern PCI? This seems to be a different topic. > Furthermore > write posting should not include write reordering... Nobody's talking about re-ordering and no, write posting is rather common practice on a whole lot of different busses, not just PCI(e). Cheers, Ben.
Aw: Re: RFC on writel and writel_relaxed
Hi, > > Yeah so that other trick I'm talking about is also used for timing > accuracy. > > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive to the device > right before the second one. > Does not write posting only concern PCI? This seems to be a different topic. Furthermore write posting should not include write reordering... Regards, Lino
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > they should have the same semantics as writel. > > > > One thing that ARM does IIRC is that it only guarantees to order writel() > > within > > one device, and the memory mapped PCI I/O space window almost certainly > > counts as a separate device to the CPU. > > That sounds bogus. To elaborate, if you do the following on arm: writel(DEVICE_FOO); writel(DEVICE_BAR); we generally cannot guarantee in which order those accesses will hit the devices even if we add every barrier under the sun. You'd need something in between, specific to DEVICE_FOO (probably a read-back) to really push the first write out. This doesn't sound like it would be that uncommon to me. On the other hand: writel(DEVICE_FOO); writel(DEVICE_FOO); is obviously ordered and also things like: writel(DEVICE_FOO_IN_PCI_MEM_SPACE); writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE); are ordered up to the PCI host bridge, because that's really the "device" here. Will
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > powerpc and ARM can't quite make them synchronous I think, but at least > > they should have the same semantics as writel. > > One thing that ARM does IIRC is that it only guarantees to order writel() > within > one device, and the memory mapped PCI I/O space window almost certainly > counts as a separate device to the CPU. That sounds bogus. > In the absence of an enforced global synchronization during an I/O port > access, that means writel() and outb() can be reordered before they arrive > at a device in theory. Again, this rarely matters in practice, but I think it > makes sense to document the less strict behavior here, given that we have > common hardware that can't provide x86 compatible semantics. Can't you put some kind of super heavy handed barrier in inX/outX ? These things are never going to be performance sensitive anyway... Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:00:01AM +, David Laight wrote: > > From: Will Deacon > > > Sent: 28 March 2018 09:54 > > > > ... > > > > > I don't think so. My reading of memory-barriers.txt says that writeX > > > > > might > > > > > expand to outX, and outX is not ordered with respect to other types of > > > > > memory. > > > > > > > > Ugh ? > > > > > > > > My understanding of HW at least is the exact opposite. outX is *more* > > > > ordered if anything, than any other accessors. IO space is completely > > > > synchronous, non posted and ordered afaik. > > > > > > I'm just going by memory-barriers.txt: > > > > > > > > > (*) inX(), outX(): > > > > > > [...] > > > > > > They are guaranteed to be fully ordered with respect to each other. > > > > > > They are not guaranteed to be fully ordered with respect to other > > > types of > > > memory and I/O operation. > > > > A long time ago there was a document from Intel that said that inb/outb > > weren't > > necessarily synchronised wrt memory accesses. > > (Might be P-pro era). > > However no processors actually behaved that way and more recent docs > > say that inb/outb are fully ordered. > > Thank you, David! I'll write another patch fixing this up and hopefully > we'll soon have one making writeX/readX much clearer. Thanks for doing the grunt work Will ! :-) Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote: > > For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to > add an mb() to make it work. > > Do both of these work on power? Yes. There's even another quirk, see further down ;-) > If so, I guess I can make readl even more > expensive :/ Feels a bit like the tail wagging the dog, though. Maybe, but then readl is always horribly slow anyway so you may not necessarily be losing that much. > Another thing I just realised is that we restrict the barriers we use in > readl/writel on arm64 so that they don't necessary apply to both loads and > stores. To be specific: > >writel is ordered against prior writes to memory, but not reads That could be tricky... You may end up with something that reads before triggering a DMA and ends up with the post-DMA value ... ugh. >readl is ordered against subsequent reads of memory, but not writes (but >note that in example (1) above, the control dependency ensures that). > > If necessary, I could move the barrier in our readl implementation to be > before the read, then play the control-dependency + instruction-sync (ISB) > trick that you do on power. Yeah so that other trick I'm talking about is also used for timing accuracy. For example, let's say I have a device with a reset bit and the spec says the reset bit needs to be set for at least 10us. This is wrong: writel(1, RESET_REG); usleep(10); writel(0, RESET_REG); Because of write posting, the first write might arrive to the device right before the second one. The typical "fix" is to turn that into: writel(1, RESET_REG); readl(RESET_REG); /* Flush posted writes */ usleep(10); writel(0, RESET_REG); *However* the issue here, at least on power, is that the CPU can issue that readl but doesn't necessarily wait for it to complete (ie, the data to return), before proceeding to the usleep. Now a usleep contains a bunch of loads and stores and is probably fine, but a udelay which just loops on the timebase may not be. Thus we may still violate the timing requirement. What we did inside readl, with the twi;isync sequence (which basically means, trap on return value with "trap never" as a condition, followed by isync that ensures all excpetion conditions are resolved), is force the CPU to "consume" the data from the read before moving on. This effectively makes readl fully synchronous (we would probably avoid that if we were to implement a readl_relaxed). Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt wrote: > On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote: >> For arm/arm64 these end up behaving exactly the same as readX/writeX, but >> I'm nervous about changing the documentation without understanding why it's >> like it is currently. Maybe another ia64 thing?. > > I doubt it ... the Intel ancestry here would make me think they are > completely ordered there too. > > powerpc and ARM can't quite make them synchronous I think, but at least > they should have the same semantics as writel. One thing that ARM does IIRC is that it only guarantees to order writel() within one device, and the memory mapped PCI I/O space window almost certainly counts as a separate device to the CPU. In the absence of an enforced global synchronization during an I/O port access, that means writel() and outb() can be reordered before they arrive at a device in theory. Again, this rarely matters in practice, but I think it makes sense to document the less strict behavior here, given that we have common hardware that can't provide x86 compatible semantics. Arnd
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote: > For arm/arm64 these end up behaving exactly the same as readX/writeX, but > I'm nervous about changing the documentation without understanding why it's > like it is currently. Maybe another ia64 thing?. I doubt it ... the Intel ancestry here would make me think they are completely ordered there too. powerpc and ARM can't quite make them synchronous I think, but at least they should have the same semantics as writel. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 09:00:01AM +, David Laight wrote: > From: Will Deacon > > Sent: 28 March 2018 09:54 > ... > > > > I don't think so. My reading of memory-barriers.txt says that writeX > > > > might > > > > expand to outX, and outX is not ordered with respect to other types of > > > > memory. > > > > > > Ugh ? > > > > > > My understanding of HW at least is the exact opposite. outX is *more* > > > ordered if anything, than any other accessors. IO space is completely > > > synchronous, non posted and ordered afaik. > > > > I'm just going by memory-barriers.txt: > > > > > > (*) inX(), outX(): > > > > [...] > > > > They are guaranteed to be fully ordered with respect to each other. > > > > They are not guaranteed to be fully ordered with respect to other > > types of > > memory and I/O operation. > > A long time ago there was a document from Intel that said that inb/outb > weren't > necessarily synchronised wrt memory accesses. > (Might be P-pro era). > However no processors actually behaved that way and more recent docs > say that inb/outb are fully ordered. Thank you, David! I'll write another patch fixing this up and hopefully we'll soon have one making writeX/readX much clearer. Will
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote: > > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt > > wrote: > > > > > > This is why, I want (with your agreement) to define clearly and once > > > and for all, that the Linux semantics of writel are that it is ordered > > > with previous writes to coherent memory (*) > > > > Honestly, I think those are the sane semantics. In fact, make it > > "ordered with previous writes" full stop, since it's not only ordered > > wrt previous writes to memory, but also previous writel's. > > Of course. It was somewhat a given that it's ordered vs. any previous > MMIO actually, but it doesn't hurt to spell it out once more. Good. So I think this confirms our understanding so far. > > > > Also, can I assume the above ordering with writel() equally applies to > > > readl() or not ? > > > > > > IE: > > > dma_buf->foo = 1; > > > readl(STUPID_DEVICE_DMA_KICK_ON_READ); > > > > If that KICK_ON_READ is UC, then that's definitely the case. And > > honestly, status registers like that really should always be UC. > > > > But if somebody sets the area WC (which is crazy), then I think it > > might be at least debatable. x86 semantics does allow reads to be done > > before previous writes (or, put another way, writes to be buffered - > > the buffers are ordered so writes don't get re-ordered, but reads can > > happen during the buffering). > > Right, for now I worry about UC semantics. Once we have nailed that, we > can look at WC, which is a lot more tricky as archs differs more > widely, but one thing at a time. > > > But UC accesses are always done entirely ordered, and honestly, any > > status register that starts a DMA would not make sense any other way. > > > > Of course, you'd have to be pretty odd to want to start a DMA with a > > read anyway - partly exactly because it's bad for performance since > > reads will be synchronous and not buffered like a write). > > I have bad memories of old adaptec controllers ... > > That said, I think the above might not be right on ARM if we want to > make it the rule, Will, what do you reckon ? So there are two cases to consider: 1. if (readl(DEVICE_DMA_STATUS) == DMA_DONE) mydata = *dma_bufp; 2. *dma_bufp = 42; readl(DEVICE_DMA_KICK_ON_READ); For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to add an mb() to make it work. Do both of these work on power? If so, I guess I can make readl even more expensive :/ Feels a bit like the tail wagging the dog, though. Another thing I just realised is that we restrict the barriers we use in readl/writel on arm64 so that they don't necessary apply to both loads and stores. To be specific: writel is ordered against prior writes to memory, but not reads readl is ordered against subsequent reads of memory, but not writes (but note that in example (1) above, the control dependency ensures that). If necessary, I could move the barrier in our readl implementation to be before the read, then play the control-dependency + instruction-sync (ISB) trick that you do on power. Will
RE: RFC on writel and writel_relaxed
From: Will Deacon > Sent: 28 March 2018 09:54 ... > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > > expand to outX, and outX is not ordered with respect to other types of > > > memory. > > > > Ugh ? > > > > My understanding of HW at least is the exact opposite. outX is *more* > > ordered if anything, than any other accessors. IO space is completely > > synchronous, non posted and ordered afaik. > > I'm just going by memory-barriers.txt: > > > (*) inX(), outX(): > > [...] > > They are guaranteed to be fully ordered with respect to each other. > > They are not guaranteed to be fully ordered with respect to other types > of > memory and I/O operation. A long time ago there was a document from Intel that said that inb/outb weren't necessarily synchronised wrt memory accesses. (Might be P-pro era). However no processors actually behaved that way and more recent docs say that inb/outb are fully ordered. David
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote: > > > Can we say the same thing for iowrite32() and iowrite32be(). I also see > > > wmb() > > > in front of these. > > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > expand to outX, and outX is not ordered with respect to other types of > > memory. > > Ugh ? > > My understanding of HW at least is the exact opposite. outX is *more* > ordered if anything, than any other accessors. IO space is completely > synchronous, non posted and ordered afaik. I'm just going by memory-barriers.txt: (*) inX(), outX(): [...] They are guaranteed to be fully ordered with respect to each other. They are not guaranteed to be fully ordered with respect to other types of memory and I/O operation. For arm/arm64 these end up behaving exactly the same as readX/writeX, but I'm nervous about changing the documentation without understanding why it's like it is currently. Maybe another ia64 thing?. Will
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt wrote: > > > > Of course, you'd have to be pretty odd to want to start a DMA with a > > read anyway - partly exactly because it's bad for performance since > > reads will be synchronous and not buffered like a write). > > I have bad memories of old adaptec controllers ... > *Old* adaptec controllers were likely to use the in/out instructions for status and command data. Those are actually even more ordered than UC reads and writes: the in/out instructions are not just fully ordered, but are fully *synchronous* on x86. So not just doing accesses in order, but actually waiting for everything to drain before they start executing, but they also wait for the operation itself to complete (ie "out" will not just queue the write, it will then wait for the queue to empty and the write data to hit the line). That's why in/out were *so* slow, and why nobody uses them any more (well, the address size limitations and the lack of any remapping of the address obviously also are a reason). Linus >
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 09:11 +0200, Arnd Bergmann wrote: > On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt > wrote: > > On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote: > > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt > > > wrote: > > > That's why in/out were *so* slow, and why nobody uses them any more > > > (well, the address size limitations and the lack of any remapping of > > > the address obviously also are a reason). > > > > All true indeed, though a lot of other archs never quite made them > > fully synchronous, which was another can of worms ... oh well. > > Many architectures have no way of providing PCI compliant semantics > for outb, as their instruction set and/or bus interconnect lacks a > method of waiting for completion of an outb. Yup, that includes powerpc. Note that since POWER8 we don't even genetate IO space anymore :-) > In practice, it doesn't seem to matter for any of the devices one would > encounter these days: very few use I/O space, and those that do don't > actually rely on the strict ordering. Some architectures (in particular > s390, but I remember seeing the same thing elsewhere) explicitly > disallow I/O space access on PCI because of this. On ARM, the typical > PCI implementations have other problems that are worse than this > one, so most drivers are fine with the almost-working semantics. /me cries... Ben.
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt wrote: > On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote: >> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt >> wrote: >> That's why in/out were *so* slow, and why nobody uses them any more >> (well, the address size limitations and the lack of any remapping of >> the address obviously also are a reason). > > All true indeed, though a lot of other archs never quite made them > fully synchronous, which was another can of worms ... oh well. Many architectures have no way of providing PCI compliant semantics for outb, as their instruction set and/or bus interconnect lacks a method of waiting for completion of an outb. In practice, it doesn't seem to matter for any of the devices one would encounter these days: very few use I/O space, and those that do don't actually rely on the strict ordering. Some architectures (in particular s390, but I remember seeing the same thing elsewhere) explicitly disallow I/O space access on PCI because of this. On ARM, the typical PCI implementations have other problems that are worse than this one, so most drivers are fine with the almost-working semantics. Arnd
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote: > > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt ing.org> wrote: > > > > > > Of course, you'd have to be pretty odd to want to start a DMA > > with a > > > read anyway - partly exactly because it's bad for performance > > since > > > reads will be synchronous and not buffered like a write). > > > > I have bad memories of old adaptec controllers ... > > *Old* adaptec controllers were likely to use the in/out instructions > for status and command data. > > Those are actually even more ordered than UC reads and writes: the > in/out instructions are not just fully ordered, but are fully > *synchronous* on x86. > > So not just doing accesses in order, but actually waiting for > everything to drain before they start executing, but they also wait > for the operation itself to complete (ie "out" will not just queue > the write, it will then wait for the queue to empty and the write > data to hit the line). > > That's why in/out were *so* slow, and why nobody uses them any more > (well, the address size limitations and the lack of any remapping of > the address obviously also are a reason). All true indeed, though a lot of other archs never quite made them fully synchronous, which was another can of worms ... oh well. As for Adaptec, you might be right, I do remember having cases of old stuff triggering DMA on reads, it might have been "Mac" variants of Adaptec using MMIO or something... Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt > wrote: > > > > This is why, I want (with your agreement) to define clearly and once > > and for all, that the Linux semantics of writel are that it is ordered > > with previous writes to coherent memory (*) > > Honestly, I think those are the sane semantics. In fact, make it > "ordered with previous writes" full stop, since it's not only ordered > wrt previous writes to memory, but also previous writel's. Of course. It was somewhat a given that it's ordered vs. any previous MMIO actually, but it doesn't hurt to spell it out once more. > > Also, can I assume the above ordering with writel() equally applies to > > readl() or not ? > > > > IE: > > dma_buf->foo = 1; > > readl(STUPID_DEVICE_DMA_KICK_ON_READ); > > If that KICK_ON_READ is UC, then that's definitely the case. And > honestly, status registers like that really should always be UC. > > But if somebody sets the area WC (which is crazy), then I think it > might be at least debatable. x86 semantics does allow reads to be done > before previous writes (or, put another way, writes to be buffered - > the buffers are ordered so writes don't get re-ordered, but reads can > happen during the buffering). Right, for now I worry about UC semantics. Once we have nailed that, we can look at WC, which is a lot more tricky as archs differs more widely, but one thing at a time. > But UC accesses are always done entirely ordered, and honestly, any > status register that starts a DMA would not make sense any other way. > > Of course, you'd have to be pretty odd to want to start a DMA with a > read anyway - partly exactly because it's bad for performance since > reads will be synchronous and not buffered like a write). I have bad memories of old adaptec controllers ... That said, I think the above might not be right on ARM if we want to make it the rule, Will, what do you reckon ? Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt wrote: > > This is why, I want (with your agreement) to define clearly and once > and for all, that the Linux semantics of writel are that it is ordered > with previous writes to coherent memory (*) Honestly, I think those are the sane semantics. In fact, make it "ordered with previous writes" full stop, since it's not only ordered wrt previous writes to memory, but also previous writel's. > Also, can I assume the above ordering with writel() equally applies to > readl() or not ? > > IE: > dma_buf->foo = 1; > readl(STUPID_DEVICE_DMA_KICK_ON_READ); If that KICK_ON_READ is UC, then that's definitely the case. And honestly, status registers like that really should always be UC. But if somebody sets the area WC (which is crazy), then I think it might be at least debatable. x86 semantics does allow reads to be done before previous writes (or, put another way, writes to be buffered - the buffers are ordered so writes don't get re-ordered, but reads can happen during the buffering). But UC accesses are always done entirely ordered, and honestly, any status register that starts a DMA would not make sense any other way. Of course, you'd have to be pretty odd to want to start a DMA with a read anyway - partly exactly because it's bad for performance since reads will be synchronous and not buffered like a write). Linus
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya wrote: > > Basically changing it to > > dma_buffer->foo = 1;/* WB */ > wmb() > writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */ > mmiowb() Why? Why not just remove the wmb(), and keep the barrier in the writel()? The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote: > On 3/27/2018 10:51 PM, Linus Torvalds wrote: > > > The discussion at hand is about > > > > > > dma_buffer->foo = 1;/* WB */ > > > writel(KICK, DMA_KICK_REGISTER);/* UC */ > > > > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > > even if that writel() might be of type WC, because that only delays > > writes, it doesn't move them earlier. > > Now that we clarified x86 myth, Is this guaranteed on all architectures? If not we need to fix it. It's guaranteed on the "main" ones (arm, arm64, powerpc, i386, x86_64). We might need to check with other arch maintainers for the rest. We really want Linux to provide well defined "sane" semantics for the basic writel accessors. Note: We still have open questions about how readl() relates to surrounding memory accesses. It looks like ARM and powerpc do different things here. > We keep getting IA64 exception example. Maybe, this is also corrected since > then. I would think ia64 fixed it back when it was all discussed. I was under the impression all ia64 had "special" was the writel vs. spin_unlock which requires mmiowb, but maybe that was never completely fixed ? > Jose Abreu says "I don't know about x86 but arc architecture doesn't > have a wmb() in the writel() function (in some configs)". Well, it probably should then. > As long as we have these exceptions, these wmb() in common drivers is not > going anywhere and relaxed-arches will continue paying performance penalty. Well, let's fix them or leave them broken, at this point, it doesn't matter. We can give all arch maintainers a wakeup call and start making drivers work based on the documented assumptions. > I see 15% performance loss on ARM64 servers using Intel i40e network > drivers and an XL710 adapter due to CPU keeping itself busy doing barriers > most of the time rather than real work because of sequences like this all over > the place. > > dma_buffer->foo = 1;/* WB */ >wmb() > writel(KICK, DMA_KICK_REGISTER);/* UC */ > > I posted several patches last week to remove duplicate barriers on ARM while > trying to make the code friendly with other architectures. > > Basically changing it to > > dma_buffer->foo = 1;/* WB */ > wmb() > writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */ > mmiowb() > > This is a small step in the performance direction until we remove all > exceptions. > > https://www.spinics.net/lists/netdev/msg491842.html > https://www.spinics.net/lists/linux-rdma/msg62434.html > https://www.spinics.net/lists/arm-kernel/msg642336.html > > Discussion started to move around the need for relaxed API on PPC and then > why wmb() question came up. I'm working on the problem of relaxed APIs for powerpc, but we can keep that orthogonal. As is, today, a wmb() + writel() and a wmb() + writel_relaxed() on powerpc are identical. So changing them will not break us. But I don't see the point of doing that transformation if we can just get the straying archs fixed. It's not like any of them has a significant market presence these days anyway. Cheers, Ben. > Sinan >
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt > wrote: > > > > The discussion at hand is about > > > > dma_buffer->foo = 1;/* WB */ > > writel(KICK, DMA_KICK_REGISTER);/* UC */ > > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > even if that writel() might be of type WC, because that only delays > writes, it doesn't move them earlier. Ok so this is our answer ... ... snip ... (thanks for the background info !) > Oh, the above UC case is absoutely guaranteed. Good. Then > The only issue really is that 99.9% of all testing gets done on x86 > unless you look at specific SoC drivers. > > On ARM, for example, there is likely little reason to care about x86 > memory ordering, because there is almost zero driver overlap between > x86 and ARM. > > *Historically*, the reason for following the x86 IO ordering was > simply that a lot of architectures used the drivers that were > developed on x86. The alpha and powerpc workstations were *designed* > with the x86 IO bus (PCI, then PCIe) and to work with the devices that > came with it. > > ARM? PCIe is almost irrelevant. For ARM servers, if they ever take > off, sure. But 99.99% of ARM is about their own SoC's, and so "x86 > test coverage" is simply not an issue. > > How much of an issue is it for Power? Maybe you decide it's not a big deal. > > Then all the above is almost irrelevant. So the overlap may not be that NIL in practice :-) But even then that doesn't matter as ARM has been happily implementing the same semantic you describe above for years, as do we powerpc. This is why, I want (with your agreement) to define clearly and once and for all, that the Linux semantics of writel are that it is ordered with previous writes to coherent memory (*) This is already what ARM and powerpc provide, from what you say, what x86 provides, I don't see any reason to keep that badly documented and have drivers randomly growing useless wmb()'s because they don't think it works on x86 without them ! Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock and the problem of writel_relaxed semantics but as separate issues :-) Also, can I assume the above ordering with writel() equally applies to readl() or not ? IE: dma_buf->foo = 1; readl(STUPID_DEVICE_DMA_KICK_ON_READ); Also works on x86 ? (It does on power, maybe not on ARM). Cheers, Ben. (*) From an Linux API perspective, all of this is only valid if the memory was allocated by dma_alloc_coherent(). Anything obtained by dma_map_something() might have been bounced bufferred or might require extra cache flushes on some architectures, and thus needs dma_sync_for_{cpu,device} calls. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On 3/27/2018 10:51 PM, Linus Torvalds wrote: >> The discussion at hand is about >> >> dma_buffer->foo = 1;/* WB */ >> writel(KICK, DMA_KICK_REGISTER);/* UC */ > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > even if that writel() might be of type WC, because that only delays > writes, it doesn't move them earlier. Now that we clarified x86 myth, Is this guaranteed on all architectures? We keep getting IA64 exception example. Maybe, this is also corrected since then. Jose Abreu says "I don't know about x86 but arc architecture doesn't have a wmb() in the writel() function (in some configs)". As long as we have these exceptions, these wmb() in common drivers is not going anywhere and relaxed-arches will continue paying performance penalty. I see 15% performance loss on ARM64 servers using Intel i40e network drivers and an XL710 adapter due to CPU keeping itself busy doing barriers most of the time rather than real work because of sequences like this all over the place. dma_buffer->foo = 1;/* WB */ wmb() writel(KICK, DMA_KICK_REGISTER);/* UC */ I posted several patches last week to remove duplicate barriers on ARM while trying to make the code friendly with other architectures. Basically changing it to dma_buffer->foo = 1;/* WB */ wmb() writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */ mmiowb() This is a small step in the performance direction until we remove all exceptions. https://www.spinics.net/lists/netdev/msg491842.html https://www.spinics.net/lists/linux-rdma/msg62434.html https://www.spinics.net/lists/arm-kernel/msg642336.html Discussion started to move around the need for relaxed API on PPC and then why wmb() question came up. Sinan -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt wrote: > > The discussion at hand is about > > dma_buffer->foo = 1;/* WB */ > writel(KICK, DMA_KICK_REGISTER);/* UC */ Yes. That certainly is ordered on x86. In fact, afaik it's ordered even if that writel() might be of type WC, because that only delays writes, it doesn't move them earlier. Whether people *do* that or not, I don't know. But I wouldn't be surprised if they do. So if it's a DMA buffer, it's "cached". And even cached accesses are ordered wrt MMIO. Basically, to get unordered writes on x86, you need to either use explicitly nontemporal stores, or have a writecombining region with back-to-back writes that actually combine. And nobody really does that nontemporal store thing any more because the hardware that cared pretty much doesn't exist any more. It was too much pain. People use DMA and maybe an UC store for starting the DMA (or possibly a WC buffer that gets multiple stores in ascending order as a stream of commands). Things like UC will force everything to be entirely ordered, but even without UC, loads won't pass loads, and stores won't pass stores. > Now it appears that this wasn't fully understood back then, and some > people are now saying that x86 might not even provide that semantic > always. Oh, the above UC case is absoutely guaranteed. And I think even if it's WC, the write to kick off the DMA is ordered wrt the cached write. On x86, I think you need barriers only if you do things like - do two non-temporal stores and require them to be ordered: put a sfence or mfence in between them. - do two WC stores, and make sure they do not combine: put a sfence or mfence between them. - do a store, and a subsequent from a different address, and neither of them is UC: put a mfence between them. But note that this is literally just "load after store". A "store after load" doesn't need one. I think that's pretty much it. For example, the "lfence" instruction is almost entirely pointless on x86 - it was designed back in the time when people *thought* they might re-order loads. But loads don't get re-ordered. At least Intel seems to document that only non-temporal *stores* can get re-ordered wrt each other. End result: lfence is a historical oddity that can now be used to guarantee that a previous load has finished, and that in turn meant that it is now used in the Spectre mitigations. But it basically has no real memory ordering meaning since nothing passes an earlier load anyway, it's more of a pipeline thing. But in the end, one question is just "how much do drivers actually _rely_ on the x86 strong ordering?" We so support "smp_wmb()" even though x86 has strong enough ordering that just a barrier suffices. Somebody might just say "screw the x86 memory ordering, we're relaxed, and we'll fix up the drivers we care about". The only issue really is that 99.9% of all testing gets done on x86 unless you look at specific SoC drivers. On ARM, for example, there is likely little reason to care about x86 memory ordering, because there is almost zero driver overlap between x86 and ARM. *Historically*, the reason for following the x86 IO ordering was simply that a lot of architectures used the drivers that were developed on x86. The alpha and powerpc workstations were *designed* with the x86 IO bus (PCI, then PCIe) and to work with the devices that came with it. ARM? PCIe is almost irrelevant. For ARM servers, if they ever take off, sure. But 99.99% of ARM is about their own SoC's, and so "x86 test coverage" is simply not an issue. How much of an issue is it for Power? Maybe you decide it's not a big deal. Then all the above is almost irrelevant. Linus
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote: > To clarify: are you saying that on x86 you need a wmb() prior to a writel > if you want that writel to be ordered after prior writes to memory? Is this > specific to WC memory or some other non-standard attribute? > > The only reason we have wmb() inside writel() on arm, arm64 and power is for > parity with x86 because Linus (CC'd) wanted architectures to order I/O vs > memory by default so that it was easier to write portable drivers. The > performance impact of that implicit barrier is non-trivial, but we want the > driver portability and I went as far as adding generic _relaxed versions for > the cases where ordering isn't required. You seem to be suggesting that none > of this is necessary and drivers would already run into problems on x86 if > they didn't use wmb() explicitly in conjunction with writel, which I find > hard to believe and is in direct contradiction with the current Linux I/O > memory model (modulo the broken example in the dma_*mb section of > memory-barriers.txt). Another clarification while we are at it All of this only applies to concurrent access by the CPU and the device to memory allocate with dma_alloc_coherent(). For memory "mapped" into the DMA domain via dma_map_* then an extra dma_sync_for_* is needed. In most useful server cases etc... these latter are NOPs, but architecture without full DMA cache coherency or using swiotlb, dma_map_* might maintain bounce buffers or play additional cache flushing tricks. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt > wrote: > > > > Well, we need to clarify that once and for all, because as I wrote > > earlier, it was decreed by Linus more than a decade ago that writel > > would be fully ordered by itself vs. previous memory stores (at least > > on UC memory). > > Yes. > > So "writel()" needs to be ordered with respect to other writel() uses > on the same thread. Anything else *will* break drivers. Obviously, the > drivers may then do magic to say "do write combining etc", but that > magic will be architecture-specific. > > The other issue is that "writel()" needs to be ordered wrt other CPU's > doing "writel()" if those writel's are in a spinlocked region. .../... The discussion at hand is about dma_buffer->foo = 1;/* WB */ writel(KICK, DMA_KICK_REGISTER);/* UC */ (The WC case is something else, let's not mix things up just yet) IE, a store to normal WB cache memory followed by a writel to a device which will then DMA from that buffer. Back in the days, we did require on powerpc a wmb() between these, but you made the point that x86 didn't and driver writers would never get that right. We decided to go conservative, added the necessary barrier inside writel, so did ARM and it became the norm that writel is also fully ordered vs. previous stores to memory *by the same CPU* of course (or protected by the same spinlock). Now it appears that this wasn't fully understood back then, and some people are now saying that x86 might not even provide that semantic always. So a number (fairly large) of drivers have been adding wmb() in those case, while others haven't, and it's a mess. The mess is compounded by the fact that if writel is now defined to *not* provide that ordering guarantee, then writel_relaxed() is pointless since all it is defined to relax is precisely the above ordering guarantee. So I want to get to the bottom of this once and for all so we can have well defined and documented semantics and stop having drivers do random things that may or may not work on some or all architectures (including x86 !). Quite note about the spinlock case... In fact this is the only case you did allow back then to be relaxed. In theory a writel followed by a spin_unlock requires an mmiowb (which is the only point of that barrier in fact). This was done because an arch (I think ia64) had a hard time getting MMIOs from multiple CPUs get in order vs. a lock and required an expensive access to the PCI host bridge to do so. Back then, on powerpc, we chose not to allow that relaxing and instead added code to our writel to set a per-cpu flag which would cause the next spin_unlock to use a stronger barrier than usual. We do need to clarify this as well, but let's start with the most basic one first, there is enough confusion already. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt wrote: > > Well, we need to clarify that once and for all, because as I wrote > earlier, it was decreed by Linus more than a decade ago that writel > would be fully ordered by itself vs. previous memory stores (at least > on UC memory). Yes. So "writel()" needs to be ordered with respect to other writel() uses on the same thread. Anything else *will* break drivers. Obviously, the drivers may then do magic to say "do write combining etc", but that magic will be architecture-specific. The other issue is that "writel()" needs to be ordered wrt other CPU's doing "writel()" if those writel's are in a spinlocked region. So it's not that "writel()" needs to be ordered wrt the spinlock itself, but you *do* need to honor ordering if you have something like this: spin_lock(&somelock); writel(a); writel(b); spin_unlock(&somelock); and if two CPU's run the above code "at the same time", then the *ONLY* acceptable sequence is abab. You cannot, and must not, ever see "aabb" at the device, for example, because of how the writel would basically leak out of the spinlock. That sounds "obvious", but dammit, a lot of architectures got that wrong, afaik. Linus
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 14:54 -0700, Alexander Duyck wrote: > On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt > wrote: > > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: > > > combined buffers. > > > > > > Alex: > > > "Don't bother. I can tell you right now that for x86 you have to have a > > > wmb() before the writel(). > > > > No, this isn't the semantics of writel. You shouldn't need it unless > > something changed and we need to revisit our complete understanding of > > *all* MMIO accessor semantics. > > The issue seems to be that there have been two different ways of > dealing with this. There has historically been a number of different > drivers that have been carrying this wmb() workaround since something > like 2002. I get that the semantics for writel might have changed > since then, but those of us who already have the wmb() in our drivers > will be very wary of anyone wanting to go through and remove them > since writel is supposed to be "good enough". I would much rather err > on the side of caution here. > > I view the wmb() + writel_relaxed() as more of a driver owning and > handling this itself. Besides in the Intel Ethernet driver case it is > better performance as our wmb() placement for us also provides a > secondary barrier so we don't need to add a separate smp_wmb() to deal > with a potential race we have with the Tx cleanup. > > > At least for UC space, it has always been accepted (and enforced) that > > writel would not require any other barrier to order vs. previous stores > > to memory. > > So the one thing I would question here is if this is UC vs UC or if > this extends to other types as well? So for x86 we could find > references to Write Combining being flushed by a write to UC memory, > however I have yet to find a clear explanation of what a write to UC > does to WB. Well, this is the standard write memory + trigger DMA case, the one specific case for which Linus was adamant we don't need another barrier back then ... > My personal inclination would be to err on the side of > caution. Which means writel_relaxed is now pointless ? We need clear semantics here. In this case the "side of caution" means we are randomly doing things not understanding what really happens and that makes me *more* nervous. > I just don't want us going through and removing the wmb() > calls because it "should" work. I would want to know for certain it > will work. We need to know for certain anyway. Otherwise, all the drivers that do not have wmb's are potentially broken. So I dont agree with the status quo. We need to establish precisely what x86 does, decide what we want the semantic of writel to be, and implement things accordingly. Ben.
Re: RFC on writel and writel_relaxed
On 3/27/2018 5:54 PM, Alexander Duyck wrote: > I view the wmb() + writel_relaxed() as more of a driver owning and > handling this itself. Besides in the Intel Ethernet driver case it is > better performance as our wmb() placement for us also provides a > secondary barrier so we don't need to add a separate smp_wmb() to deal > with a potential race we have with the Tx cleanup. Thanks for the reminder. I forgot about the double barrier optimization. wmb() + writel_relaxed() seems to be the best option for Intel network drivers at this moment. Otherwise, we'll have to remove wmb() and throw in smp barriers there like you mentioned. I'll leave the changes in the Intel drivers alone. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: >> combined buffers. >> >> Alex: >> "Don't bother. I can tell you right now that for x86 you have to have a >> wmb() before the writel(). > > No, this isn't the semantics of writel. You shouldn't need it unless > something changed and we need to revisit our complete understanding of > *all* MMIO accessor semantics. The issue seems to be that there have been two different ways of dealing with this. There has historically been a number of different drivers that have been carrying this wmb() workaround since something like 2002. I get that the semantics for writel might have changed since then, but those of us who already have the wmb() in our drivers will be very wary of anyone wanting to go through and remove them since writel is supposed to be "good enough". I would much rather err on the side of caution here. I view the wmb() + writel_relaxed() as more of a driver owning and handling this itself. Besides in the Intel Ethernet driver case it is better performance as our wmb() placement for us also provides a secondary barrier so we don't need to add a separate smp_wmb() to deal with a potential race we have with the Tx cleanup. > At least for UC space, it has always been accepted (and enforced) that > writel would not require any other barrier to order vs. previous stores > to memory. So the one thing I would question here is if this is UC vs UC or if this extends to other types as well? So for x86 we could find references to Write Combining being flushed by a write to UC memory, however I have yet to find a clear explanation of what a write to UC does to WB. My personal inclination would be to err on the side of caution. I just don't want us going through and removing the wmb() calls because it "should" work. I would want to know for certain it will work. - Alex
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: > combined buffers. > > Alex: > "Don't bother. I can tell you right now that for x86 you have to have a > wmb() before the writel(). No, this isn't the semantics of writel. You shouldn't need it unless something changed and we need to revisit our complete understanding of *all* MMIO accessor semantics. At least for UC space, it has always been accepted (and enforced) that writel would not require any other barrier to order vs. previous stores to memory. > Based on the comment in > (https://www.spinics.net/lists/linux-rdma/msg62666.html): > Replacing wmb() + writel() with wmb() + writel_relaxed() will work on > PPC, it will just not give you a benefit today. > > I say the patch set stays. This gives benefit on ARM, and has no > effect on x86 and PowerPC. If you want to look at trying to optimize > things further on PowerPC and such then go for it in terms of trying > to implement the writel_relaxed(). Otherwise I say we call the ARM > goodness a win and don't get ourselves too wrapped up in trying to fix > this for all architectures."
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 11:54 -0700, Alexander Duyck wrote: > As far as I know the code has been this way for a while, something > like 2002, when the barrier was already present in e1000. However > there it was calling out weakly ordered models "such as IA-64". Since > then pretty much all the hardware based network drivers at this point > have similar code floating around with wmb() in place to prevent > issues on weak ordered memory systems. > > So in any case we still need to be careful as there are architectures > that are depending on this even if they might not be x86. :-/ Well, we need to clarify that once and for all, because as I wrote earlier, it was decreed by Linus more than a decade ago that writel would be fully ordered by itself vs. previous memory stores (at least on UC memory). This is why we added sync's to writel on powerpc and later ARM added similar barriers to theirs. This is also why writel_relaxed was added (though much later), since what writel_relaxed does is to life that specific requirement. IE. If what you say is true and wmb() is needed on x86, then writel_relaxed is now completely useless... Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote: > > Can we say the same thing for iowrite32() and iowrite32be(). I also see > > wmb() > > in front of these. > > I don't think so. My reading of memory-barriers.txt says that writeX might > expand to outX, and outX is not ordered with respect to other types of > memory. Ugh ? My understanding of HW at least is the exact opposite. outX is *more* ordered if anything, than any other accessors. IO space is completely synchronous, non posted and ordered afaik. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 08:12 -0600, Jason Gunthorpe wrote: > > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches) > > > > I will have to just remove the wmb and keep writel, then repost. > > Okay, but before you do that, can we get a statement how this works > for WC? > > Some of these writels are to WC memory, do they need the wmb()?!? This is an issue as we don't have well defined semantics for WC. At this point, I would suggest staying away from that (ie, not changing them). We need to look into it. I know for example that on powerpc I cannot give you any weaker semantic on WC for writel (I have to put a full sync in there), but I am trying to see if I can make writel_relaxed both work with the existing semantics and provide combining. But it's not yet a given (our weaker IO barrier, eieio, isn't architecturally defined to do anything on G=0 space, looking with the HW guys at what the HW actually does). Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote: > On 3/27/2018 7:02 AM, Will Deacon wrote: > > - See Documentation/DMA-API.txt for more information on consistent > > memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the MMIO region. The cheaper > > + writel_relaxed() does not provide this guarantee and must not be used > > + here. > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > in front of these. Yes, they should have the same semantics as writel Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck > wrote: >> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon wrote: > > 11.10 STORE BUFFER > Intel 64 and IA-32 processors temporarily store each write (store) to > memory in a store buffer. The store buffer > improves processor performance by allowing the processor to continue > executing instructions without having to > wait until a write to memory and/or to a cache is complete. It also > allows writes to be delayed for more efficient use > of memory-access bus cycles. > In general, the existence of the store buffer is transparent to > software, even in systems that use multiple processors. > The processor ensures that write operations are always carried out in > program order. It also insures that the > contents of the store buffer are always drained to memory in the > following situations: > • When an exception or interrupt is generated. > • (P6 and more recent processor families only) When a serializing > instruction is executed. > • When an I/O instruction is executed. I guess I/O instruction is still ambiguous on x86, it may just refer to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO area. Here's a link to a reply from Linus that I found on this topic: http://yarchive.net/comp/linux/write_combining.html Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck wrote: > On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon wrote: >>> >>> Sinan >>> "We are being told that if you use writel(), then you don't need a wmb() on >>> all architectures." >>> >>> Alex: >>> "I'm not sure who told you that but that is incorrect, at least for >>> x86. If you attempt to use writel() without the wmb() we will have to >>> NAK the patches. We will accept the wmb() with writel_releaxed() since >>> that solves things for ARM." >>> >>> > Jason is seeking behavior clarification for write combined buffers. >>> >>> Alex: >>> "Don't bother. I can tell you right now that for x86 you have to have a >>> wmb() before the writel(). >> >> To clarify: are you saying that on x86 you need a wmb() prior to a writel >> if you want that writel to be ordered after prior writes to memory? Is this >> specific to WC memory or some other non-standard attribute? > > Note, I am not a CPU guy so this is just my interpretation. It is my > understanding that the wmb(), aka sfence, is needed on x86 to sort out > writes between Write-back(WB) system memory and Strong Uncacheable > (UC) MMIO accesses. > > I was hoping to be able to cite something in the software developers > manual > (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf), > but that tends to be pretty vague. I have re-read section 22.34 > (volume 3B) several times and I am still not clear on if it says we > need the sfence or not. It is a matter of figuring out what the impact > of store buffers and caching are for WB versus UC memory. Here is what I found regarding the store buffer in that document: 11.10 STORE BUFFER Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles. In general, the existence of the store buffer is transparent to software, even in systems that use multiple processors. The processor ensures that write operations are always carried out in program order. It also insures that the contents of the store buffer are always drained to memory in the following situations: • When an exception or interrupt is generated. • (P6 and more recent processor families only) When a serializing instruction is executed. • When an I/O instruction is executed. • When a LOCK operation is performed. • (P6 and more recent processor families only) When a BINIT operation is performed. • (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores. • (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores. The discussion of write ordering in Section 8.2, “Memory Ordering,” gives a detailed description of the operation of the store buffer. Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon wrote: > Hi Alex, > > On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote: >> +netdev, +Alex >> >> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: >> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: >> >> Most of the drivers have a unwound loop with writeq() or something to >> >>> do it. >> >> >> >> But isn't the writeq() barrier much more expensive than anything you'd >> >> do in function calls? >> > >> > It is for us, and will break any write combining. >> > >> > The same document says that _relaxed() does not give that guarentee. >> > >> > The lwn articule on this went into some depth on the interaction with >> > spinlocks. >> > >> > As far as I can see, containment in a spinlock seems to be the only >> > different between writel and writel_relaxed.. >> >> I was always puzzled by this: The intention of _relaxed() on ARM >> (where it originates) was to skip the barrier that serializes DMA >> with MMIO, not to skip the serialization between MMIO and locks. >> >>> >> >>> But that was never a requirement of writel(), >> >>> Documentation/memory-barriers.txt gives an explicit example demanding >> >>> the wmb() before writel() for ordering system memory against writel. >> > >> > This is a bug in the documentation. >> > >> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). >> >> Adding Alexander Duyck to Cc, he added that section as part of >> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and >> >> dma_wmb()"). Also adding the other people that were involved with that. >> > >> > Linus himself made it very clear years ago. readl and writel have to >> > order vs memory accesses. >> > >> >>> I actually have no idea why ARM had that barrier, I always assumed it >> >>> was to give program ordering to the accesses and that _relaxed allowed >> >>> re-ordering (the usual meaning of relaxed).. >> >>> >> >>> But the barrier document makes it pretty clear that the only >> >>> difference between the two is spinlock containment, and WillD wrote >> >>> this text, so I belive it is accurate for ARM. >> >>> >> >>> Very confusing. >> >> >> >> It does mention serialization with both DMA and locks in the >> >> section about readX_relaxed()/writeX_relaxed(). The part >> >> about DMA is very clear here, and I must have just forgotten >> >> the exact semantics with regards to spinlocks. I'm still not >> >> sure what prevents a writel() from leaking out the end of a >> >> spinlock section that doesn't happen with writel_relaxed(), since >> >> the barrier in writel() comes before the access, and the >> >> spin_unlock() shouldn't affect the external buses. >> > >> > So... >> > >> > Historically, what happened is that we (we means whoever participated >> > in the discussion on the list with Linus calling the shots really) >> > decided that there was no sane way for drivers to understand a world >> > where readl/writel didn't fully order things vs. memory accesses (ie, >> > DMA). >> > >> > So it should always be correct to do: >> > >> > - Write to some in-memory buffer >> > - writel() to kick the DMA read of that buffer >> > >> > without any extra barrier. >> > >> > The spinlock situation however got murky. Mostly that came up because >> > on architecture (I forgot who, might have been ia64) has a hard time >> > providing that consistency without making writel insanely expensive. >> > >> > Thus they created mmiowb whose main purpose was precisely to order >> > writel with a following spin_unlock. >> > >> > I decided not to go down that path on power because getting all drivers >> > "fixed" to do the right thing was going to be a losing battle, and >> > instead added per-cpu tracking of writel in order to "escalate" to a >> > heavier barrier in spin_unlock itself when necessary. >> > >> > Now, all this happened more than a decade ago and it's possible that >> > the understanding or expectations "shifted" over time... >> >> Alex is raising concerns on the netdev list. >> >> Sinan >> "We are being told that if you use writel(), then you don't need a wmb() on >> all architectures." >> >> Alex: >> "I'm not sure who told you that but that is incorrect, at least for >> x86. If you attempt to use writel() without the wmb() we will have to >> NAK the patches. We will accept the wmb() with writel_releaxed() since >> that solves things for ARM." >> >> > Jason is seeking behavior clarification for write combined buffers. >> >> Alex: >> "Don't bother. I can tell you right now that for x86 you have to have a >> wmb() before the writel(). > > To clarify: are you saying that on x86 you need a wmb() prior to a writel > if you want that writel to be ordered after prior writes to memory? Is this > specific to WC memory or some other non-standard attribute? Note, I am not a CPU guy so this is just my interpretation. It is my understanding that the wmb(), aka sfence, is needed on x86 to sort ou
Re: RFC on writel and writel_relaxed
Hi, On 27-03-2018 15:46, Sinan Kaya wrote: > > Sinan > "We are being told that if you use writel(), then you don't need a wmb() on > all architectures." > > Alex: > "I'm not sure who told you that but that is incorrect, at least for > x86. If you attempt to use writel() without the wmb() we will have to > NAK the patches. We will accept the wmb() with writel_releaxed() since > that solves things for ARM." > So this means we should always use writel() + wmb() in *all* accesses? I don't know about x86 but arc architecture doesn't have a wmb() in the writel() function (in some configs). I see the point in net drivers while you have dma + io accesses but for most drivers this shouldn't be needed, right? What about ordering of writes? Is it guaranteed that one write will happen before the next one ? Best Regards, Jose Miguel Abreu
Re: RFC on writel and writel_relaxed
Hi Alex, On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote: > +netdev, +Alex > > On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: > >> Most of the drivers have a unwound loop with writeq() or something to > >>> do it. > >> > >> But isn't the writeq() barrier much more expensive than anything you'd > >> do in function calls? > > > > It is for us, and will break any write combining. > > > > The same document says that _relaxed() does not give that guarentee. > > > > The lwn articule on this went into some depth on the interaction with > > spinlocks. > > > > As far as I can see, containment in a spinlock seems to be the only > > different between writel and writel_relaxed.. > > I was always puzzled by this: The intention of _relaxed() on ARM > (where it originates) was to skip the barrier that serializes DMA > with MMIO, not to skip the serialization between MMIO and locks. > >>> > >>> But that was never a requirement of writel(), > >>> Documentation/memory-barriers.txt gives an explicit example demanding > >>> the wmb() before writel() for ordering system memory against writel. > > > > This is a bug in the documentation. > > > >> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > >> Adding Alexander Duyck to Cc, he added that section as part of > >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > >> dma_wmb()"). Also adding the other people that were involved with that. > > > > Linus himself made it very clear years ago. readl and writel have to > > order vs memory accesses. > > > >>> I actually have no idea why ARM had that barrier, I always assumed it > >>> was to give program ordering to the accesses and that _relaxed allowed > >>> re-ordering (the usual meaning of relaxed).. > >>> > >>> But the barrier document makes it pretty clear that the only > >>> difference between the two is spinlock containment, and WillD wrote > >>> this text, so I belive it is accurate for ARM. > >>> > >>> Very confusing. > >> > >> It does mention serialization with both DMA and locks in the > >> section about readX_relaxed()/writeX_relaxed(). The part > >> about DMA is very clear here, and I must have just forgotten > >> the exact semantics with regards to spinlocks. I'm still not > >> sure what prevents a writel() from leaking out the end of a > >> spinlock section that doesn't happen with writel_relaxed(), since > >> the barrier in writel() comes before the access, and the > >> spin_unlock() shouldn't affect the external buses. > > > > So... > > > > Historically, what happened is that we (we means whoever participated > > in the discussion on the list with Linus calling the shots really) > > decided that there was no sane way for drivers to understand a world > > where readl/writel didn't fully order things vs. memory accesses (ie, > > DMA). > > > > So it should always be correct to do: > > > > - Write to some in-memory buffer > > - writel() to kick the DMA read of that buffer > > > > without any extra barrier. > > > > The spinlock situation however got murky. Mostly that came up because > > on architecture (I forgot who, might have been ia64) has a hard time > > providing that consistency without making writel insanely expensive. > > > > Thus they created mmiowb whose main purpose was precisely to order > > writel with a following spin_unlock. > > > > I decided not to go down that path on power because getting all drivers > > "fixed" to do the right thing was going to be a losing battle, and > > instead added per-cpu tracking of writel in order to "escalate" to a > > heavier barrier in spin_unlock itself when necessary. > > > > Now, all this happened more than a decade ago and it's possible that > > the understanding or expectations "shifted" over time... > > Alex is raising concerns on the netdev list. > > Sinan > "We are being told that if you use writel(), then you don't need a wmb() on > all architectures." > > Alex: > "I'm not sure who told you that but that is incorrect, at least for > x86. If you attempt to use writel() without the wmb() we will have to > NAK the patches. We will accept the wmb() with writel_releaxed() since > that solves things for ARM." > > > Jason is seeking behavior clarification for write combined buffers. > > Alex: > "Don't bother. I can tell you right now that for x86 you have to have a > wmb() before the writel(). To clarify: are you saying that on x86 you need a wmb() prior to a writel if you want that writel to be ordered after prior writes to memory? Is this specific to WC memory or some other non-standard attribute? The only reason we have wmb() inside writel() on arm, arm64 and power is for parity with x86 because Linus (CC'd) wanted architectures to order I/O vs memory by default so that it was easier to write portable drivers. The performance impact of that implicit barrier is non-trivial, but
Re: RFC on writel and writel_relaxed
+netdev, +Alex On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: >> Most of the drivers have a unwound loop with writeq() or something to >>> do it. >> >> But isn't the writeq() barrier much more expensive than anything you'd >> do in function calls? > > It is for us, and will break any write combining. > > The same document says that _relaxed() does not give that guarentee. > > The lwn articule on this went into some depth on the interaction with > spinlocks. > > As far as I can see, containment in a spinlock seems to be the only > different between writel and writel_relaxed.. I was always puzzled by this: The intention of _relaxed() on ARM (where it originates) was to skip the barrier that serializes DMA with MMIO, not to skip the serialization between MMIO and locks. >>> >>> But that was never a requirement of writel(), >>> Documentation/memory-barriers.txt gives an explicit example demanding >>> the wmb() before writel() for ordering system memory against writel. > > This is a bug in the documentation. > >> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). >> Adding Alexander Duyck to Cc, he added that section as part of >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and >> dma_wmb()"). Also adding the other people that were involved with that. > > Linus himself made it very clear years ago. readl and writel have to > order vs memory accesses. > >>> I actually have no idea why ARM had that barrier, I always assumed it >>> was to give program ordering to the accesses and that _relaxed allowed >>> re-ordering (the usual meaning of relaxed).. >>> >>> But the barrier document makes it pretty clear that the only >>> difference between the two is spinlock containment, and WillD wrote >>> this text, so I belive it is accurate for ARM. >>> >>> Very confusing. >> >> It does mention serialization with both DMA and locks in the >> section about readX_relaxed()/writeX_relaxed(). The part >> about DMA is very clear here, and I must have just forgotten >> the exact semantics with regards to spinlocks. I'm still not >> sure what prevents a writel() from leaking out the end of a >> spinlock section that doesn't happen with writel_relaxed(), since >> the barrier in writel() comes before the access, and the >> spin_unlock() shouldn't affect the external buses. > > So... > > Historically, what happened is that we (we means whoever participated > in the discussion on the list with Linus calling the shots really) > decided that there was no sane way for drivers to understand a world > where readl/writel didn't fully order things vs. memory accesses (ie, > DMA). > > So it should always be correct to do: > > - Write to some in-memory buffer > - writel() to kick the DMA read of that buffer > > without any extra barrier. > > The spinlock situation however got murky. Mostly that came up because > on architecture (I forgot who, might have been ia64) has a hard time > providing that consistency without making writel insanely expensive. > > Thus they created mmiowb whose main purpose was precisely to order > writel with a following spin_unlock. > > I decided not to go down that path on power because getting all drivers > "fixed" to do the right thing was going to be a losing battle, and > instead added per-cpu tracking of writel in order to "escalate" to a > heavier barrier in spin_unlock itself when necessary. > > Now, all this happened more than a decade ago and it's possible that > the understanding or expectations "shifted" over time... Alex is raising concerns on the netdev list. Sinan "We are being told that if you use writel(), then you don't need a wmb() on all architectures." Alex: "I'm not sure who told you that but that is incorrect, at least for x86. If you attempt to use writel() without the wmb() we will have to NAK the patches. We will accept the wmb() with writel_releaxed() since that solves things for ARM." > Jason is seeking behavior clarification for write combined buffers. Alex: "Don't bother. I can tell you right now that for x86 you have to have a wmb() before the writel(). Based on the comment in (https://www.spinics.net/lists/linux-rdma/msg62666.html): Replacing wmb() + writel() with wmb() + writel_relaxed() will work on PPC, it will just not give you a benefit today. I say the patch set stays. This gives benefit on ARM, and has no effect on x86 and PowerPC. If you want to look at trying to optimize things further on PowerPC and such then go for it in terms of trying to implement the writel_relaxed(). Otherwise I say we call the ARM goodness a win and don't get ourselves too wrapped up in trying to fix this for all architectures." > > Cheers, > Ben. > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote: > On 3/27/2018 7:02 AM, Will Deacon wrote: > > - See Documentation/DMA-API.txt for more information on consistent > > memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the MMIO region. The cheaper > > + writel_relaxed() does not provide this guarantee and must not be used > > + here. > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > in front of these. I don't think so. My reading of memory-barriers.txt says that writeX might expand to outX, and outX is not ordered with respect to other types of memory. Will
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote: > On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > > > I even see patches adding wmb() based on actual observed memory > > > corruption during testing on Intel: > > > > > > https://patchwork.kernel.org/patch/10177207/ > > > > > > So you think all of this is unnecessary and writel is totally strongly > > > ordered, even on multi-socket Intel? > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > If this is just about ordering accesses to coherent DMA, then using > dma_wmb() instead will be much better performance on arm/arm64. dma_wmb() is a NOP on x86, it was tested anyhow and didn't help this case.. Confusing, but probably not relevant to this discussion. Jason
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. Ah! So it is, too many things called 'db' in that driver.. One of the 'db' is also WC BAR memory.. Jason
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 08:22:55AM -0400, ok...@codeaurora.org wrote: > On 2018-03-27 07:23, Benjamin Herrenschmidt wrote: > >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: > >>> The interesting thing is that we do seem to have a whole LOT of these > >>> spurrious wmb before writel all over the tree, I suspect because of > >>> that incorrect recommendation in memory-barriers.txt. > >>> > >>> We should fix that. > >> > >>Maybe the problem is just that it's so counter-intuitive that we don't > >>need that barrier in Linux, when the hardware does need one on some > >>architectures. > >> > >>How about we define a barrier type instruction specifically for this > >>purpose, something like wmb_before_mmio() and have all architectures > >>define that to an empty macro? > > > >This is exactly what wmb() is about and exactly what Linux rejected > >back in the day (and in hindsight I agree with him). > > > >>That way, having correct code using wmb_before_mmio() will not > >>trigger an incorrect review comment that leads to extra wmb(). ;-) > > > >Ah, you mean have an empty macro that will always be empty on all > >architectures just to fool people ? :-) > > > >Not sure that will fly ... I think we just need to be documenting that > >stuff better and not have incorrect examples. Also a sweep to remove > >some useless ones like the one in e1000e would help. > > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches) > > I will have to just remove the wmb and keep writel, then repost. Okay, but before you do that, can we get a statement how this works for WC? Some of these writels are to WC memory, do they need the wmb()?!? Jason
Re: RFC on writel and writel_relaxed
On 3/27/2018 7:02 AM, Will Deacon wrote: > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() in front of these. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
RE: RFC on writel and writel_relaxed
> Fair enough. I'd rather people used _relaxed by default, but I have to admit > that it will probably just result in them getting things wrong... Certainly requiring the driver writes use explicit barriers should make them understand when and why they are needed - and then put in the correct ones. The problem I've had is that I have a good idea which barriers are needed but find that readl/writel seem to contain a lot of extra ones. Maybe the are required in some places, but the extra synchronising instructions could easily have measureable performance effects on hot paths. Drivers are likely to contain sequences like: read_io if (...) return write_mem ... write_mem barrier write_mem barrier write_io for things like ring updates. Where the 'mem' might actually be in io space. In such sequences not all the synchronising instructions are needed. I'm not at all sure it is easy to get the right set. David
Re: RFC on writel and writel_relaxed
On 2018-03-27 07:23, Benjamin Herrenschmidt wrote: On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: > The interesting thing is that we do seem to have a whole LOT of these > spurrious wmb before writel all over the tree, I suspect because of > that incorrect recommendation in memory-barriers.txt. > > We should fix that. Maybe the problem is just that it's so counter-intuitive that we don't need that barrier in Linux, when the hardware does need one on some architectures. How about we define a barrier type instruction specifically for this purpose, something like wmb_before_mmio() and have all architectures define that to an empty macro? This is exactly what wmb() is about and exactly what Linux rejected back in the day (and in hindsight I agree with him). That way, having correct code using wmb_before_mmio() will not trigger an incorrect review comment that leads to extra wmb(). ;-) Ah, you mean have an empty macro that will always be empty on all architectures just to fool people ? :-) Not sure that will fly ... I think we just need to be documenting that stuff better and not have incorrect examples. Also a sweep to remove some useless ones like the one in e1000e would help. I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches) I will have to just remove the wmb and keep writel, then repost. Some of these got applied. It will cause some churn for the maintainers. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 10:57 +0100, Will Deacon wrote: > > > > The interesting thing is that we do seem to have a whole LOT of these > > spurrious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Patch below. Thoughts? Looks good, we should probably also have a clearer (explicit) definition of that ordering in the driver-api doco. Now, to remove all those useless wmb's and find what other bugs they were papering over ... ;-) Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote: > can see it now has ownership. Note that, when using writel(), a prior > > wmb() is not needed to guarantee that the cache coherent memory writes > > have completed before writing to the cache incoherent MMIO region. > > The cheaper writel_relaxed() does not guarantee the DMA to be visible > > to the device and must not be used here. > > Fair enough. I'd rather people used _relaxed by default, but I have to admit > that it will probably just result in them getting things wrong. Just a tiny > bit of wordsmithing brings this to: I prefer people using writel() by default for the simple reason that 99% of writels out there are configuration stuff for which the performance difference doesn't matter, and people will just get it wrong. Let's focus on the rare fast path for optimisation. > > diff --git a/Documentation/memory-barriers.txt > b/Documentation/memory-barriers.txt > index a863009849a3..3247547d1c36 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. > + > + See the subsection "Kernel I/O barrier effects" for more information on > + relaxed I/O accessors and the Documentation/DMA-API.txt file for more > + information on consistent memory. > > > MMIO WRITE BARRIER > > > If you're happy with that, I'll send it as a proper patch. > > Cheers, > > Will
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 10:20:02PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote: > > > > > > This example adds a wmb() between two writes to a coherent DMA > > > area, it is definitely required there. I'm pretty sure I've never seen > > > any bug reports pointing to a missing wmb() between memory > > > and MMIO write accesses, but if you remember seeing them in the > > > list, maybe you can look again for some evidence of something going > > > wrong on x86 without it? > > > > If this is just about ordering accesses to coherent DMA, then using > > dma_wmb() instead will be much better performance on arm/arm64. > > Ah, something we should look into for powerpc as well, as we could use > an lwsync for that which is also cheaper than a full sync wmb does. > > dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP > conditional right ? Almost -- the slight change we have on arm64 is to say that it's "outer-shareable", which means it also orders non-cacheable accesses in the case that dma_alloc_coherent is used to allocate a consistent buffer for a non-coherent device. Will
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: > > The interesting thing is that we do seem to have a whole LOT of these > > spurrious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Maybe the problem is just that it's so counter-intuitive that we don't > need that barrier in Linux, when the hardware does need one on some > architectures. > > How about we define a barrier type instruction specifically for this > purpose, something like wmb_before_mmio() and have all architectures > define that to an empty macro? This is exactly what wmb() is about and exactly what Linux rejected back in the day (and in hindsight I agree with him). > That way, having correct code using wmb_before_mmio() will not > trigger an incorrect review comment that leads to extra wmb(). ;-) Ah, you mean have an empty macro that will always be empty on all architectures just to fool people ? :-) Not sure that will fly ... I think we just need to be documenting that stuff better and not have incorrect examples. Also a sweep to remove some useless ones like the one in e1000e would help. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote: > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > If this is just about ordering accesses to coherent DMA, then using > dma_wmb() instead will be much better performance on arm/arm64. Ah, something we should look into for powerpc as well, as we could use an lwsync for that which is also cheaper than a full sync wmb does. dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP conditional right ? Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon wrote: > On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote: >> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon wrote: >> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > diff --git a/Documentation/memory-barriers.txt > b/Documentation/memory-barriers.txt > index a863009849a3..3247547d1c36 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. > + > + See the subsection "Kernel I/O barrier effects" for more information on > + relaxed I/O accessors and the Documentation/DMA-API.txt file for more > + information on consistent memory. > > > MMIO WRITE BARRIER > > > If you're happy with that, I'll send it as a proper patch. Looks good to me, thanks! Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon wrote: > > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > >> > - > >> > - See Documentation/DMA-API.txt for more information on consistent > >> > memory. > >> > + can see it now has ownership. Note that, when using writel(), a > >> > prior > >> > + wmb() is not needed to guarantee that the cache coherent memory > >> > writes > >> > + have completed before writing to the cache incoherent MMIO region. > >> > + If this ordering between incoherent MMIO and coherent memory > >> > regions > > One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd > prefer just "MMIO" here. At least I don't have the faintest clue what the > difference between "coherent MMIO" and "incoherent MMIO" would be ;-) Yes, you're right. I was just following the terminology that's already used here, but actually that seems not be used anywhere else in the document! I'll kill it. > >> > + is not required, writel_relaxed() can be used instead and is > >> > significantly > >> > + cheaper on some weakly-ordered architectures. > >> > >> I think that's a great improvement, but I'm a bit worried about > >> recommending > >> writel_relaxed() too much: I've seen a lot of drivers that just always use > >> writel_relaxed() over write(), and some of them get that wrong when they > >> don't understand the difference but end up using DMA without explicit > >> barriers anyway. > >> > >> Also, having an architecture-independent driver use wmb()+writel_relaxed() > >> ends up being more expensive than just using write(). Not sure how to > >> best phrase it though. > > > > Perhaps I add reword that with a simple example to say: > > > > If this ordering between incoherent MMIO and coherent memory regions > > is not required (e.g. in a sequence of accesses all to the MMIO region) > > [...] > > > > since that seems to be the usual case where the _relaxed accessors help. > > That still doesn't quite capture what I'd like driver writes to do: in essence > I would recommend them to use writel() all the time, except in performance > critical code that has been shown to be correct and has a comment to explain > why _relaxed() is ok in that particular function. > > Maybe it can just be rephrased to warn against the use of writel_relaxed() > here, and explain the difference that way: > > can see it now has ownership. Note that, when using writel(), a prior > wmb() is not needed to guarantee that the cache coherent memory writes > have completed before writing to the cache incoherent MMIO region. > The cheaper writel_relaxed() does not guarantee the DMA to be visible > to the device and must not be used here. Fair enough. I'd rather people used _relaxed by default, but I have to admit that it will probably just result in them getting things wrong. Just a tiny bit of wordsmithing brings this to: diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index a863009849a3..3247547d1c36 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: /* assign ownership */ desc->status = DEVICE_OWN; - /* force memory to sync before notifying device via MMIO */ - wmb(); - /* notify device of new descriptors */ writel(DESC_NOTIFY, doorbell); } @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions: The dma_rmb() allows us guarantee the device has released ownership before we read the data from the descriptor, and the dma_wmb() allows us to guarantee the data is written to the descriptor before the device - can see it now has ownership. The wmb() is needed to guarantee that the - cache coherent memory writes have completed before attempting a write to - the cache incoherent MMIO region. - - See Documentation/DMA-API.txt for more information on consistent memory. + can see it now has ownership. Note that, when using writel(), a prior + wmb() is not needed to guarantee that the cache coherent memory writes + have completed before writing to the MMIO region. The cheaper + writel_relaxed() does not provide this guarantee and must not be used + here. + + See the subsection "Kernel I/O barrier effects" for more information on + relaxed I/O accessors and the Documentation/DMA-API.txt file for more + information on consistent memory. MMIO WRITE BARRIER If you're happy with that, I'll send it as a proper patch. Cheers, Will
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon wrote: > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: >> > - >> > - See Documentation/DMA-API.txt for more information on consistent >> > memory. >> > + can see it now has ownership. Note that, when using writel(), a >> > prior >> > + wmb() is not needed to guarantee that the cache coherent memory >> > writes >> > + have completed before writing to the cache incoherent MMIO region. >> > + If this ordering between incoherent MMIO and coherent memory regions One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd prefer just "MMIO" here. At least I don't have the faintest clue what the difference between "coherent MMIO" and "incoherent MMIO" would be ;-) >> > + is not required, writel_relaxed() can be used instead and is >> > significantly >> > + cheaper on some weakly-ordered architectures. >> >> I think that's a great improvement, but I'm a bit worried about recommending >> writel_relaxed() too much: I've seen a lot of drivers that just always use >> writel_relaxed() over write(), and some of them get that wrong when they >> don't understand the difference but end up using DMA without explicit >> barriers anyway. >> >> Also, having an architecture-independent driver use wmb()+writel_relaxed() >> ends up being more expensive than just using write(). Not sure how to >> best phrase it though. > > Perhaps I add reword that with a simple example to say: > > If this ordering between incoherent MMIO and coherent memory regions > is not required (e.g. in a sequence of accesses all to the MMIO region) > [...] > > since that seems to be the usual case where the _relaxed accessors help. That still doesn't quite capture what I'd like driver writes to do: in essence I would recommend them to use writel() all the time, except in performance critical code that has been shown to be correct and has a comment to explain why _relaxed() is ok in that particular function. Maybe it can just be rephrased to warn against the use of writel_relaxed() here, and explain the difference that way: can see it now has ownership. Note that, when using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the cache incoherent MMIO region. The cheaper writel_relaxed() does not guarantee the DMA to be visible to the device and must not be used here. Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon wrote: > > > > > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001 > > From: Will Deacon > > Date: Tue, 27 Mar 2018 10:49:58 +0100 > > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering > > example > > > > The section of memory-barriers.txt that describes the dma_Xmb() barriers > > has an incorrect example claiming that a wmb() is required after writing > > to coherent memory in order for those writes to be visible to a device > > before a subsequent MMIO access using writel() can reach the device. > > > > In fact, this ordering guarantee is provided (at significant cost on some > > architectures such as arm and power) by writel, so the wmb() is not > > necessary. writel_relaxed exists for cases where this ordering is not > > required. > > > > Fix the example and update the text to make this clearer. > > > > Cc: Benjamin Herrenschmidt > > Cc: Arnd Bergmann > > Cc: Jason Gunthorpe > > Cc: "Paul E. McKenney" > > Cc: Peter Zijlstra > > Cc: Ingo Molnar > > Cc: Jonathan Corbet > > Reported-by: Sinan Kaya > > Signed-off-by: Will Deacon > > --- > > Documentation/memory-barriers.txt | 18 ++ > > 1 file changed, 10 insertions(+), 8 deletions(-) > > > > diff --git a/Documentation/memory-barriers.txt > > b/Documentation/memory-barriers.txt > > index a863009849a3..2556b4b0e6f9 100644 > > --- a/Documentation/memory-barriers.txt > > +++ b/Documentation/memory-barriers.txt > > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > > /* assign ownership */ > > desc->status = DEVICE_OWN; > > > > - /* force memory to sync before notifying device via MMIO */ > > - wmb(); > > - > > /* notify device of new descriptors */ > > writel(DESC_NOTIFY, doorbell); > > } > > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions: > > The dma_rmb() allows us guarantee the device has released ownership > > before we read the data from the descriptor, and the dma_wmb() allows > > us to guarantee the data is written to the descriptor before the > > device > > - can see it now has ownership. The wmb() is needed to guarantee that > > the > > - cache coherent memory writes have completed before attempting a write > > to > > - the cache incoherent MMIO region. > > - > > - See Documentation/DMA-API.txt for more information on consistent > > memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the cache incoherent MMIO region. > > + If this ordering between incoherent MMIO and coherent memory regions > > + is not required, writel_relaxed() can be used instead and is > > significantly > > + cheaper on some weakly-ordered architectures. > > I think that's a great improvement, but I'm a bit worried about recommending > writel_relaxed() too much: I've seen a lot of drivers that just always use > writel_relaxed() over write(), and some of them get that wrong when they > don't understand the difference but end up using DMA without explicit > barriers anyway. > > Also, having an architecture-independent driver use wmb()+writel_relaxed() > ends up being more expensive than just using write(). Not sure how to > best phrase it though. Perhaps I add reword that with a simple example to say: If this ordering between incoherent MMIO and coherent memory regions is not required (e.g. in a sequence of accesses all to the MMIO region) [...] since that seems to be the usual case where the _relaxed accessors help. Will
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon wrote: > > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001 > From: Will Deacon > Date: Tue, 27 Mar 2018 10:49:58 +0100 > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering > example > > The section of memory-barriers.txt that describes the dma_Xmb() barriers > has an incorrect example claiming that a wmb() is required after writing > to coherent memory in order for those writes to be visible to a device > before a subsequent MMIO access using writel() can reach the device. > > In fact, this ordering guarantee is provided (at significant cost on some > architectures such as arm and power) by writel, so the wmb() is not > necessary. writel_relaxed exists for cases where this ordering is not > required. > > Fix the example and update the text to make this clearer. > > Cc: Benjamin Herrenschmidt > Cc: Arnd Bergmann > Cc: Jason Gunthorpe > Cc: "Paul E. McKenney" > Cc: Peter Zijlstra > Cc: Ingo Molnar > Cc: Jonathan Corbet > Reported-by: Sinan Kaya > Signed-off-by: Will Deacon > --- > Documentation/memory-barriers.txt | 18 ++ > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/Documentation/memory-barriers.txt > b/Documentation/memory-barriers.txt > index a863009849a3..2556b4b0e6f9 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the cache incoherent MMIO region. > + If this ordering between incoherent MMIO and coherent memory regions > + is not required, writel_relaxed() can be used instead and is > significantly > + cheaper on some weakly-ordered architectures. I think that's a great improvement, but I'm a bit worried about recommending writel_relaxed() too much: I've seen a lot of drivers that just always use writel_relaxed() over write(), and some of them get that wrong when they don't understand the difference but end up using DMA without explicit barriers anyway. Also, having an architecture-independent driver use wmb()+writel_relaxed() ends up being more expensive than just using write(). Not sure how to best phrase it though. Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 11:44:22AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt > wrote: > > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > >> > >> I'm pretty sure I've never seen > >> any bug reports pointing to a missing wmb() between memory > >> and MMIO write accesses, but if you remember seeing them in the > >> list, maybe you can look again for some evidence of something going > >> wrong on x86 without it? > > > > The interesting thing is that we do seem to have a whole LOT of these > > spurrious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Maybe the problem is just that it's so counter-intuitive that we don't > need that barrier in Linux, when the hardware does need one on some > architectures. > > How about we define a barrier type instruction specifically for this > purpose, something like wmb_before_mmio() and have all architectures > define that to an empty macro? > > That way, having correct code using wmb_before_mmio() will not > trigger an incorrect review comment that leads to extra wmb(). ;-) Please don't add more barriers :)! I think that will make it even more difficult to understand how to use the ones we already have -- the problem here seems to be that the documentation that was added for the dma_* barriers got this wrong, but it was at least in contradiction with the section elsewhere in memory-barriers.txt that describes the relaxed I/O accessors. I guess somebody could hack checkpatch to look for back-to-back wmb/writel sequences? I suspect we could do something with coccinelle too. Will
Re: RFC on writel and writel_relaxed
[+ locking/ordering/docs people] On Tue, Mar 27, 2018 at 07:56:59PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > > > I even see patches adding wmb() based on actual observed memory > > > corruption during testing on Intel: > > > > > > https://patchwork.kernel.org/patch/10177207/ > > > > > > So you think all of this is unnecessary and writel is totally strongly > > > ordered, even on multi-socket Intel? > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. > > Ah you are right, I incorrectly assumed that the "prod_db" function was > an MMIO. So we do NOT have a counter example where wmb is needed on > x86, pfiew ! :-) > > > I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > The interesting thing is that we do seem to have a whole LOT of these > spurrious wmb before writel all over the tree, I suspect because of > that incorrect recommendation in memory-barriers.txt. > > We should fix that. Patch below. Thoughts? Will --->8 >From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 27 Mar 2018 10:49:58 +0100 Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering example The section of memory-barriers.txt that describes the dma_Xmb() barriers has an incorrect example claiming that a wmb() is required after writing to coherent memory in order for those writes to be visible to a device before a subsequent MMIO access using writel() can reach the device. In fact, this ordering guarantee is provided (at significant cost on some architectures such as arm and power) by writel, so the wmb() is not necessary. writel_relaxed exists for cases where this ordering is not required. Fix the example and update the text to make this clearer. Cc: Benjamin Herrenschmidt Cc: Arnd Bergmann Cc: Jason Gunthorpe Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Jonathan Corbet Reported-by: Sinan Kaya Signed-off-by: Will Deacon --- Documentation/memory-barriers.txt | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index a863009849a3..2556b4b0e6f9 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: /* assign ownership */ desc->status = DEVICE_OWN; - /* force memory to sync before notifying device via MMIO */ - wmb(); - /* notify device of new descriptors */ writel(DESC_NOTIFY, doorbell); } @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions: The dma_rmb() allows us guarantee the device has released ownership before we read the data from the descriptor, and the dma_wmb() allows us to guarantee the data is written to the descriptor before the device - can see it now has ownership. The wmb() is needed to guarantee that the - cache coherent memory writes have completed before attempting a write to - the cache incoherent MMIO region. - - See Documentation/DMA-API.txt for more information on consistent memory. + can see it now has ownership. Note that, when using writel(), a prior + wmb() is not needed to guarantee that the cache coherent memory writes + have completed before writing to the cache incoherent MMIO region. + If this ordering between incoherent MMIO and coherent memory regions + is not required, writel_relaxed() can be used instead and is significantly + cheaper on some weakly-ordered architectures. + + See the subsection "Kernel I/O barrier effects" for more information on + relaxed I/O accessors and the Documentation/DMA-API.txt file for more + information on consistent memory. MMIO WRITE BARRIER -- 2.1.4
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: >> >> I'm pretty sure I've never seen >> any bug reports pointing to a missing wmb() between memory >> and MMIO write accesses, but if you remember seeing them in the >> list, maybe you can look again for some evidence of something going >> wrong on x86 without it? > > The interesting thing is that we do seem to have a whole LOT of these > spurrious wmb before writel all over the tree, I suspect because of > that incorrect recommendation in memory-barriers.txt. > > We should fix that. Maybe the problem is just that it's so counter-intuitive that we don't need that barrier in Linux, when the hardware does need one on some architectures. How about we define a barrier type instruction specifically for this purpose, something like wmb_before_mmio() and have all architectures define that to an empty macro? That way, having correct code using wmb_before_mmio() will not trigger an incorrect review comment that leads to extra wmb(). ;-) Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. I'm pretty sure I've never seen > any bug reports pointing to a missing wmb() between memory > and MMIO write accesses, but if you remember seeing them in the > list, maybe you can look again for some evidence of something going > wrong on x86 without it? If this is just about ordering accesses to coherent DMA, then using dma_wmb() instead will be much better performance on arm/arm64. Will
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. Ah you are right, I incorrectly assumed that the "prod_db" function was an MMIO. So we do NOT have a counter example where wmb is needed on x86, pfiew ! :-) > I'm pretty sure I've never seen > any bug reports pointing to a missing wmb() between memory > and MMIO write accesses, but if you remember seeing them in the > list, maybe you can look again for some evidence of something going > wrong on x86 without it? The interesting thing is that we do seem to have a whole LOT of these spurrious wmb before writel all over the tree, I suspect because of that incorrect recommendation in memory-barriers.txt. We should fix that. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe wrote: > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > I even see patches adding wmb() based on actual observed memory > corruption during testing on Intel: > > https://patchwork.kernel.org/patch/10177207/ > > So you think all of this is unnecessary and writel is totally strongly > ordered, even on multi-socket Intel? This example adds a wmb() between two writes to a coherent DMA area, it is definitely required there. I'm pretty sure I've never seen any bug reports pointing to a missing wmb() between memory and MMIO write accesses, but if you remember seeing them in the list, maybe you can look again for some evidence of something going wrong on x86 without it? Arnd
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote: > > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > > > Otherwise almost all drivers out there are broken which I very much > > > > > doubt :-) > > > > > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > > > find stuff like this: > > > > > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > > > > > /* Force memory writes to complete before letting h/w > > > > * know there are new descriptors to fetch. > > > > */ > > > > wmb(); > > > > > > > > > > > > It is *systemic* > > > > > > Yes, because they all copied e1000e :-) If you look at the comment in > > > there, it does say it's only for weakly ordered archs such as ia64, and > > > even then, probably predates Linus strong statement on the matter. > > > > Hahah, sure I'll buy that.. > > > > But still, if this really is the case, a *strong* statement in > > barriers.txt to that effect (and not an example demanding the wmb()!) > > would be very helpful for those of us that have to review driver code! > > I agree, and that Mellanox bug you pointed me to seems to indicate that > this may not even be true on x86 anymore ... However, with bugs like that it is hard to know what is going on.. It could be a CPU bug instead. > I think we might need to revisit this properly... I would love to hear a definitive statement from Intel on what wmb(); writel(); does on x86.. Sinan's patches are backwards if writel is ordered, instead of using writel_relaxed, they should be eliminating the wmb(). But there is no way patches like that could go ahead until barriers.txt is updated.. Jason
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote: > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > > Otherwise almost all drivers out there are broken which I very much > > > > doubt :-) > > > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > > find stuff like this: > > > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > > > /* Force memory writes to complete before letting h/w > > > * know there are new descriptors to fetch. > > > */ > > > wmb(); > > > > > > > > > It is *systemic* > > > > Yes, because they all copied e1000e :-) If you look at the comment in > > there, it does say it's only for weakly ordered archs such as ia64, and > > even then, probably predates Linus strong statement on the matter. > > Hahah, sure I'll buy that.. > > But still, if this really is the case, a *strong* statement in > barriers.txt to that effect (and not an example demanding the wmb()!) > would be very helpful for those of us that have to review driver code! I agree, and that Mellanox bug you pointed me to seems to indicate that this may not even be true on x86 anymore ... I think we might need to revisit this properly... Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > Otherwise almost all drivers out there are broken which I very much > > > doubt :-) > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > find stuff like this: > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > /* Force memory writes to complete before letting h/w > > * know there are new descriptors to fetch. > > */ > > wmb(); > > > > > > It is *systemic* > > Yes, because they all copied e1000e :-) If you look at the comment in > there, it does say it's only for weakly ordered archs such as ia64, and > even then, probably predates Linus strong statement on the matter. Hahah, sure I'll buy that.. But still, if this really is the case, a *strong* statement in barriers.txt to that effect (and not an example demanding the wmb()!) would be very helpful for those of us that have to review driver code! Jason
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 09:36 +1100, Benjamin Herrenschmidt wrote: > I don't kow, it used to be the case, at least that's what drove us to > define things the way we did. > > Maybe things changed, but if that's the case, nobody knows for sure, > and we probably want to get Linus POV on the matter. > > I know I still write drivers that do not add a wmb in that case because > I expect things to work without it. > > If that has changed, we probably can relax some of the barriers in our > implementations of writel on a number of architectures, but not before > auditing a bunch more drivers to make sure they have the write wmb()'s Note also that this was the entire point behind the definition of the _relaxed() accessors, to lift that specific ordering guarantee. If you now says that memory + writel requires a wmb() in between then you made writel be identical to writel_relaxed. You might notice that Documentation/driver-api/device-io.rst makes no mention of wmb() at all. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > Otherwise almost all drivers out there are broken which I very much > > doubt :-) > > But.. Sinan is right, you look anywhere in the driver tree and you > find stuff like this: > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > /* Force memory writes to complete before letting h/w > * know there are new descriptors to fetch. > */ > wmb(); > > > It is *systemic* Yes, because they all copied e1000e :-) If you look at the comment in there, it does say it's only for weakly ordered archs such as ia64, and even then, probably predates Linus strong statement on the matter. > I even see patches adding wmb() based on actual observed memory > corruption during testing on Intel: > > https://patchwork.kernel.org/patch/10177207/ > > So you think all of this is unnecessary and writel is totally strongly > ordered, even on multi-socket Intel? I don't kow, it used to be the case, at least that's what drove us to define things the way we did. Maybe things changed, but if that's the case, nobody knows for sure, and we probably want to get Linus POV on the matter. I know I still write drivers that do not add a wmb in that case because I expect things to work without it. If that has changed, we probably can relax some of the barriers in our implementations of writel on a number of architectures, but not before auditing a bunch more drivers to make sure they have the write wmb()'s Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 18:08 -0400, Sinan Kaya wrote: > On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > > > But that was never a requirement of writel(), > > > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > > > the wmb() before writel() for ordering system memory against writel. > > > > > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > > > Adding Alexander Duyck to Cc, he added that section as part of > > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > > > dma_wmb()"). Also adding the other people that were involved with that. > > > > > > > > > > ARM developers can get away with not including wmb() in their code and use > > > writel() to observe memory writes due to implicit barriers. > > > > > > However, same code will not work on Intel. > > > > Wrong. It will. > > > > You do NOT need wmb between writes to memory and writel. > > If writel() provides such a guarantee, why do I see code sequences like > > wmb() > writel() > > all over the place. Because it was badly documented and people didn't know what to do, or maybe the underlying mapping is WC ? I don't know for sure but I can tell you Linus opinion on the matter back in the days was very clear and that's why we implemented writel the way we did on powerpc. > > > > > writel() has a compiler barrier in it for x86. > > > wmb() has a sync operation in it for x86. > > > > > > Unless wmb() is called, PCIe device won't observe memory updates from the > > > CPU. > > > > This is completely wrong. They will. Intel provides the necessary > > ordering guarantees without an explicit wmb. > > > > I'm still reserving my doubts here. I was told about an explicit > wmb() requirement last week. By whome ? > > Otherwise almost all drivers out there are broken which I very much > > doubt :-) > > > > Cheers, > > Ben. > > > > > > > >
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > > But that was never a requirement of writel(), > > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > > the wmb() before writel() for ordering system memory against writel. > > > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > > Adding Alexander Duyck to Cc, he added that section as part of > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > > dma_wmb()"). Also adding the other people that were involved with that. > > > > > > > ARM developers can get away with not including wmb() in their code and use > > writel() to observe memory writes due to implicit barriers. > > > > However, same code will not work on Intel. > > Wrong. It will. > > You do NOT need wmb between writes to memory and writel. > > > writel() has a compiler barrier in it for x86. > > wmb() has a sync operation in it for x86. > > > > Unless wmb() is called, PCIe device won't observe memory updates from the > > CPU. > > This is completely wrong. They will. Intel provides the necessary > ordering guarantees without an explicit wmb. > > Otherwise almost all drivers out there are broken which I very much > doubt :-) But.. Sinan is right, you look anywhere in the driver tree and you find stuff like this: drivers/net/ethernet/intel/i40e/i40e_txrx.c /* Force memory writes to complete before letting h/w * know there are new descriptors to fetch. */ wmb(); It is *systemic* I even see patches adding wmb() based on actual observed memory corruption during testing on Intel: https://patchwork.kernel.org/patch/10177207/ So you think all of this is unnecessary and writel is totally strongly ordered, even on multi-socket Intel? Jason
Re: RFC on writel and writel_relaxed
On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: >> On 3/26/2018 5:30 PM, Arnd Bergmann wrote: But that was never a requirement of writel(), Documentation/memory-barriers.txt gives an explicit example demanding the wmb() before writel() for ordering system memory against writel. >>> >>> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). >>> Adding Alexander Duyck to Cc, he added that section as part of >>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and >>> dma_wmb()"). Also adding the other people that were involved with that. >>> >> >> ARM developers can get away with not including wmb() in their code and use >> writel() to observe memory writes due to implicit barriers. >> >> However, same code will not work on Intel. > > Wrong. It will. > > You do NOT need wmb between writes to memory and writel. If writel() provides such a guarantee, why do I see code sequences like wmb() writel() all over the place. > >> writel() has a compiler barrier in it for x86. >> wmb() has a sync operation in it for x86. >> >> Unless wmb() is called, PCIe device won't observe memory updates from the >> CPU. > > This is completely wrong. They will. Intel provides the necessary > ordering guarantees without an explicit wmb. > I'm still reserving my doubts here. I was told about an explicit wmb() requirement last week. > Otherwise almost all drivers out there are broken which I very much > doubt :-) > > Cheers, > Ben. > > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > But that was never a requirement of writel(), > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > the wmb() before writel() for ordering system memory against writel. > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > Adding Alexander Duyck to Cc, he added that section as part of > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > dma_wmb()"). Also adding the other people that were involved with that. > > > > ARM developers can get away with not including wmb() in their code and use > writel() to observe memory writes due to implicit barriers. > > However, same code will not work on Intel. Wrong. It will. You do NOT need wmb between writes to memory and writel. > writel() has a compiler barrier in it for x86. > wmb() has a sync operation in it for x86. > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU. This is completely wrong. They will. Intel provides the necessary ordering guarantees without an explicit wmb. Otherwise almost all drivers out there are broken which I very much doubt :-) Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: > Most of the drivers have a unwound loop with writeq() or something to > > do it. > > But isn't the writeq() barrier much more expensive than anything you'd > do in function calls? It is for us, and will break any write combining. > > > > The same document says that _relaxed() does not give that guarentee. > > > > > > > > The lwn articule on this went into some depth on the interaction with > > > > spinlocks. > > > > > > > > As far as I can see, containment in a spinlock seems to be the only > > > > different between writel and writel_relaxed.. > > > > > > I was always puzzled by this: The intention of _relaxed() on ARM > > > (where it originates) was to skip the barrier that serializes DMA > > > with MMIO, not to skip the serialization between MMIO and locks. > > > > But that was never a requirement of writel(), > > Documentation/memory-barriers.txt gives an explicit example demanding > > the wmb() before writel() for ordering system memory against writel. This is a bug in the documentation. > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > Adding Alexander Duyck to Cc, he added that section as part of > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > dma_wmb()"). Also adding the other people that were involved with that. Linus himself made it very clear years ago. readl and writel have to order vs memory accesses. > > I actually have no idea why ARM had that barrier, I always assumed it > > was to give program ordering to the accesses and that _relaxed allowed > > re-ordering (the usual meaning of relaxed).. > > > > But the barrier document makes it pretty clear that the only > > difference between the two is spinlock containment, and WillD wrote > > this text, so I belive it is accurate for ARM. > > > > Very confusing. > > It does mention serialization with both DMA and locks in the > section about readX_relaxed()/writeX_relaxed(). The part > about DMA is very clear here, and I must have just forgotten > the exact semantics with regards to spinlocks. I'm still not > sure what prevents a writel() from leaking out the end of a > spinlock section that doesn't happen with writel_relaxed(), since > the barrier in writel() comes before the access, and the > spin_unlock() shouldn't affect the external buses. So... Historically, what happened is that we (we means whoever participated in the discussion on the list with Linus calling the shots really) decided that there was no sane way for drivers to understand a world where readl/writel didn't fully order things vs. memory accesses (ie, DMA). So it should always be correct to do: - Write to some in-memory buffer - writel() to kick the DMA read of that buffer without any extra barrier. The spinlock situation however got murky. Mostly that came up because on architecture (I forgot who, might have been ia64) has a hard time providing that consistency without making writel insanely expensive. Thus they created mmiowb whose main purpose was precisely to order writel with a following spin_unlock. I decided not to go down that path on power because getting all drivers "fixed" to do the right thing was going to be a losing battle, and instead added per-cpu tracking of writel in order to "escalate" to a heavier barrier in spin_unlock itself when necessary. Now, all this happened more than a decade ago and it's possible that the understanding or expectations "shifted" over time... Cheers, Ben.
Re: RFC on writel and writel_relaxed
On 3/26/2018 5:30 PM, Arnd Bergmann wrote: >> But that was never a requirement of writel(), >> Documentation/memory-barriers.txt gives an explicit example demanding >> the wmb() before writel() for ordering system memory against writel. > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > Adding Alexander Duyck to Cc, he added that section as part of > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > dma_wmb()"). Also adding the other people that were involved with that. > ARM developers can get away with not including wmb() in their code and use writel() to observe memory writes due to implicit barriers. However, same code will not work on Intel. writel() has a compiler barrier in it for x86. wmb() has a sync operation in it for x86. Unless wmb() is called, PCIe device won't observe memory updates from the CPU. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: RFC on writel and writel_relaxed
On Mon, 2018-03-26 at 10:54 -0600, Jason Gunthorpe wrote: > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote: > > > > This is a super performance critical operation for most drivers and > > > > directly impacts network performance. > > > > Perhaps there ought to be writel_nobarrier() (etc) that never contain > > any barriers at all. > > This might mean that they are always just the memory operation, > > but it would make it more obvious what the driver was doing. > > I think that is what writel_relaxed is supposed to be. > > The only restriction it has is that the writes to a single device > using UC memory must be kept in program order.. Which requires barriers on some architectures :-) Also we don't have a clear definition of what happens on WC memory. Cheers, Ben.
Re: RFC on writel and writel_relaxed
On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe wrote: > On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote: >> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe wrote: >> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote: >> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe wrote: >> >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote: >> >> >> > > This is a super performance critical operation for most drivers and >> >> >> > > directly impacts network performance. >> >> >> >> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain >> >> >> any barriers at all. >> >> >> This might mean that they are always just the memory operation, >> >> >> but it would make it more obvious what the driver was doing. >> >> > >> >> > I think that is what writel_relaxed is supposed to be. >> >> > >> >> > The only restriction it has is that the writes to a single device >> >> > using UC memory must be kept in program order.. >> >> >> >> Not sure about whether we have ever defined what happens to >> >> writel_relaxed() on WC memory though: On ARM, we disallow >> >> the compiler to combine writes, but the CPU still might. >> > >> > If the driver uses WC memory then I think it should not expect >> > anything in terms of how writes map to TLPs other than nothing >> > combines across mmiowb() and mmiowb() is fully globally ordered when >> > enclosed in a spinlock. >> > >> > The entire point of using WC memory is usually to get combining :) If >> > the driver doesn't want that then it should map UC.. >> >> Usually, WC memory is used with memcpy_toio() though, which >> by definition doesn't have any barriers between accesses, and >> is required to get the correct byte ordering on writes to memory buffers. > > memcpy_toio is too expensive to actually use for anything performance > though. It is too pessimistic. What the drivers usually want is a > unwound block of 4 or 8 8-byte copies. No function calls, no > branching. Everything is already known to be aligned. > > Most of the drivers have a unwound loop with writeq() or something to > do it. But isn't the writeq() barrier much more expensive than anything you'd do in function calls? >> > The same document says that _relaxed() does not give that guarentee. >> > >> > The lwn articule on this went into some depth on the interaction with >> > spinlocks. >> > >> > As far as I can see, containment in a spinlock seems to be the only >> > different between writel and writel_relaxed.. >> >> I was always puzzled by this: The intention of _relaxed() on ARM >> (where it originates) was to skip the barrier that serializes DMA >> with MMIO, not to skip the serialization between MMIO and locks. > > But that was never a requirement of writel(), > Documentation/memory-barriers.txt gives an explicit example demanding > the wmb() before writel() for ordering system memory against writel. Indeed, but it's in an example for when to use dma_wmb(), not wmb(). Adding Alexander Duyck to Cc, he added that section as part of 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and dma_wmb()"). Also adding the other people that were involved with that. > I actually have no idea why ARM had that barrier, I always assumed it > was to give program ordering to the accesses and that _relaxed allowed > re-ordering (the usual meaning of relaxed).. > > But the barrier document makes it pretty clear that the only > difference between the two is spinlock containment, and WillD wrote > this text, so I belive it is accurate for ARM. > > Very confusing. It does mention serialization with both DMA and locks in the section about readX_relaxed()/writeX_relaxed(). The part about DMA is very clear here, and I must have just forgotten the exact semantics with regards to spinlocks. I'm still not sure what prevents a writel() from leaking out the end of a spinlock section that doesn't happen with writel_relaxed(), since the barrier in writel() comes before the access, and the spin_unlock() shouldn't affect the external buses. Arnd
Re: RFC on writel and writel_relaxed
On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote: > On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe wrote: > > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote: > >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe wrote: > >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote: > >> >> > > This is a super performance critical operation for most drivers and > >> >> > > directly impacts network performance. > >> >> > >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain > >> >> any barriers at all. > >> >> This might mean that they are always just the memory operation, > >> >> but it would make it more obvious what the driver was doing. > >> > > >> > I think that is what writel_relaxed is supposed to be. > >> > > >> > The only restriction it has is that the writes to a single device > >> > using UC memory must be kept in program order.. > >> > >> Not sure about whether we have ever defined what happens to > >> writel_relaxed() on WC memory though: On ARM, we disallow > >> the compiler to combine writes, but the CPU still might. > > > > If the driver uses WC memory then I think it should not expect > > anything in terms of how writes map to TLPs other than nothing > > combines across mmiowb() and mmiowb() is fully globally ordered when > > enclosed in a spinlock. > > > > The entire point of using WC memory is usually to get combining :) If > > the driver doesn't want that then it should map UC.. > > Usually, WC memory is used with memcpy_toio() though, which > by definition doesn't have any barriers between accesses, and > is required to get the correct byte ordering on writes to memory buffers. memcpy_toio is too expensive to actually use for anything performance though. It is too pessimistic. What the drivers usually want is a unwound block of 4 or 8 8-byte copies. No function calls, no branching. Everything is already known to be aligned. Most of the drivers have a unwound loop with writeq() or something to do it. > > The same document says that _relaxed() does not give that guarentee. > > > > The lwn articule on this went into some depth on the interaction with > > spinlocks. > > > > As far as I can see, containment in a spinlock seems to be the only > > different between writel and writel_relaxed.. > > I was always puzzled by this: The intention of _relaxed() on ARM > (where it originates) was to skip the barrier that serializes DMA > with MMIO, not to skip the serialization between MMIO and locks. But that was never a requirement of writel(), Documentation/memory-barriers.txt gives an explicit example demanding the wmb() before writel() for ordering system memory against writel. I actually have no idea why ARM had that barrier, I always assumed it was to give program ordering to the accesses and that _relaxed allowed re-ordering (the usual meaning of relaxed).. But the barrier document makes it pretty clear that the only difference between the two is spinlock containment, and WillD wrote this text, so I belive it is accurate for ARM. Very confusing. > I never fully understood the part about the locks, but from what > I remember, ARM is still serialized without the barrier here, but > dropping the barrier on powerpc writel_relaxed() would not > serialize against locks or DMA. WC is usually the problem here.. I've been told it is necessary on ARM as well.. Jason
Re: RFC on writel and writel_relaxed
On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe wrote: > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote: >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe wrote: >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote: >> >> > > This is a super performance critical operation for most drivers and >> >> > > directly impacts network performance. >> >> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain >> >> any barriers at all. >> >> This might mean that they are always just the memory operation, >> >> but it would make it more obvious what the driver was doing. >> > >> > I think that is what writel_relaxed is supposed to be. >> > >> > The only restriction it has is that the writes to a single device >> > using UC memory must be kept in program order.. >> >> Not sure about whether we have ever defined what happens to >> writel_relaxed() on WC memory though: On ARM, we disallow >> the compiler to combine writes, but the CPU still might. > > If the driver uses WC memory then I think it should not expect > anything in terms of how writes map to TLPs other than nothing > combines across mmiowb() and mmiowb() is fully globally ordered when > enclosed in a spinlock. > > The entire point of using WC memory is usually to get combining :) If > the driver doesn't want that then it should map UC.. Usually, WC memory is used with memcpy_toio() though, which by definition doesn't have any barriers between accesses, and is required to get the correct byte ordering on writes to memory buffers. >> It's also not entirely clear to me what we want writel() inside a >> spinlock to mean: should the spinlock guarantee that two writel() >> calls on different CPUs that are protected by spinlocks are >> serialized by those locks, or not? > > Yes for writel, I think that is already defined by the barriers > document Sorry, I meant writel_relaxed(), not writel() > The same document says that _relaxed() does not give that guarentee. > > The lwn articule on this went into some depth on the interaction with > spinlocks. > > As far as I can see, containment in a spinlock seems to be the only > different between writel and writel_relaxed.. I was always puzzled by this: The intention of _relaxed() on ARM (where it originates) was to skip the barrier that serializes DMA with MMIO, not to skip the serialization between MMIO and locks. I never fully understood the part about the locks, but from what I remember, ARM is still serialized without the barrier here, but dropping the barrier on powerpc writel_relaxed() would not serialize against locks or DMA. Arnd
Re: RFC on writel and writel_relaxed
On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote: > On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe wrote: > > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote: > >> > > This is a super performance critical operation for most drivers and > >> > > directly impacts network performance. > >> > >> Perhaps there ought to be writel_nobarrier() (etc) that never contain > >> any barriers at all. > >> This might mean that they are always just the memory operation, > >> but it would make it more obvious what the driver was doing. > > > > I think that is what writel_relaxed is supposed to be. > > > > The only restriction it has is that the writes to a single device > > using UC memory must be kept in program order.. > > Not sure about whether we have ever defined what happens to > writel_relaxed() on WC memory though: On ARM, we disallow > the compiler to combine writes, but the CPU still might. If the driver uses WC memory then I think it should not expect anything in terms of how writes map to TLPs other than nothing combines across mmiowb() and mmiowb() is fully globally ordered when enclosed in a spinlock. The entire point of using WC memory is usually to get combining :) If the driver doesn't want that then it should map UC.. > It's also not entirely clear to me what we want writel() inside a > spinlock to mean: should the spinlock guarantee that two writel() > calls on different CPUs that are protected by spinlocks are > serialized by those locks, or not? Yes for writel, I think that is already defined by the barriers document The same document says that _relaxed() does not give that guarentee. The lwn articule on this went into some depth on the interaction with spinlocks. As far as I can see, containment in a spinlock seems to be the only different between writel and writel_relaxed.. Jason