Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, 06 Jun 2017 01:54:23 PDT (-0700), Arnd Bergmann wrote:
> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote:
>> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt wrote:
>>>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
>>>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>>>> new file mode 100644
>>>>>> index ..d942555a7a08
>>>>>> --- /dev/null
>>>>>> +++ b/arch/riscv/include/asm/io.h
>>>>>> @@ -0,0 +1,36 @@
>>>>>> +#ifndef _ASM_RISCV_IO_H
>>>>>> +#define _ASM_RISCV_IO_H
>>>>>> +
>>>>>> +#include
>>>>>
>>>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>>>> helpers using inline assembly, to prevent the compiler from breaking
>>>>> up accesses into byte accesses.
>>>>>
>>>>> Also, most architectures require some synchronization after a
>>>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>>>> before a writel() to flush write buffers when a DMA gets triggered.
>>>>
>>>> Makes sense.  These were all OK on existing implementations (as there's no
>>>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>>>> actually what the RISC-V ISA says.  I patterned this on arm64
>>>>
>>>>   https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>>>
>>>> where I think the only odd thing is our definition of mmiowb
>>>>
>>>> +/* IO barriers.  These only fence on the IO bits because they're only required
>>>> + * to order device access.  We're defining mmiowb because our AMO instructions
>>>> + * (which are used to implement locks) don't specify ordering.  From Chapter 7
>>>> + * of v2.2 of the user ISA:
>>>> + * "The bits order accesses to one of the two address domains, memory or I/O,
>>>> + * depending on which address domain the atomic instruction is accessing.  No
>>>> + * ordering constraint is implied to accesses to the other domain, and a FENCE
>>>> + * instruction should be used to order across both domains."
>>>> + */
>>>> +
>>>> +#define __iormb()	__asm__ __volatile__ ("fence i,io" : : : "memory");
>>>> +#define __iowmb()	__asm__ __volatile__ ("fence io,o" : : : "memory");
>>>
>>> Looks ok, yes.
>>>
>>>> +#define mmiowb()	__asm__ __volatile__ ("fence io,io" : : : "memory");
>>>>
>>>> which I think is correct.
>>>
>>> I can never remember what exactly this one does.
>>
>> I can't find the reference again, but what I found said that if your atomics
>> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
>> then you should define mmiowb to ensure ordering.  I managed to screw this up,
>> as there's no "w" in the successor set (to actually enforce the AMO ordering).
>> This is somewhat confirmed by
>>
>>   https://lkml.org/lkml/2006/8/31/174
>>   Subject: Re: When to use mmiowb()?
>>   AFAICT, they're both right. Generally, mmiowb() should be used prior to
>>   unlock in a critical section whose last PIO operation is a writeX.
>>
>> Thus, I think the actual fence should be at least
>>
>>   fence o,w
> ...
>> which matches what's above.  I think "fence o,w" is sufficient for a mmiowb on
>> RISC-V.  I'll make the change.
>
> This sounds reasonable according to the documentation, but with your
> longer explanation of the barriers, I think the __iormb/__iowmb definitions
> above are wrong.  What you actually need I think is
>
>     void writel(u32 v, volatile void __iomem *addr)
>     {
>             asm volatile("fence w,o" : : : "memory");
>             writel_relaxed(v, addr);
>     }
>
>     u32 readl(volatile void __iomem *addr)
>     {
>             u32 ret = readl_relaxed(addr);
>             asm volatile("fence i,r" : : : "memory");
>             return ret;
>     }
>
> to synchronize between DMA and I/O.  The barriers you listed above
> in contrast appear to be directed at synchronizing I/O with other I/O.
> We normally assume that this is not required when you have
> subsequent MMIO accesses on the same device (on PCI) or the
> same address region (per ARM architecture and others).  If you do
> need to enforce ordering between MMIO, you might even need to
> add those barriers in the relaxed version to be portable with drivers
> written for ARM SoCs:
>
>     void writel_relaxed(u32 v, volatile void __iomem *addr)
>     {
>             __raw_writel((__force u32)cpu_to_le32(v), addr);
>             asm volatile("fence o,io" : : : "memory");
>     }
>
>     u32 readl_relaxed(volatile void __iomem *addr)
>     {
>             asm volatile("fence i,io" : : : "memory");
>             return le32_to_cpu((__force __le32)__raw_readl(addr));
>     }
>
> You then end up with a barrier before and after each regular
> readl/writel in order to synchronize both with DMA and MMIO
> instructions, and you still need the extra mmiowb() to
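Concretely, Arnd's proposed non-relaxed accessors can be sketched in plain C. The FENCE() macro and the plain-pointer stand-ins for __raw_writel()/__raw_readl() are illustrative assumptions, not the kernel's actual definitions; on non-RISC-V hosts the fences degrade to compiler barriers so the sketch compiles anywhere:

```c
#include <stdint.h>

/* Illustrative only: on a RISC-V build this expands to the fence
 * instructions discussed above; elsewhere it is a plain compiler barrier. */
#ifdef __riscv
#define FENCE(p, s) __asm__ __volatile__ ("fence " #p "," #s : : : "memory")
#else
#define FENCE(p, s) __asm__ __volatile__ ("" : : : "memory")
#endif

/* Stand-ins for __raw_writel()/__raw_readl(): plain volatile accesses. */
static inline void writel_relaxed(uint32_t v, volatile uint32_t *addr)
{
	*addr = v;
}

static inline uint32_t readl_relaxed(volatile uint32_t *addr)
{
	return *addr;
}

/* Non-relaxed variants, ordering MMIO against DMA memory accesses. */
static inline void writel(uint32_t v, volatile uint32_t *addr)
{
	FENCE(w, o);		/* prior memory writes visible before the MMIO store */
	writel_relaxed(v, addr);
}

static inline uint32_t readl(volatile uint32_t *addr)
{
	uint32_t ret = readl_relaxed(addr);
	FENCE(i, r);		/* MMIO read completes before later memory reads */
	return ret;
}
```

The fence arguments mirror the "fence w,o" / "fence i,r" pairs from the quoted proposal; swapping in the relaxed-with-fences variants for ARM-SoC driver portability would follow the same pattern.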
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote:
> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt wrote:
>>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
>>
>> - Many architectures have cache line prefetch, flush, zero or copy
>>   instructions that are used for important performance optimizations
>>   but that are typically defined on a cacheline granularity.  I don't
>>   think you currently have any of them, but it seems likely that there
>>   will be demand for them later.
>
> We actually have an implicit prefetch (loads to x0, the zero register), but
> it still has all the load side-effects so nothing uses it.
>
>> Having a larger than necessary alignment can waste substantial amounts
>> of memory for arrays of cache line aligned structures (typically
>> per-cpu arrays), but otherwise should not cause harm.
>
> I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
> because of how they line up with DIMMs.  Since there's no spec to define this,
> there's no correct answer.  I'd be amenable to making this a Kconfig option,
> but I think we'll leave it alone for now.  It does match the extant
> implementations.

Hmm, this sounds like a hole in the architecture definition: if you have an
instruction that performs a prefetch (even one that is not easily usable), I
would argue that the cache line size has become a feature of the architecture
and is no longer strictly an implementation detail of the microarchitecture.

Regarding the memory interface, a lot of systems use two DIMMs on each memory
channel for 128-bit parallel buses, and with LP-DDRx controllers, you might
have a width as small as 16 bits.
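Palmer's "make it a Kconfig option" idea could be sketched like this; the CONFIG_RISCV_L1_CACHE_SHIFT symbol is hypothetical (nothing in the thread names it), defaulting to the 64-byte lines of the extant implementations:

```c
/* Hypothetical Kconfig-driven line size: a build system would define
 * CONFIG_RISCV_L1_CACHE_SHIFT; we default it here so the sketch stands alone. */
#ifndef CONFIG_RISCV_L1_CACHE_SHIFT
#define CONFIG_RISCV_L1_CACHE_SHIFT 6
#endif

#define L1_CACHE_SHIFT	CONFIG_RISCV_L1_CACHE_SHIFT
#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)
```

An implementation with, say, 128-byte lines would then set the option to 7 rather than patching the header.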
>>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>>> new file mode 100644
>>>>> index ..d942555a7a08
>>>>> --- /dev/null
>>>>> +++ b/arch/riscv/include/asm/io.h
>>>>> @@ -0,0 +1,36 @@
>>>>> +#ifndef _ASM_RISCV_IO_H
>>>>> +#define _ASM_RISCV_IO_H
>>>>> +
>>>>> +#include
>>>>
>>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>>> helpers using inline assembly, to prevent the compiler from breaking
>>>> up accesses into byte accesses.
>>>>
>>>> Also, most architectures require some synchronization after a
>>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>>> before a writel() to flush write buffers when a DMA gets triggered.
>>>
>>> Makes sense.  These were all OK on existing implementations (as there's no
>>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>>> actually what the RISC-V ISA says.  I patterned this on arm64
>>>
>>>   https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>>
>>> where I think the only odd thing is our definition of mmiowb
>>>
>>> +/* IO barriers.  These only fence on the IO bits because they're only required
>>> + * to order device access.  We're defining mmiowb because our AMO instructions
>>> + * (which are used to implement locks) don't specify ordering.  From Chapter 7
>>> + * of v2.2 of the user ISA:
>>> + * "The bits order accesses to one of the two address domains, memory or I/O,
>>> + * depending on which address domain the atomic instruction is accessing.  No
>>> + * ordering constraint is implied to accesses to the other domain, and a FENCE
>>> + * instruction should be used to order across both domains."
>>> + */
>>> +
>>> +#define __iormb()	__asm__ __volatile__ ("fence i,io" : : : "memory");
>>> +#define __iowmb()	__asm__ __volatile__ ("fence io,o" : : : "memory");
>>
>> Looks ok, yes.
>>
>>> +#define mmiowb()	__asm__ __volatile__ ("fence io,io" : : : "memory");
>>>
>>> which I think is correct.
>>
>> I can never remember what exactly this one does.
>
> I can't find the reference again, but what I found said that if your atomics
> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
> then you should define mmiowb to ensure ordering.  I managed to screw this up,
> as there's no "w" in the successor set (to actually enforce the AMO ordering).
> This is somewhat confirmed by
>
>   https://lkml.org/lkml/2006/8/31/174
>   Subject: Re: When to use mmiowb()?
>   AFAICT, they're both right. Generally, mmiowb() should be used prior to
>   unlock in a critical section whose last PIO operation is a writeX.
>
> Thus, I think the actual fence should be at least
>
>   fence o,w
...
> which matches what's above.  I think "fence o,w" is sufficient for a mmiowb on
> RISC-V.  I'll make the change.

This sounds reasonable according to the documentation, but with your
longer explanation of the barriers, I think the __iormb/__iowmb definitions
above are wrong.  What you actually need I think is

    void writel(u32 v, volatile void __iomem
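As an aside on the mmiowb discussion, the "fence o,w" conclusion can be sketched as follows; the lock variable and command register are stand-ins for a real spinlock and writel(), and on non-RISC-V hosts the fence degrades to a compiler barrier so the sketch compiles anywhere:

```c
#include <stdint.h>

/* mmiowb() per the "fence o,w" conclusion above: MMIO writes (o) ordered
 * before subsequent memory writes (w), i.e. before the unlocking store. */
#ifdef __riscv
#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory")
#else
#define mmiowb() __asm__ __volatile__ ("" : : : "memory")
#endif

static int lock;	/* stand-in for a spinlock protecting the device */

/* Critical section whose last PIO operation is a write: mmiowb() keeps the
 * MMIO store from being reordered past the lock-releasing store. */
static void send_command(volatile uint32_t *cmd_reg, uint32_t cmd)
{
	lock = 1;		/* spin_lock() stand-in */
	*cmd_reg = cmd;		/* writel() stand-in */
	mmiowb();		/* MMIO write visible before the unlock */
	lock = 0;		/* spin_unlock() stand-in */
}
```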
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt wrote:
>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
>>>>
>>>> +#ifndef _ASM_RISCV_CACHE_H
>>>> +#define _ASM_RISCV_CACHE_H
>>>> +
>>>> +#define L1_CACHE_SHIFT 6
>>>> +
>>>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>>
>>> Is this the only valid cache line size on riscv, or just the largest
>>> one that is allowed?
>>
>> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
>> restriction on L1 cache line size (we tried to keep microarchitecture out of
>> the ISA specification).  We provide the actual cache parameters as part of the
>> device tree, but it looks like this needs to be known statically in some places
>> so we can't use that everywhere.
>>
>> We could always make this a Kconfig parameter.
>
> The cache line size is used in a couple of places, let's go through the most
> common ones to see where that abstraction might be leaky and you actually
> get an architectural effect:
>
> - On SMP machines, cacheline_aligned_in_smp is used to annotate
>   data structures used in lockless algorithms, typically with one CPU writing
>   to some members of a structure, and another CPU reading from it but
>   not writing the same members.  Depending on the architecture, having a
>   larger actual alignment than L1_CACHE_BYTES will either lead to
>   bad performance from cache line ping pong, or actual data corruption.

On RISC-V it's just a performance problem, so at least it's not catastrophic.

> - On systems with DMA masters that are not fully coherent,
>   cacheline_aligned is used to annotate data structures used
>   for DMA buffers, to make sure that the cache maintenance operations
>   in dma_sync_*_for_*() helpers don't corrupt data outside of the
>   DMA buffer.  You don't seem to support noncoherent DMA masters
>   or the cache maintenance operations required to use those, so this
>   might not be a problem until someone adds an extension for those.
>
> - Depending on the bus interconnect, a coherent DMA master might
>   not be able to update partial cache lines, so you need the same
>   annotation.

Well, our (SiFive's) bus is easy to master so hopefully we won't end up doing
that.  There is, of course, the rest of the world -- but that's just a bridge
we'll have to cross later (if such an implementation arises).

> - The kmalloc() family of memory allocators aligns data to the cache
>   line size, for both DMA and SMP synchronization above.

Ya, but luckily just a performance problem on RISC-V.

> - Many architectures have cache line prefetch, flush, zero or copy
>   instructions that are used for important performance optimizations
>   but that are typically defined on a cacheline granularity.  I don't
>   think you currently have any of them, but it seems likely that there
>   will be demand for them later.

We actually have an implicit prefetch (loads to x0, the zero register), but
it still has all the load side-effects so nothing uses it.

> Having a larger than necessary alignment can waste substantial amounts
> of memory for arrays of cache line aligned structures (typically
> per-cpu arrays), but otherwise should not cause harm.

I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
because of how they line up with DIMMs.  Since there's no spec to define this,
there's no correct answer.  I'd be amenable to making this a Kconfig option,
but I think we'll leave it alone for now.  It does match the extant
implementations.
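The cacheline_aligned_in_smp point above can be illustrated with a small sketch; the attribute macro is a stand-in for the kernel's own ____cacheline_aligned_in_smp, assuming the 64-byte L1_CACHE_BYTES under discussion:

```c
#include <stddef.h>

/* Stand-ins for the kernel definitions, assuming the 64-byte lines
 * discussed in the thread. */
#define L1_CACHE_SHIFT	6
#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)
#define ____cacheline_aligned_in_smp __attribute__((aligned(L1_CACHE_BYTES)))

/* Lockless producer/consumer indices: each lives on its own cache line so
 * the writing CPU and the reading CPU don't ping-pong a single line. */
struct ring {
	unsigned long head ____cacheline_aligned_in_smp;	/* producer-written */
	unsigned long tail ____cacheline_aligned_in_smp;	/* consumer-written */
};
```

If the real line size were larger than L1_CACHE_BYTES, the two members could share a line again, which on RISC-V costs performance (ping-pong) rather than correctness, as Palmer notes above.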
>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>> new file mode 100644
>>>> index ..d942555a7a08
>>>> --- /dev/null
>>>> +++ b/arch/riscv/include/asm/io.h
>>>> @@ -0,0 +1,36 @@
>>>> +#ifndef _ASM_RISCV_IO_H
>>>> +#define _ASM_RISCV_IO_H
>>>> +
>>>> +#include
>>>
>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>> helpers using inline assembly, to prevent the compiler from breaking
>>> up accesses into byte accesses.
>>>
>>> Also, most architectures require some synchronization after a
>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>> before a writel() to flush write buffers when a DMA gets triggered.
>>
>> Makes sense.  These were all OK on existing implementations (as there's no
>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>> actually what the RISC-V ISA says.  I patterned this on arm64
>>
>>   https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>
>> where I think the only odd thing is our definition of mmiowb
>>
>> +/* IO barriers.  These only fence on the IO bits because they're only required
>> + * to order device access.  We're defining mmiowb because our AMO instructions
>> + * (which are used to implement locks) don't specify
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, 23 May 2017 14:23:50 PDT (-0700), b...@kernel.crashing.org wrote:
> On Tue, 2017-05-23 at 14:55 +0200, Arnd Bergmann wrote:
>>> +
>>> +#include
>>
>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>> helpers using inline assembly, to prevent the compiler from breaking
>> up accesses into byte accesses.
>>
>> Also, most architectures require some synchronization after a
>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>> before a writel() to flush write buffers when a DMA gets triggered.
>
> Right, I was about to comment on that one.

Well, you're both correct: what was there just isn't correct.  Our
implementations were safe because they don't have aggressive MMIO systems, but
that won't remain true for long.  I've gone ahead and added a proper IO
implementation patterned on arm64.  It'll be part of the v2 patch set.  Here's
the bulk of the patch, if you're curious

  https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa

> The question, Palmer, is about the ordering semantics of non-cached
> storage.
>
> What kind of ordering is provided architecturally ? Especially
> between cachable and non-cachable loads and stores ?

The memory model on RISC-V is pretty weak.  Without fences there are no
ordering constraints.  We provide 2 "ordering spaces" (an odd name I just made
up, there's one address space): the IO space and the regular memory space.

The base RISC-V ordering primitive is a fence, which takes a predecessor set
and a successor set.  Fences can look like

  fence IORW,IORW

where the left side is the predecessor set and the right side is the successor
set.  The fence enforces ordering between any operations in the two sets: all
operations in the predecessor set must be globally visible before any
operation in the successor set becomes visible anywhere.

For example, if you're emitting a DMA transaction you'd have to do something
like

  build_message_in_memory()
  fence w,o
  set_control_register()

with the fence ensuring all the memory writes are visible before the control
register write.  More information can be found in the ISA manuals

  https://github.com/riscv/riscv-isa-manual/releases/download/riscv-user-2.2/riscv-spec-v2.2.pdf
  https://github.com/riscv/riscv-isa-manual/releases/download/riscv-priv-1.10/riscv-privileged-v1.10.pdf

> Also, you have PCIe right ? What is the behaviour of MSIs ?
>
> Does your HW provide a guarantee that in the case of a series of DMA
> writes to memory by a device followed by an MSI, the CPU getting the
> MSI will only get it after all the previous DMA writes have reached
> coherency ? (Unlike LSIs where the driver is required to do an MMIO
> read from the device, MSIs are expected to be ordered with data).

We do have PCIe, but I'm not particularly familiar with it as I haven't spent
any time hacking on our PCIe hardware or driver.  My understanding here is
that PCIe defines that MSIs must not be reordered before the DMA writes, so
the implementation is required to enforce this ordering.  Thus it's a problem
for the PCIe controller implementation (which isn't covered by RISC-V) and
therefore doesn't need any ordering enforced by the driver.

If I'm correct in the assumption that the hardware is required to enforce
these ordering constraints then I think this isn't a RISC-V issue.  I'll go
bug our PCIe guys to make sure everything is kosher in that case, but it's an
ordering constraint that is possible to enforce in our coherence protocol so
it's not a fundamental problem (and just an implementation one at that).  If
the hardware isn't required to enforce the ordering, then we'll need a fence
before handling the data.

> Another thing with the read*() accessors.
> It's not uncommon for a driver to do:
>
>	writel(1, reset_reg);
>	readl(reset_reg);	/* flush posted writes */
>	udelay(10);
>	writel(0, reset_reg);
>
> Now, in the above case, what can typically happen if you aren't careful
> is that the readl which is intended to "push" the previous writel will
> not actually do its job, because the return value hasn't been "consumed"
> by the processor. Thus, the CPU will stick that on some kind of load
> queue and won't actually wait for the return value before hitting the
> delay loop.
>
> Thus you might end up in a situation where the writel of 1 to the
> device is itself reaching the device way after you started the delay
> loop, and thus end up violating the delay requirement of the HW.
>
> On powerpc we solve that by using a special instruction construct
> inside the read* accessors that prevents the CPU from executing
> subsequent instructions until the read value has been returned.
>
> You may want to consider something similar.

Ooh, that's a fun one :). I bugged Andrew (the ISA wizard), and this might require some clarification in the ISA manual. The code we emit will look something like

	fence io,o
	st 1, RESET_REG
	ld RESET_REG
	fence o,io
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt wrote:
> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
>>> +#ifndef _ASM_RISCV_CACHE_H
>>> +#define _ASM_RISCV_CACHE_H
>>> +
>>> +#define L1_CACHE_SHIFT 6
>>> +
>>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>
>> Is this the only valid cache line size on riscv, or just the largest
>> one that is allowed?
>
> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
> restriction on L1 cache line size (we tried to keep microarchitecture out of
> the ISA specification). We provide the actual cache parameters as part of the
> device tree, but it looks like this needs to be known statically in some places,
> so we can't use that everywhere.
>
> We could always make this a Kconfig parameter.

The cache line size is used in a couple of places; let's go through the most common ones to see where that abstraction might be leaky and you actually get an architectural effect:

- On SMP machines, cacheline_aligned_in_smp is used to annotate data structures used in lockless algorithms, typically with one CPU writing to some members of a structure and another CPU reading from it but not writing the same members. Depending on the architecture, having a larger actual alignment than L1_CACHE_BYTES will either lead to bad performance from cache line ping-pong, or to actual data corruption.

- On systems with DMA masters that are not fully coherent, cacheline_aligned is used to annotate data structures used for DMA buffers, to make sure that the cache maintenance operations in the dma_sync_*_for_*() helpers don't corrupt data outside of the DMA buffer. You don't seem to support noncoherent DMA masters or the cache maintenance operations required to use them, so this might not be a problem until someone adds an extension for those.
- Depending on the bus interconnect, a coherent DMA master might not be able to update partial cache lines, so you need the same annotation.

- The kmalloc() family of memory allocators aligns data to the cache line size, for both the DMA and the SMP synchronization above.

- Many architectures have cache line prefetch, flush, zero, or copy instructions that are used for important performance optimizations but that are typically defined on a cacheline granularity. I don't think you currently have any of them, but it seems likely that there will be demand for them later.

Having a larger than necessary alignment can waste substantial amounts of memory for arrays of cache line aligned structures (typically per-cpu arrays), but otherwise should not cause harm.

>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>> new file mode 100644
>>> index ..d942555a7a08
>>> --- /dev/null
>>> +++ b/arch/riscv/include/asm/io.h
>>> @@ -0,0 +1,36 @@
>>
>>> +#ifndef _ASM_RISCV_IO_H
>>> +#define _ASM_RISCV_IO_H
>>> +
>>> +#include
>>
>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>> helpers using inline assembly, to prevent the compiler from breaking
>> up accesses into byte accesses.
>>
>> Also, most architectures require some synchronization after a
>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>> before a writel() to flush write buffers when a DMA gets triggered.
>
> Makes sense. These were all OK on existing implementations (as there are no
> writable PMAs, so all MMIO regions are strictly ordered), but that's not
> actually what the RISC-V ISA says. I patterned this on arm64
>
> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>
> where I think the only odd thing is our definition of mmiowb
>
> +/* IO barriers. These only fence on the IO bits, because they're only required
> + * to order device access. We're defining mmiowb because our AMO instructions
> + * (which are used to implement locks) don't specify ordering. From Chapter 7
> + * of v2.2 of the user ISA:
> + * "The bits order accesses to one of the two address domains, memory or I/O,
> + * depending on which address domain the atomic instruction is accessing. No
> + * ordering constraint is implied to accesses to the other domain, and a FENCE
> + * instruction should be used to order across both domains."
> + */
> +
> +#define __iormb()	__asm__ __volatile__ ("fence i,io" : : : "memory");
> +#define __iowmb()	__asm__ __volatile__ ("fence io,o" : : : "memory");

Looks ok, yes.

> +#define mmiowb()	__asm__ __volatile__ ("fence io,io" : : : "memory");
>
> which I think is correct.

I can never remember what exactly this one does.

>>> +#ifdef __KERNEL__
>>> +
>>> +#ifdef CONFIG_MMU
>>> +
>>> +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size);
>>> +
>>> +#define ioremap_nocache(addr, size)	ioremap((addr), (size))
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
>> +/**
>> + * atomic_read - read atomic variable
>> + * @v: pointer of type atomic_t
>> + *
>> + * Atomically reads the value of @v.
>> + */
>> +static inline int atomic_read(const atomic_t *v)
>> +{
>> +	return *((volatile int *)(&(v->counter)));
>> +}
>> +/**
>> + * atomic_set - set atomic variable
>> + * @v: pointer of type atomic_t
>> + * @i: required value
>> + *
>> + * Atomically sets the value of @v to @i.
>> + */
>> +static inline void atomic_set(atomic_t *v, int i)
>> +{
>> +	v->counter = i;
>> +}
>
> These commonly use READ_ONCE() and WRITE_ONCE();
> I'd recommend doing the same here to be on the safe side.

Makes sense.

https://github.com/riscv/riscv-linux/commit/77647f9e4dccab68c69a212a63c9efe1db2b7b1c

>> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
>> new file mode 100644
>> index ..10d894ac3137
>> --- /dev/null
>> +++ b/arch/riscv/include/asm/bug.h
>> @@ -0,0 +1,81 @@
>> +/*
>>
>> +#ifndef _ASM_RISCV_BUG_H
>> +#define _ASM_RISCV_BUG_H
>
>> +#ifdef CONFIG_GENERIC_BUG
>> +#define __BUG_INSN _AC(0x00100073, UL) /* sbreak */
>
> Please have a look at the modifications I did for !CONFIG_BUG
> on x86, arm and arm64. It's generally better to define BUG to a
> trap even when CONFIG_BUG is disabled, otherwise you run
> into undefined behavior in some code, and gcc will print annoying
> warnings about that.

OK, seems like a good thing.

https://github.com/riscv/riscv-linux/commit/67db001653614c6555424b3812d7edfba12a6d4c

>> +#ifndef _ASM_RISCV_CACHE_H
>> +#define _ASM_RISCV_CACHE_H
>> +
>> +#define L1_CACHE_SHIFT 6
>> +
>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>
> Is this the only valid cache line size on riscv, or just the largest
> one that is allowed?
The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no restriction on L1 cache line size (we tried to keep microarchitecture out of the ISA specification). We provide the actual cache parameters as part of the device tree, but it looks like this needs to be known statically in some places, so we can't use that everywhere.

We could always make this a Kconfig parameter.

>> +
>> +static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
>> +{
>> +	return (dma_addr_t)paddr;
>> +}
>> +
>> +static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t dev_addr)
>> +{
>> +	return (phys_addr_t)dev_addr;
>> +}
>
> What do you need these for? If possible, try to remove them.
>
>> +static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size, enum dma_data_direction dir)
>> +{
>> +	/*
>> +	 * RISC-V is cache-coherent, so this is mostly a no-op.
>> +	 * However, we do need to ensure that dma_cache_sync()
>> +	 * enforces order, hence the mb().
>> +	 */
>> +	mb();
>> +}
>
> Do you even support any drivers that use
> dma_alloc_noncoherent()/dma_cache_sync()?
>
> I would guess you can just leave this out.

These must have been vestigial code; they appear safe to remove.

https://github.com/riscv/riscv-linux/commit/d1c88783d5ff66464a25173f7a4af139f0ebf5e2

>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>> new file mode 100644
>> index ..d942555a7a08
>> --- /dev/null
>> +++ b/arch/riscv/include/asm/io.h
>> @@ -0,0 +1,36 @@
>
>> +#ifndef _ASM_RISCV_IO_H
>> +#define _ASM_RISCV_IO_H
>> +
>> +#include
>
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
>
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

Makes sense.
These were all OK on existing implementations (as there are no writable PMAs, so all MMIO regions are strictly ordered), but that's not actually what the RISC-V ISA says. I patterned this on arm64

https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa

where I think the only odd thing is our definition of mmiowb

+/* IO barriers. These only fence on the IO bits, because they're only required
+ * to order device access. We're defining mmiowb because our AMO instructions
+ * (which are used to implement locks) don't specify ordering. From Chapter 7
+ * of v2.2 of the user ISA:
+ * "The bits order accesses to one of the two address domains, memory or I/O,
+ * depending on which address domain the atomic instruction is accessing. No
+ * ordering constraint is implied to accesses to the other domain, and a FENCE
+ * instruction should be used to order across both domains."
+ */
+
+#define __iormb()	__asm__ __volatile__ ("fence i,io" : : : "memory");
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, 2017-05-23 at 14:55 +0200, Arnd Bergmann wrote:
> > +
> > +#include
>
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
>
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

Right, I was about to comment on that one.

The question, Palmer, is about the ordering semantics of non-cached storage.

What kind of ordering is provided architecturally? Especially between cacheable and non-cacheable loads and stores?

Also, you have PCIe, right? What is the behaviour of MSIs?

Does your HW provide a guarantee that, in the case of a series of DMA writes to memory by a device followed by an MSI, the CPU getting the MSI will only get it after all the previous DMA writes have reached coherency? (Unlike LSIs, where the driver is required to do an MMIO read from the device, MSIs are expected to be ordered with data.)

Another thing with the read*() accessors. It's not uncommon for a driver to do:

	writel(1, reset_reg);
	readl(reset_reg);	/* flush posted writes */
	udelay(10);
	writel(0, reset_reg);

Now, in the above case, what can typically happen if you aren't careful is that the readl which is intended to "push" the previous writel will not actually do its job, because the return value hasn't been "consumed" by the processor. The CPU will stick that on some kind of load queue and won't actually wait for the return value before hitting the delay loop.

Thus you might end up in a situation where the writel of 1 to the device is itself reaching the device way after you started the delay loop, and you thus end up violating the delay requirement of the HW.
On powerpc we solve that by using a special instruction construct inside the read* accessors that prevents the CPU from executing subsequent instructions until the read value has been returned.

You may want to consider something similar.

Cheers,
Ben.
Re: [PATCH 4/7] RISC-V: arch/riscv/include
On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote:
> +/**
> + * atomic_read - read atomic variable
> + * @v: pointer of type atomic_t
> + *
> + * Atomically reads the value of @v.
> + */
> +static inline int atomic_read(const atomic_t *v)
> +{
> +	return *((volatile int *)(&(v->counter)));
> +}
> +/**
> + * atomic_set - set atomic variable
> + * @v: pointer of type atomic_t
> + * @i: required value
> + *
> + * Atomically sets the value of @v to @i.
> + */
> +static inline void atomic_set(atomic_t *v, int i)
> +{
> +	v->counter = i;
> +}

These commonly use READ_ONCE() and WRITE_ONCE();
I'd recommend doing the same here to be on the safe side.

> +/**
> + * atomic64_read - read atomic64 variable
> + * @v: pointer of type atomic64_t
> + *
> + * Atomically reads the value of @v.
> + */
> +static inline s64 atomic64_read(const atomic64_t *v)
> +{
> +	return *((volatile long *)(&(v->counter)));
> +}
> +
> +/**
> + * atomic64_set - set atomic64 variable
> + * @v: pointer to type atomic64_t
> + * @i: required value
> + *
> + * Atomically sets the value of @v to @i.
> + */
> +static inline void atomic64_set(atomic64_t *v, s64 i)
> +{
> +	v->counter = i;
> +}

Same here.

> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
> new file mode 100644
> index ..10d894ac3137
> --- /dev/null
> +++ b/arch/riscv/include/asm/bug.h
> @@ -0,0 +1,81 @@
> +/*
>
> +#ifndef _ASM_RISCV_BUG_H
> +#define _ASM_RISCV_BUG_H
> +#ifdef CONFIG_GENERIC_BUG
> +#define __BUG_INSN _AC(0x00100073, UL) /* sbreak */

Please have a look at the modifications I did for !CONFIG_BUG
on x86, arm and arm64. It's generally better to define BUG to a
trap even when CONFIG_BUG is disabled, otherwise you run
into undefined behavior in some code, and gcc will print annoying
warnings about that.
> +#ifndef _ASM_RISCV_CACHE_H > +#define _ASM_RISCV_CACHE_H > + > +#define L1_CACHE_SHIFT 6 > + > +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT) Is this the only valid cache line size on riscv, or just the largest one that is allowed? > + > +static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr) > +{ > + return (dma_addr_t)paddr; > +} > + > +static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t > dev_addr) > +{ > + return (phys_addr_t)dev_addr; > +} What do you need these for? If possible, try to remove them. > +static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t > size, enum dma_data_direction dir) > +{ > + /* > +* RISC-V is cache-coherent, so this is mostly a no-op. > +* However, we do need to ensure that dma_cache_sync() > +* enforces order, hence the mb(). > +*/ > + mb(); > +} Do you even support any drivers that use dma_alloc_noncoherent()/dma_cache_sync()? I would guess you can just leave this out. > diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h > new file mode 100644 > index ..d942555a7a08 > --- /dev/null > +++ b/arch/riscv/include/asm/io.h > @@ -0,0 +1,36 @@ > +#ifndef _ASM_RISCV_IO_H > +#define _ASM_RISCV_IO_H > + > +#include I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed} helpers using inline assembly, to prevent the compiler for breaking up accesses into byte accesses. Also, most architectures require to some synchronization after a non-relaxed readl() to prevent prefetching of DMA buffers, and before a writel() to flush write buffers when a DMA gets triggered. > +#ifdef __KERNEL__ > + > +#ifdef CONFIG_MMU > + > +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size); > + > +#define ioremap_nocache(addr, size) ioremap((addr), (size)) > +#define ioremap_wc(addr, size) ioremap((addr), (size)) > +#define ioremap_wt(addr, size) ioremap((addr), (size)) Is this a hard architecture limitation? 
Normally you really want write-combined access on frame buffer memory and a few other cases for performance reasons, and ioremap_wc() gets used by memremap() for addressing RAM in some cases, and you normally don't want to have PTEs for the same memory using cached and uncached page flags.

> diff --git a/arch/riscv/include/asm/serial.h b/arch/riscv/include/asm/serial.h
> new file mode 100644
> index ..d783dbe80a4b
> --- /dev/null
> +++ b/arch/riscv/include/asm/serial.h
> @@ -0,0 +1,43 @@
> +/*
> + * Copyright (C) 2014 Regents of the University of California
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation, version 2.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT. See the GNU General Public License for
> + * more details.
> + */
> +