Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-06-06 Thread Palmer Dabbelt
On Tue, 06 Jun 2017 01:54:23 PDT (-0700), Arnd Bergmann wrote:
> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt  wrote:
>> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt  wrote:
 On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  
> wrote:
>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>> new file mode 100644
>> index ..d942555a7a08
>> --- /dev/null
>> +++ b/arch/riscv/include/asm/io.h
>> @@ -0,0 +1,36 @@
>
>> +#ifndef _ASM_RISCV_IO_H
>> +#define _ASM_RISCV_IO_H
>> +
>> +#include 
>
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
>
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

 Makes sense.  These were all OK on existing implementations (as there's no
 writable PMAs, so all MMIO regions are strictly ordered), but that's not
 actually what the RISC-V ISA says.  I patterned this on arm64

   
 https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa

 where I think the only odd thing is our definition of mmiowb

   +/* IO barriers.  These only fence on the IO bits because they're only required
   + * to order device access.  We're defining mmiowb because our AMO instructions
   + * (which are used to implement locks) don't specify ordering.  From Chapter 7
   + * of v2.2 of the user ISA:
   + * "The bits order accesses to one of the two address domains, memory or I/O,
   + * depending on which address domain the atomic instruction is accessing. No
   + * ordering constraint is implied to accesses to the other domain, and a FENCE
   + * instruction should be used to order across both domains."
   + */
   +
   +#define __iormb()   __asm__ __volatile__ ("fence i,io" : : : "memory");
   +#define __iowmb()   __asm__ __volatile__ ("fence io,o" : : : "memory");
>>>
>>> Looks ok, yes.
>>>
   +#define mmiowb()    __asm__ __volatile__ ("fence io,io" : : : "memory");

 which I think is correct.
>>>
>>> I can never remember what exactly this one does.
>>
>> I can't find the reference again, but what I found said that if your atomics
>> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
>> then you should define mmiowb to ensure ordering.  I managed to screw this up,
>> as there's no "w" in the successor set (to actually enforce the AMO ordering).
>> This is somewhat confirmed by
>>
>>   https://lkml.org/lkml/2006/8/31/174
>>   Subject: Re: When to use mmiowb()?
>>   AFAICT, they're both right.  Generally, mmiowb() should be used prior to
>>   unlock in a critical section whose last PIO operation is a writeX.
>>
>> Thus, I think the actual fence should be at least
>>
>>   fence o,w
> ...
>> which matches what's above.  I think "fence o,w" is sufficient for a mmiowb on
>> RISC-V.  I'll make the change.
>
> This sounds reasonable according to the documentation, but with your
> longer explanation of the barriers, I think the __iormb/__iowmb definitions
> above are wrong. What you actually need I think is
>
> void writel(u32 v, volatile void __iomem *addr)
> {
>  asm volatile("fence w,o" : : : "memory");
>  writel_relaxed(v, addr);
> }
>
> u32 readl(volatile void __iomem *addr)
> {
>  u32 ret = readl_relaxed(addr);
>  asm volatile("fence i,r" : : : "memory");
>  return ret;
> }
>
> to synchronize between DMA and I/O. The barriers you listed above
> in contrast appear to be directed at synchronizing I/O with other I/O.
> We normally assume that this is not required when you have
> subsequent MMIO accesses on the same device (on PCI) or the
> same address region (per ARM architecture and others). If you do
> need to enforce ordering between MMIO, you might even need to
> add those barriers in the relaxed version to be portable with drivers
> written for ARM SoCs:
>
> void writel_relaxed(u32 v, volatile void __iomem *addr)
> {
>  __raw_writel((__force u32)cpu_to_le32(v), addr);
>  asm volatile("fence o,io" : : : "memory");
> }
>
> u32 readl_relaxed(volatile void __iomem *addr)
> {
>  asm volatile("fence i,io" : : : "memory");
>  return le32_to_cpu((__force __le32)__raw_readl(addr));
> }
>
> You then end up with a barrier before and after each regular
> readl/writel in order to synchronize both with DMA and MMIO
> instructions, and you still need the extra mmiowb() to
> 
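
A minimal sketch of the corrected mmiowb() described above, using the "fence o,w"
that the discussion settles on (illustrative only, not the final patch):

  /* Order prior MMIO writes (O) before the regular memory write (W) that
   * releases a lock, so device writes can't leak past the unlock. */
  #define mmiowb()    __asm__ __volatile__ ("fence o,w" : : : "memory")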

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-06-06 Thread Arnd Bergmann
On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt  wrote:
> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt  wrote:
>>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
 On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  wrote:
>> - Many architectures have cache line prefetch, flush, zero or copy
>>   instructions  that are used for important performance optimizations
>>   but that are typically defined on a cacheline granularity. I don't
>>   think you currently have any of them, but it seems likely that there
>>   will be demand for them later.
>
> We actually have an implicit prefetch (loads to x0, the zero register), but
> it still has all the load side-effects so nothing uses it.
>
>> Having a larger than necessary alignment can waste substantial amounts
>> of memory for arrays of cache line aligned structures (typically
>> per-cpu arrays), but otherwise should not cause harm.
>
> I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
> because of how they line up with DIMMs.  Since there's no spec to define this,
> there's no correct answer.  I'd be amenable to making this a Kconfig option,
> but I think we'll leave it alone for now.  It does match the extant
> implementations.

Hmm, this sounds like a hole in the architecture definition: if you have an
instruction that performs a prefetch (even one that is not easily usable),
I would argue that the cache line size has become a feature of the
architecture and is no longer strictly an implementation detail of the
microarchitecture.

Regarding the memory interface, a lot of systems use two DIMMs on
each memory channel for 128-bit parallel buses, and with LP-DDRx
controllers, you might have a width as small as 16 bits.

> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
> new file mode 100644
> index ..d942555a7a08
> --- /dev/null
> +++ b/arch/riscv/include/asm/io.h
> @@ -0,0 +1,36 @@

> +#ifndef _ASM_RISCV_IO_H
> +#define _ASM_RISCV_IO_H
> +
> +#include 

 I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
 helpers using inline assembly, to prevent the compiler from breaking
 up accesses into byte accesses.

 Also, most architectures require some synchronization after a
 non-relaxed readl() to prevent prefetching of DMA buffers, and
 before a writel() to flush write buffers when a DMA gets triggered.
>>>
>>> Makes sense.  These were all OK on existing implementations (as there's no
>>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>>> actually what the RISC-V ISA says.  I patterned this on arm64
>>>
>>>   
>>> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>>
>>> where I think the only odd thing is our definition of mmiowb
>>>
>>>   +/* IO barriers.  These only fence on the IO bits because they're only required
>>>   + * to order device access.  We're defining mmiowb because our AMO instructions
>>>   + * (which are used to implement locks) don't specify ordering.  From Chapter 7
>>>   + * of v2.2 of the user ISA:
>>>   + * "The bits order accesses to one of the two address domains, memory or I/O,
>>>   + * depending on which address domain the atomic instruction is accessing. No
>>>   + * ordering constraint is implied to accesses to the other domain, and a FENCE
>>>   + * instruction should be used to order across both domains."
>>>   + */
>>>   +
>>>   +#define __iormb()   __asm__ __volatile__ ("fence i,io" : : : "memory");
>>>   +#define __iowmb()   __asm__ __volatile__ ("fence io,o" : : : "memory");
>>
>> Looks ok, yes.
>>
>>>   +#define mmiowb()    __asm__ __volatile__ ("fence io,io" : : : "memory");
>>>
>>> which I think is correct.
>>
>> I can never remember what exactly this one does.
>
> I can't find the reference again, but what I found said that if your atomics
> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
> then you should define mmiowb to ensure ordering.  I managed to screw this up,
> as there's no "w" in the successor set (to actually enforce the AMO ordering).
> This is somewhat confirmed by
>
>   https://lkml.org/lkml/2006/8/31/174
>   Subject: Re: When to use mmiowb()?
>   AFAICT, they're both right.  Generally, mmiowb() should be used prior to
>   unlock in a critical section whose last PIO operation is a writeX.
>
> Thus, I think the actual fence should be at least
>
>   fence o,w
...
> which matches what's above.  I think "fence o,w" is sufficient for a mmiowb on
> RISC-V.  I'll make the change.

This sounds reasonable according to the documentation, but with your
longer explanation of the barriers, I think the __iormb/__iowmb definitions
above are wrong. What you actually need I think is

void writel(u32 v, volatile void __iomem 

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-06-05 Thread Palmer Dabbelt
On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt  wrote:
>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  wrote:
>
>
 +#ifndef _ASM_RISCV_CACHE_H
 +#define _ASM_RISCV_CACHE_H
 +
 +#define L1_CACHE_SHIFT 6
 +
 +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>>
>>> Is this the only valid cache line size on riscv, or just the largest
>>> one that is allowed?
>>
>> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
>> restriction on L1 cache line size (we tried to keep microarchitecture out of
>> the ISA specification).  We provide the actual cache parameters as part of the
>> device tree, but it looks like this needs to be known statically in some places
>> so we can't use that everywhere.
>>
>> We could always make this a Kconfig parameter.
>
> The cache line size is used in a couple of places, let's go through the most
> common ones to see where that abstraction might be leaky and you actually
> get an architectural effect:
>
> - On SMP machines, cacheline_aligned_in_smp is used to annotate
>   data structures used in lockless algorithms, typically with one CPU writing
>   to some members of a structure, and another CPU reading from it but
>   not writing the same members. Depending on the architecture, having a
>   larger actual alignment than L1_CACHE_BYTES will either lead to
>   bad performance from cache line ping pong, or actual data corruption.

On RISC-V it's just a performance problem, so at least it's not catastrophic.

> - On systems with DMA masters that are not fully coherent,
>   cacheline_aligned is used to annotate data structures used
>   for DMA buffers, to make sure that the cache maintenance operations
>   in dma_sync_*_for_*() helpers don't corrupt data outside of the
>   DMA buffer. You don't seem to support noncoherent DMA masters
>   or the cache maintenance operations required to use those, so this
>   might not be a problem until someone adds an extension for those.
>
> - Depending on the bus interconnect, a coherent DMA master might
>   not be able to update partial cache lines, so you need the same
>   annotation.

Well, our (SiFive's) bus is easy to master so hopefully we won't end up doing
that.  There is, of course, the rest of the world -- but that's just a bridge
we'll have to cross later (if such an implementation arises).

> - The kmalloc() family of memory allocators aligns data to the cache
>   line size, for both DMA and SMP synchronization above.

Ya, but luckily just a performance problem on RISC-V.

> - Many architectures have cache line prefetch, flush, zero or copy
>   instructions  that are used for important performance optimizations
>   but that are typically defined on a cacheline granularity. I don't
>   think you currently have any of them, but it seems likely that there
>   will be demand for them later.

We actually have an implicit prefetch (loads to x0, the zero register), but
it still has all the load side-effects so nothing uses it.
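
Purely for illustration, such a load-to-x0 "prefetch" could be wrapped as below;
the helper name is invented, and as noted nothing in the kernel actually uses it:

  /* Hypothetical helper: a load whose destination is x0 ("zero") discards
   * the value, so it acts like a prefetch hint -- but it keeps all the
   * normal load side-effects (it can still fault), which is why it goes
   * unused. */
  static inline void prefetch_x0(const void *p)
  {
          __asm__ __volatile__ ("lw zero, 0(%0)" : : "r" (p));
  }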

> Having a larger than necessary alignment can waste substantial amounts
> of memory for arrays of cache line aligned structures (typically
> per-cpu arrays), but otherwise should not cause harm.

I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
because of how they line up with DIMMs.  Since there's no spec to define this,
there's no correct answer.  I'd be amenable to making this a Kconfig option,
but I think we'll leave it alone for now.  It does match the extant
implementations.
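
If it ever did become a Kconfig option, the cache.h side might look roughly like
this (a sketch; the CONFIG_RISCV_L1_CACHE_SHIFT symbol is hypothetical):

  /* Sketch only: CONFIG_RISCV_L1_CACHE_SHIFT is a hypothetical option. */
  #ifdef CONFIG_RISCV_L1_CACHE_SHIFT
  # define L1_CACHE_SHIFT CONFIG_RISCV_L1_CACHE_SHIFT
  #else
  # define L1_CACHE_SHIFT 6
  #endif
  #define L1_CACHE_BYTES  (1 << L1_CACHE_SHIFT)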

 diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
 new file mode 100644
 index ..d942555a7a08
 --- /dev/null
 +++ b/arch/riscv/include/asm/io.h
 @@ -0,0 +1,36 @@
>>>
 +#ifndef _ASM_RISCV_IO_H
 +#define _ASM_RISCV_IO_H
 +
 +#include 
>>>
>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>> helpers using inline assembly, to prevent the compiler from breaking
>>> up accesses into byte accesses.
>>>
>>> Also, most architectures require some synchronization after a
>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>> before a writel() to flush write buffers when a DMA gets triggered.
>>
>> Makes sense.  These were all OK on existing implementations (as there's no
>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>> actually what the RISC-V ISA says.  I patterned this on arm64
>>
>>   
>> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>
>> where I think the only odd thing is our definition of mmiowb
>>
>>   +/* IO barriers.  These only fence on the IO bits because they're only required
>>   + * to order device access.  We're defining mmiowb because our AMO instructions
>>   + * (which are used to implement locks) don't specify 

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-06-02 Thread Palmer Dabbelt
On Tue, 23 May 2017 14:23:50 PDT (-0700), b...@kernel.crashing.org wrote:
> On Tue, 2017-05-23 at 14:55 +0200, Arnd Bergmann wrote:
>> > +
>> > +#include 
>>
>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>> helpers using inline assembly, to prevent the compiler from breaking
>> up accesses into byte accesses.
>>
>> Also, most architectures require some synchronization after a
>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>> before a writel() to flush write buffers when a DMA gets triggered.
>
> Right, I was about to comment on that one.

Well, you're both correct: what was there just isn't correct.  Our
implementations were safe because they don't have aggressive MMIO systems, but
that won't remain true for long.

I've gone ahead and added a proper IO implementation patterned on arm64.  It'll
be part of the v2 patch set.  Here's the bulk of the patch, if you're curious

  
https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
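
The arm64 pattern being borrowed composes each non-relaxed accessor from its
relaxed form plus an I/O barrier; roughly speaking (a sketch of the idea, not the
literal contents of that commit):

  /* Sketch of the arm64-style composition, not the literal RISC-V commit. */
  #define __iormb()       __asm__ __volatile__ ("fence i,io" : : : "memory")
  #define __iowmb()       __asm__ __volatile__ ("fence io,o" : : : "memory")

  #define readl(c)        ({ u32 __v = readl_relaxed(c); __iormb(); __v; })
  #define writel(v, c)    ({ __iowmb(); writel_relaxed((v), (c)); })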

> The question, Palmer, is about the ordering semantics of non-cached
> storage.
>
> What kind of ordering is provided architecturally ? Especially
> between cachable and non-cachable loads and stores ?

The memory model on RISC-V is pretty weak.  Without fences there are no
ordering constraints.  We provide 2 "ordering spaces" (an odd name I just made
up, there's one address space): the IO space and the regular memory space.  The
base RISC-V ordering primitive is a fence, which takes a predecessor set and
successor set.  Fences can look like

  fence IORW,IORW

where the left side is the predecessor set and the right side is the successor
set.  The fence enforces ordering between any operations in the two sets: all
operations in the predecessor set must be globally visible before any operation
in the successor set becomes visible anywhere.

For example, if you're emitting a DMA transaction you'd have to do something
like

  build_message_in_memory()
  fence w,o
  set_control_register()

with the fence ensuring all the memory writes are visible before the control
register write.
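
Rendered as kernel C, the same pattern might look like the sketch below; the
device, its registers, and the descriptor layout are invented for illustration:

  /* Illustrative only; the device and its layout are invented. */
  struct mydev_desc { u64 addr; u32 len; };
  struct mydev { struct mydev_desc *desc; void __iomem *regs; };
  #define MYDEV_DOORBELL 0x10

  /* Publish the in-memory descriptor, then ring the doorbell.  The
   * "fence w,o" orders the cacheable writes (W) ahead of the MMIO
   * store (O), exactly as described above. */
  static void mydev_kick(struct mydev *md, dma_addr_t buf, u32 len)
  {
          md->desc->addr = buf;
          md->desc->len  = len;
          __asm__ __volatile__ ("fence w,o" : : : "memory");
          __raw_writel(1, md->regs + MYDEV_DOORBELL);
  }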

More information can be found in the ISA manuals

  
https://github.com/riscv/riscv-isa-manual/releases/download/riscv-user-2.2/riscv-spec-v2.2.pdf
  
https://github.com/riscv/riscv-isa-manual/releases/download/riscv-priv-1.10/riscv-privileged-v1.10.pdf

> Also, you have PCIe right ? What is the behaviour of MSIs ?
>
> Does your HW provide a guarantee that in the case of a series of DMA
> writes to memory by a device followed by an MSI, the CPU getting the
> MSI will only get it after all the previous DMA writes have reached
> coherency ? (Unlike LSIs where the driver is required to do an MMIO
> read from the device, MSIs are expected to be ordered with data).

We do have PCIe, but I'm not particularly familiar with it as I haven't spent
any time hacking on our PCIe hardware or driver.  My understanding here is that
PCIe defines that MSIs must not be reordered before the DMA writes, so the
implementation is required to enforce this ordering.  Thus it's a problem for
the PCIe controller implementation (which isn't covered by RISC-V) and therefor
doesn't need any ordering enforced by the driver.

If I'm correct in the assumption that the hardware is required to enforce these
ordering constraints then I think this isn't a RISC-V issue.  I'll go bug our
PCIe guys to make sure everything is kosher in that case, but it's an ordering
constraint that is possible to enforce in our coherence protocol so it's not a
fundamental problem (and just an implementation one at that).

If the hardware isn't required to enforce the ordering, then we'll need a fence
before handling the data.

> Another thing with the read*() accessors. It's not uncommon for
> a driver to do:
>
>   writel(1, reset_reg);
>   readl(reset_reg); /* flush posted writes */
>   udelay(10);
>   writel(0, reset_reg);
>
> Now, in the above case, what can typically happen if you aren't careful
> is that the readl which is intended to "push" the previous writel, will
> not actually do its job because the return value hasn't been "consumed"
> by the processor. Thus, the CPU will stick that on some kind of load
> queue and won't actually wait for the return value before hitting the
> delay loop.
>
> Thus you might end up in a situation where the writel of 1 to the
> device is itself reaching the device way after you started the delay
> loop, and thus end up violating the delay requirement of the HW.
>
> On powerpc we solve that by using a special instruction construct
> inside the read* accessors that prevents the CPU from executing
> subsequent instructions until the read value has been returned.
>
> You may want to consider something similar.

Ooh, that's a fun one :).   I bugged Andrew (the ISA wizard), and this might
require some clarification in the ISA manual.  The code we emit will look
something like

  fence io,o
  st 1, RESET_REG
  ld RESET_REG
  fence o,io

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-06-01 Thread Arnd Bergmann
On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt  wrote:
> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  wrote:


>>> +#ifndef _ASM_RISCV_CACHE_H
>>> +#define _ASM_RISCV_CACHE_H
>>> +
>>> +#define L1_CACHE_SHIFT 6
>>> +
>>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>
>> Is this the only valid cache line size on riscv, or just the largest
>> one that is allowed?
>
> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
> restriction on L1 cache line size (we tried to keep microarchitecture out of
> the ISA specification).  We provide the actual cache parameters as part of the
> device tree, but it looks like this needs to be known statically in some places
> so we can't use that everywhere.
>
> We could always make this a Kconfig parameter.

The cache line size is used in a couple of places, let's go through the most
common ones to see where that abstraction might be leaky and you actually
get an architectural effect:

- On SMP machines, cacheline_aligned_in_smp is used to annotate
  data structures used in lockless algorithms, typically with one CPU writing
  to some members of a structure, and another CPU reading from it but
  not writing the same members. Depending on the architecture, having a
  larger actual alignment than L1_CACHE_BYTES will either lead to
  bad performance from cache line ping pong, or actual data corruption.

- On systems with DMA masters that are not fully coherent,
  cacheline_aligned is used to annotate data structures used
  for DMA buffers, to make sure that the cache maintenance operations
>   in dma_sync_*_for_*() helpers don't corrupt data outside of the
  DMA buffer. You don't seem to support noncoherent DMA masters
  or the cache maintenance operations required to use those, so this
  might not be a problem until someone adds an extension for those.

- Depending on the bus interconnect, a coherent DMA master might
  not be able to update partial cache lines, so you need the same
  annotation.

- The kmalloc() family of memory allocators aligns data to the cache
  line size, for both DMA and SMP synchronization above.

- Many architectures have cache line prefetch, flush, zero or copy
  instructions  that are used for important performance optimizations
  but that are typically defined on a cacheline granularity. I don't
  think you currently have any of them, but it seems likely that there
  will be demand for them later.

Having a larger than necessary alignment can waste substantial amounts
of memory for arrays of cache line aligned structures (typically
per-cpu arrays), but otherwise should not cause harm.
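
For reference, the kind of structure the first point above is describing typically
looks something like this illustrative producer/consumer layout, using the standard
annotation from linux/cache.h:

  #include <linux/cache.h>    /* for ____cacheline_aligned_in_smp */

  struct ring {
          /* written by the producer CPU, read by the consumer */
          unsigned int head ____cacheline_aligned_in_smp;
          /* written by the consumer CPU, read by the producer */
          unsigned int tail ____cacheline_aligned_in_smp;
  };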

>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>> new file mode 100644
>>> index ..d942555a7a08
>>> --- /dev/null
>>> +++ b/arch/riscv/include/asm/io.h
>>> @@ -0,0 +1,36 @@
>>
>>> +#ifndef _ASM_RISCV_IO_H
>>> +#define _ASM_RISCV_IO_H
>>> +
>>> +#include 
>>
>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>> helpers using inline assembly, to prevent the compiler from breaking
>> up accesses into byte accesses.
>>
>> Also, most architectures require some synchronization after a
>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>> before a writel() to flush write buffers when a DMA gets triggered.
>
> Makes sense.  These were all OK on existing implementations (as there's no
> writable PMAs, so all MMIO regions are strictly ordered), but that's not
> actually what the RISC-V ISA says.  I patterned this on arm64
>
>   
> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>
> where I think the only odd thing is our definition of mmiowb
>
>   +/* IO barriers.  These only fence on the IO bits because they're only required
>   + * to order device access.  We're defining mmiowb because our AMO instructions
>   + * (which are used to implement locks) don't specify ordering.  From Chapter 7
>   + * of v2.2 of the user ISA:
>   + * "The bits order accesses to one of the two address domains, memory or I/O,
>   + * depending on which address domain the atomic instruction is accessing. No
>   + * ordering constraint is implied to accesses to the other domain, and a FENCE
>   + * instruction should be used to order across both domains."
>   + */
>   +
>   +#define __iormb()   __asm__ __volatile__ ("fence i,io" : : : "memory");
>   +#define __iowmb()   __asm__ __volatile__ ("fence io,o" : : : "memory");

Looks ok, yes.

>   +#define mmiowb()    __asm__ __volatile__ ("fence io,io" : : : "memory");
>
> which I think is correct.

I can never remember what exactly this one does.

>>> +#ifdef __KERNEL__
>>> +
>>> +#ifdef CONFIG_MMU
>>> +
>>> +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size);
>>> +
>>> +#define ioremap_nocache(addr, size) ioremap((addr), (size))
>>> +#define 

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-05-31 Thread Palmer Dabbelt
On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  wrote:
>> +/**
>> + * atomic_read - read atomic variable
>> + * @v: pointer of type atomic_t
>> + *
>> + * Atomically reads the value of @v.
>> + */
>> +static inline int atomic_read(const atomic_t *v)
>> +{
>> +   return *((volatile int *)(&(v->counter)));
>> +}
>> +/**
>> + * atomic_set - set atomic variable
>> + * @v: pointer of type atomic_t
>> + * @i: required value
>> + *
>> + * Atomically sets the value of @v to @i.
>> + */
>> +static inline void atomic_set(atomic_t *v, int i)
>> +{
>> +   v->counter = i;
>> +}
>
> These commonly use READ_ONCE() and WRITE_ONCE,
> I'd recommend doing the same here to be on the safe side.

Makes sense.  
https://github.com/riscv/riscv-linux/commit/77647f9e4dccab68c69a212a63c9efe1db2b7b1c
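
Spelled out, the READ_ONCE()/WRITE_ONCE() form being suggested looks roughly like
this (a sketch of the suggestion, not necessarily the exact contents of that
commit):

  static inline int atomic_read(const atomic_t *v)
  {
          return READ_ONCE(v->counter);
  }

  static inline void atomic_set(atomic_t *v, int i)
  {
          WRITE_ONCE(v->counter, i);
  }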

>> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
>> new file mode 100644
>> index ..10d894ac3137
>> --- /dev/null
>> +++ b/arch/riscv/include/asm/bug.h
>> @@ -0,0 +1,81 @@
>> +/*
>>
>> +#ifndef _ASM_RISCV_BUG_H
>> +#define _ASM_RISCV_BUG_H
>
>> +#ifdef CONFIG_GENERIC_BUG
>> +#define __BUG_INSN _AC(0x00100073, UL) /* sbreak */
>
> Please have a look at the modifications I did for !CONFIG_BUG
> on x86, arm and arm64. It's generally better to define BUG to a
> trap even when CONFIG_BUG is disabled, otherwise you run
> into undefined behavior in some code, and gcc will print annoying
> warnings about that.

OK, seems like a good thing.  
https://github.com/riscv/riscv-linux/commit/67db001653614c6555424b3812d7edfba12a6d4c
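
The shape of that change is roughly the following (a sketch using the "sbreak"
trap quoted above, not necessarily the exact contents of that commit):

  /* Sketch only: keep BUG() as a real trap whether or not CONFIG_BUG is
   * enabled, so callers that rely on BUG() not returning don't fall into
   * undefined behaviour. */
  #define BUG() do {                              \
          __asm__ __volatile__ ("sbreak");        \
          unreachable();                          \
  } while (0)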

>> +#ifndef _ASM_RISCV_CACHE_H
>> +#define _ASM_RISCV_CACHE_H
>> +
>> +#define L1_CACHE_SHIFT 6
>> +
>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>
> Is this the only valid cache line size on riscv, or just the largest
> one that is allowed?

The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
restriction on L1 cache line size (we tried to keep microarchitecture out of
the ISA specification).  We provide the actual cache parameters as part of the
device tree, but it looks like this needs to be known statically in some places
so we can't use that everywhere.

We could always make this a Kconfig parameter.

>> +
>> +static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
>> +{
>> +   return (dma_addr_t)paddr;
>> +}
>> +
>> +static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t dev_addr)
>> +{
>> +   return (phys_addr_t)dev_addr;
>> +}
>
> What do you need these for? If possible, try to remove them.
>
>> +static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size, enum dma_data_direction dir)
>> +{
>> +   /*
>> +* RISC-V is cache-coherent, so this is mostly a no-op.
>> +* However, we do need to ensure that dma_cache_sync()
>> +* enforces order, hence the mb().
>> +*/
>> +   mb();
>> +}
>
> Do you even support any drivers that use
> dma_alloc_noncoherent()/dma_cache_sync()?
>
> I would guess you can just leave this out.

These must have been vestigial code; they appear safe to remove.

  
https://github.com/riscv/riscv-linux/commit/d1c88783d5ff66464a25173f7a4af139f0ebf5e2

>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>> new file mode 100644
>> index ..d942555a7a08
>> --- /dev/null
>> +++ b/arch/riscv/include/asm/io.h
>> @@ -0,0 +1,36 @@
>
>> +#ifndef _ASM_RISCV_IO_H
>> +#define _ASM_RISCV_IO_H
>> +
>> +#include 
>
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
>
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

Makes sense.  These were all OK on existing implementations (as there are no
writable PMAs, so all MMIO regions are strictly ordered), but that's not
actually what the RISC-V ISA says.  I patterned this on arm64

  
https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa

where I think the only odd thing is our definition of mmiowb

  +/* IO barriers.  These only fence on the IO bits because they're only 
required
  + * to order device access.  We're defining mmiowb because our AMO 
instructions
  + * (which are used to implement locks) don't specify ordering.  From Chapter 
7
  + * of v2.2 of the user ISA:
  + * "The bits order accesses to one of the two address domains, memory or I/O,
  + * depending on which address domain the atomic instruction is accessing. No
  + * ordering constraint is implied to accesses to the other domain, and a 
FENCE
  + * instruction should be used to order across both domains."
  + */
  +
  +#define __iormb()   __asm__ __volatile__ ("fence i,io" : : : 
"memory");
  

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-05-23 Thread Benjamin Herrenschmidt
On Tue, 2017-05-23 at 14:55 +0200, Arnd Bergmann wrote:
> > +
> > +#include 
> 
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
>
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

Right, I was about to comment on that one.

The question, Palmer, is about the ordering semantics of non-cached
storage.

What kind of ordering is provided architecturally ? Especially
between cachable and non-cachable loads and stores ?

Also, you have PCIe right ? What is the behaviour of MSIs ?

Does your HW provide a guarantee that in the case of a series of DMA
writes to memory by a device followed by an MSI, the CPU getting the
MSI will only get it after all the previous DMA writes have reached
coherency ? (Unlike LSIs where the driver is required to do an MMIO
read from the device, MSIs are expected to be ordered with data).

Another thing with the read*() accessors: it's not uncommon for
a driver to do:

writel(1, reset_reg);
readl(reset_reg); /* flush posted writes */
udelay(10);
writel(0, reset_reg);

Now, in the above case, what can typically happen if you aren't careful
is that the readl, which is intended to "push" the previous writel, will
not actually do its job because the return value hasn't been "consumed"
by the processor. Thus, the CPU will stick that on some kind of load
queue and won't actually wait for the return value before hitting the
delay loop.

Thus you might end up in a situation where the writel of 1 to the
device is itself reaching the device way after you started the delay
loop, and thus end up violating the delay requirement of the HW.

On powerpc we solve that by using a special instruction construct
inside the read* accessors that prevents the CPU from executing
subsequent instructions until the read value has been returned.

You may want to consider something similar.
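
For reference, the powerpc construct is roughly the following (paraphrased
from memory, not the exact arch/powerpc source; the function name is
illustrative).  The twi with condition 0 never traps, but it depends on the
loaded value, and the isync can't complete until the twi has its operand, so
nothing after the accessor runs until the MMIO read has actually returned:

static inline u32 mmio_read32(const volatile u32 __iomem *addr)
{
        u32 ret;

        __asm__ __volatile__ ("lwz %0,0(%1); twi 0,%0,0; isync"
                              : "=r" (ret) : "b" (addr) : "memory");
        return ret;
}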

Cheers,
Ben.



Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-05-23 Thread Arnd Bergmann
On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt  wrote:
> +/**
> + * atomic_read - read atomic variable
> + * @v: pointer of type atomic_t
> + *
> + * Atomically reads the value of @v.
> + */
> +static inline int atomic_read(const atomic_t *v)
> +{
> +   return *((volatile int *)(&(v->counter)));
> +}
> +/**
> + * atomic_set - set atomic variable
> + * @v: pointer of type atomic_t
> + * @i: required value
> + *
> + * Atomically sets the value of @v to @i.
> + */
> +static inline void atomic_set(atomic_t *v, int i)
> +{
> +   v->counter = i;
> +}

These commonly use READ_ONCE() and WRITE_ONCE();
I'd recommend doing the same here to be on the safe side.

> +/**
> + * atomic64_read - read atomic64 variable
> + * @v: pointer of type atomic64_t
> + *
> + * Atomically reads the value of @v.
> + */
> +static inline s64 atomic64_read(const atomic64_t *v)
> +{
> +   return *((volatile long *)(&(v->counter)));
> +}
> +
> +/**
> + * atomic64_set - set atomic64 variable
> + * @v: pointer to type atomic64_t
> + * @i: required value
> + *
> + * Atomically sets the value of @v to @i.
> + */
> +static inline void atomic64_set(atomic64_t *v, s64 i)
> +{
> +   v->counter = i;
> +}

same here
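
i.e. presumably something along these lines for the 64-bit variants (sketch):

static inline s64 atomic64_read(const atomic64_t *v)
{
        return READ_ONCE(v->counter);
}

static inline void atomic64_set(atomic64_t *v, s64 i)
{
        WRITE_ONCE(v->counter, i);
}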

> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
> new file mode 100644
> index ..10d894ac3137
> --- /dev/null
> +++ b/arch/riscv/include/asm/bug.h
> @@ -0,0 +1,81 @@
> +/*
>
> +#ifndef _ASM_RISCV_BUG_H
> +#define _ASM_RISCV_BUG_H

> +#ifdef CONFIG_GENERIC_BUG
> +#define __BUG_INSN _AC(0x00100073, UL) /* sbreak */

Please have a look at the modifications I did for !CONFIG_BUG
on x86, arm and arm64. It's generally better to define BUG to a
trap even when CONFIG_BUG is disabled, otherwise you run
into undefined behavior in some code, and gcc will print annoying
warnings about that.

> +#ifndef _ASM_RISCV_CACHE_H
> +#define _ASM_RISCV_CACHE_H
> +
> +#define L1_CACHE_SHIFT 6
> +
> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)

Is this the only valid cache line size on riscv, or just the largest
one that is allowed?

> +
> +static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
> +{
> +   return (dma_addr_t)paddr;
> +}
> +
> +static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t 
> dev_addr)
> +{
> +   return (phys_addr_t)dev_addr;
> +}

What do you need these for? If possible, try to remove them.

> +static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t 
> size, enum dma_data_direction dir)
> +{
> +   /*
> +* RISC-V is cache-coherent, so this is mostly a no-op.
> +* However, we do need to ensure that dma_cache_sync()
> +* enforces order, hence the mb().
> +*/
> +   mb();
> +}

Do you even support any drivers that use
dma_alloc_noncoherent()/dma_cache_sync()?

I would guess you can just leave this out.

> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
> new file mode 100644
> index ..d942555a7a08
> --- /dev/null
> +++ b/arch/riscv/include/asm/io.h
> @@ -0,0 +1,36 @@

> +#ifndef _ASM_RISCV_IO_H
> +#define _ASM_RISCV_IO_H
> +
> +#include 

I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
helpers using inline assembly, to prevent the compiler from breaking
up accesses into byte accesses.

Also, most architectures require some synchronization after a
non-relaxed readl() to prevent prefetching of DMA buffers, and
before a writel() to flush write buffers when a DMA gets triggered.

> +#ifdef __KERNEL__
> +
> +#ifdef CONFIG_MMU
> +
> +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size);
> +
> +#define ioremap_nocache(addr, size) ioremap((addr), (size))
> +#define ioremap_wc(addr, size) ioremap((addr), (size))
> +#define ioremap_wt(addr, size) ioremap((addr), (size))

Is this a hard architecture limitation? Normally you really want
write-combined access on frame buffer memory and a few other
cases for performance reasons, and ioremap_wc() gets used
by memremap() for addressing RAM in some cases, and you
normally don't want to have PTEs for the same memory using
cached and uncached page flags.

> diff --git a/arch/riscv/include/asm/serial.h b/arch/riscv/include/asm/serial.h
> new file mode 100644
> index ..d783dbe80a4b
> --- /dev/null
> +++ b/arch/riscv/include/asm/serial.h
> @@ -0,0 +1,43 @@
> +/*
> + * Copyright (C) 2014 Regents of the University of California
> + *
> + *   This program is free software; you can redistribute it and/or
> + *   modify it under the terms of the GNU General Public License
> + *   as published by the Free Software Foundation, version 2.
> + *
> + *   This program is distributed in the hope that it will be useful, but
> + *   WITHOUT ANY WARRANTY; without even the implied warranty of
> + *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + *   NON INFRINGEMENT.  See the GNU General Public License for
> + *   more details.
> + */
> +
>