Re: [PATCH 2/2] KVM: PPC: Book3S HV: lockless tlbie for HPT hcalls

2018-04-05 Thread Nicholas Piggin
On Fri,  6 Apr 2018 03:56:31 +1000
Nicholas Piggin  wrote:

> tlbies to an LPAR do not have to be serialised since POWER4,
> MMU_FTR_LOCKLESS_TLBIE can be used to avoid the spin lock in
> do_tlbies.
> 
> Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
> in HPT mode. 32 instances of the powerpc fork benchmark from selftests
> were run with --fork, and the results measured.
> 
> Without this patch, total throughput was about 13.5K/sec, and this is
> the top of the host profile:
> 
>    74.52%  [k] do_tlbies
>     2.95%  [k] kvmppc_book3s_hv_page_fault
>     1.80%  [k] calc_checksum
>     1.80%  [k] kvmppc_vcpu_run_hv
>     1.49%  [k] kvmppc_run_core
> 
> After this patch, throughput was about 51K/sec, with this profile:
> 
>    21.28%  [k] do_tlbies
>     5.26%  [k] kvmppc_run_core
>     4.88%  [k] kvmppc_book3s_hv_page_fault
>     3.30%  [k] _raw_spin_lock_irqsave
>     3.25%  [k] gup_pgd_range
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 78e6a392330f..0221a0f74f07 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -439,6 +439,9 @@ static inline int try_lock_tlbie(unsigned int *lock)
>   unsigned int tmp, old;
>   unsigned int token = LOCK_TOKEN;
>  
> + if (mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
> + return 1;
> +
>   asm volatile("1:lwarx   %1,0,%2\n"
>"  cmpwi   cr0,%1,0\n"
>"  bne 2f\n"
> @@ -452,6 +455,12 @@ static inline int try_lock_tlbie(unsigned int *lock)
>   return old == 0;
>  }
>  
> +static inline void unlock_tlbie_after_sync(unsigned int *lock)
> +{
> + if (mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
> + return;
> +}
> +
>  static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
> long npages, int global, bool need_sync)
>  {
> @@ -483,7 +492,7 @@ static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
>   }
>  
>   asm volatile("eieio; tlbsync; ptesync" : : : "memory");
> - kvm->arch.tlbie_lock = 0;
> + unlock_tlbie_after_sync(&kvm->arch.tlbie_lock);

Well, that's a silly bug in the !LOCKLESS path: the store was supposed
to move into the unlock helper, of course. I'll fix it up after allowing
some time for comments.
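
Presumably the intended shape, once the store moves into the helper, is
something like this (a sketch only, pending the respin):

static inline void unlock_tlbie_after_sync(unsigned int *lock)
{
	if (mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
		return;
	/* the eieio; tlbsync; ptesync in the caller orders the tlbies */
	*lock = 0;
}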

Thanks,
Nick


[PATCH 2/2] powerpc/mm/memtrace: Let the arch hotunplug code flush cache

2018-04-05 Thread Balbir Singh
Don't do this via custom code; instead, now that we have support
in the arch hotplug/hotunplug code, rely on those routines
to do the right thing.

Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
The Fixes tag applies because the older code uses ppc64_caches.l1d.size
where it should use ppc64_caches.l1d.line_size.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/platforms/powernv/memtrace.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index de470caf0784..fc222a0c2ac4 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -82,19 +82,6 @@ static const struct file_operations memtrace_fops = {
.open   = simple_open,
 };
 
-static void flush_memory_region(u64 base, u64 size)
-{
-   unsigned long line_size = ppc64_caches.l1d.size;
-   u64 end = base + size;
-   u64 addr;
-
-   base = round_down(base, line_size);
-   end = round_up(end, line_size);
-
-   for (addr = base; addr < end; addr += line_size)
-   asm volatile("dcbf 0,%0" : "=r" (addr) :: "memory");
-}
-
 static int check_memblock_online(struct memory_block *mem, void *arg)
 {
if (mem->state != MEM_ONLINE)
@@ -132,10 +119,6 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
  change_memblock_state);
 
-   /* RCU grace period? */
-   flush_memory_region((u64)__va(start_pfn << PAGE_SHIFT),
-   nr_pages << PAGE_SHIFT);
-
lock_device_hotplug();
remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
unlock_device_hotplug();
-- 
2.13.6



[PATCH 1/2] powerpc/mm: Flush cache on memory hot(un)plug

2018-04-05 Thread Balbir Singh
This patch adds support for flushing potentially dirty
cache lines when memory is hot-plugged/hot-unplugged.
The support is currently limited to 64-bit systems.

The bug was exposed when mappings for a device were
hot-unplugged and then plugged back in later. A similar
issue was observed during the development of memtrace,
but memtrace does its own flushing of the region via a
custom routine.

These patches flush on both hotplug and hot-unplug to
clear any stale data in the cache w.r.t. the mappings;
there is still a small race window where a clean cache
line may be created again just prior to tearing down
the mapping. (An illustrative sketch of such a flush
follows below.)

The patches were tested by disabling the flush routines
in memtrace and doing I/O on the trace file. The system
immediately checkstops (quite reliably if, prior to the
hot-unplug of the memtrace region, we memset the regions
we are about to hot-unplug). After these patches no
custom flushing is needed in the memtrace code.
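
For illustration only, a range flush of this kind conceptually walks the
range one L1 data-cache line at a time. flush_inval_dcache_range() is the
real (asm) routine the patch calls; this sketch merely shows the idea:

/* sketch, not the kernel routine; assumes ppc64_caches is visible */
static void sketch_flush_dcache_range(unsigned long start, unsigned long stop)
{
	unsigned long line_size = ppc64_caches.l1d.line_size; /* not .size */
	unsigned long addr = start & ~(line_size - 1);

	for (; addr < stop; addr += line_size)
		asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
	asm volatile("sync" : : : "memory");
}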

Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/mem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 85245ef97e72..0a8959b15b39 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -143,6 +143,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
start, start + size, rc);
return -EFAULT;
}
+   flush_inval_dcache_range(start, start + size);
 
return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
 }
@@ -169,6 +170,7 @@ int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap
 
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
+   flush_inval_dcache_range(start, start + size);
ret = remove_section_mapping(start, start + size);
 
/* Ensure all vmalloc mappings are flushed in case they also
-- 
2.13.6



[PATCH v3 4/4] powerpc/powernv: Create platform devs for nvdimm buses

2018-04-05 Thread Oliver O'Halloran
Scan the device tree for nvdimm-bus compatible nodes and create
a platform device for each.
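
For context, opal_pdev_init() is the existing helper used below; it
presumably amounts to creating one platform device per matching node,
roughly:

static void opal_pdev_init(const char *compatible)
{
	struct device_node *np;

	for_each_compatible_node(np, NULL, compatible)
		of_platform_device_create(np, NULL, NULL);
}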

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/opal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index c15182765ff5..c37485a3c5c9 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -821,6 +821,9 @@ static int __init opal_init(void)
/* Create i2c platform devices */
opal_pdev_init("ibm,opal-i2c");
 
+   /* Handle non-volatile memory devices */
+   opal_pdev_init("pmem-region");
+
	/* Setup a heartbeat thread if requested by OPAL */
opal_init_heartbeat();
 
-- 
2.9.5



[PATCH v3 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Oliver O'Halloran
Add device-tree binding documentation for the nvdimm region driver.

Cc: devicet...@vger.kernel.org
Signed-off-by: Oliver O'Halloran 
---
v2: Changed name from nvdimm-region to pmem-region.
Cleaned up the example binding and fixed the overlapping regions.
Added support for multiple regions in a single reg.
v3: Removed platform bus boilerplate from the example.
Changed description of the volatile and reg properties
to make them more clear.
---
 .../devicetree/bindings/pmem/pmem-region.txt   | 65 ++
 MAINTAINERS|  1 +
 2 files changed, 66 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pmem/pmem-region.txt

diff --git a/Documentation/devicetree/bindings/pmem/pmem-region.txt b/Documentation/devicetree/bindings/pmem/pmem-region.txt
new file mode 100644
index ..5cfa4f016a00
--- /dev/null
+++ b/Documentation/devicetree/bindings/pmem/pmem-region.txt
@@ -0,0 +1,65 @@
+Device-tree bindings for persistent memory regions
+---------------------------------------------------
+
+Persistent memory refers to a class of memory devices that are:
+
+   a) Usable as main system memory (i.e. cacheable), and
+   b) Able to retain their contents across power failure.
+
+Given b) it is best to think of persistent memory as a kind of memory-mapped
+storage device. To ensure data integrity, the operating system needs to manage
+persistent regions separately from the normal memory pool. To aid with that,
+this binding provides a standardised interface for discovering where persistent
+memory regions exist inside the physical address space.
+
+Bindings for the region nodes:
+------------------------------
+
+Required properties:
+   - compatible = "pmem-region"
+
+   - reg = <base, size>;
+   The reg property should specify an address range that is
+   translatable to a system physical address range. This address
+   range should be mappable as normal system memory would be
+   (i.e. cacheable).
+
+   If the reg property contains multiple address ranges
+   each address range will be treated as though it was specified
+   in a separate device node. Having multiple address ranges in a
+   node implies no special relationship between the ranges.
+
+Optional properties:
+   - Any relevant NUMA associativity properties for the target platform.
+
+   - volatile; This property indicates that this region is actually
+ backed by non-persistent memory. This lets the OS know that it
+ may skip the cache flushes required to ensure data is made
+ persistent after a write.
+
+ If this property is absent then the OS must assume that the region
+ is backed by non-volatile memory.
+
+Examples:
+
+--------
+   /*
+* This node specifies one 4KB region spanning from
+* 0x5000 to 0x5fff that is backed by non-volatile memory.
+*/
+   pmem@5000 {
+   compatible = "pmem-region";
+   reg = <0x5000 0x1000>;
+   };
+
+   /*
+* This node specifies two 4KB regions that are backed by
+* volatile (normal) memory.
+*/
+   pmem@6000 {
+   compatible = "pmem-region";
+   reg = < 0x6000 0x1000
+   0x8000 0x1000 >;
+   volatile;
+   };
+
diff --git a/MAINTAINERS b/MAINTAINERS
index df240740ca78..cbd289d58644 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8020,6 +8020,7 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/of_pmem.c
+F: Documentation/devicetree/bindings/pmem/pmem-region.txt
 
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
-- 
2.9.5



[PATCH v3 2/4] libnvdimm: Add device-tree based driver

2018-04-05 Thread Oliver O'Halloran
This patch adds preliminary device-tree bindings for persistent memory
regions. The driver registers a libnvdimm bus for each pmem-region
node and each address range under the node is converted to a region
within that bus.

Signed-off-by: Oliver O'Halloran 
---
v2: Made each bus have a separate node rather having a shared bus.
Renamed to of_pmem rather than of_nvdimm.
Changed log level of happy-path messages to debug.
v3: Replaced of_nd_region_* prefix with of_pmem_region_* to make
the driver specific parts more distinct from the libnvdimm
parts.
---
 MAINTAINERS  |   7 +++
 drivers/nvdimm/Kconfig   |  10 
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/of_pmem.c | 119 +++
 4 files changed, 137 insertions(+)
 create mode 100644 drivers/nvdimm/of_pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8133e97980f1..df240740ca78 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8014,6 +8014,13 @@ Q:   https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem*
 
+LIBNVDIMM: DEVICETREE BINDINGS
+M: Oliver O'Halloran 
+L: linux-nvd...@lists.01.org
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/of_pmem.c
+
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
 L: linux-nvd...@lists.01.org
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a65f2e1d9f53..2d6862bf7436 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,4 +102,14 @@ config NVDIMM_DAX
 
  Select Y if unsure
 
+config OF_PMEM
+   tristate "Device-tree support for persistent memory regions"
+   depends on OF
+   default LIBNVDIMM
+   help
+ Allows regions of persistent memory to be described in the
+ device-tree.
+
+ Select Y if unsure.
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 70d5f3ad9909..e8847045dac0 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
+obj-$(CONFIG_OF_PMEM) += of_pmem.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
new file mode 100644
index ..85013bad35de
--- /dev/null
+++ b/drivers/nvdimm/of_pmem.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#define pr_fmt(fmt) "of_pmem: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const struct attribute_group *region_attr_groups[] = {
+   &nd_region_attribute_group,
+   &nd_device_attribute_group,
+   NULL,
+};
+
+static const struct attribute_group *bus_attr_groups[] = {
+   &nvdimm_bus_attribute_group,
+   NULL,
+};
+
+struct of_pmem_private {
+   struct nvdimm_bus_descriptor bus_desc;
+   struct nvdimm_bus *bus;
+};
+
+static int of_pmem_region_probe(struct platform_device *pdev)
+{
+   struct of_pmem_private *priv;
+   struct device_node *np;
+   struct nvdimm_bus *bus;
+   bool is_volatile;
+   int i;
+
+   np = dev_of_node(&pdev->dev);
+   if (!np)
+   return -ENXIO;
+
+   priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   priv->bus_desc.attr_groups = bus_attr_groups;
+   priv->bus_desc.provider_name = "of_pmem";
+   priv->bus_desc.module = THIS_MODULE;
+   priv->bus_desc.of_node = np;
+
+   priv->bus = bus = nvdimm_bus_register(&pdev->dev, &priv->bus_desc);
+   if (!bus) {
+   kfree(priv);
+   return -ENODEV;
+   }
+   platform_set_drvdata(pdev, priv);
+
+   is_volatile = !!of_find_property(np, "volatile", NULL);
+   dev_dbg(&pdev->dev, "Registering %s regions from %pOF\n",
+   is_volatile ? "volatile" : "non-volatile", np);
+
+   for (i = 0; i < pdev->num_resources; i++) {
+   struct nd_region_desc ndr_desc;
+   struct nd_region *region;
+
+   /*
+* NB: libnvdimm copies the data from ndr_desc into its own
+* structures so passing a stack pointer is fine.
+*/
+   memset(&ndr_desc, 0, sizeof(ndr_desc));
+   ndr_desc.attr_groups = region_attr_groups;
+   ndr_desc.numa_node = of_node_to_nid(np);
+   ndr_desc.res = &pdev->resource[i];
+   ndr_desc.of_node = np;
+   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+
+   if (is_volatile)
+   region = nvdimm_volatile_region_create(bus, &ndr_desc);
+   else
+   region = nvdimm_pmem_region_create(bus, &ndr_desc);
+
+   if (!region)
+   dev_warn(&pdev->dev, "Unable to register region %pR 
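
(The archive truncates the rest of the probe here. For completeness, the
remainder of such a driver presumably looks roughly like the following,
with the remove path and match table names assumed from the code above:)

static int of_pmem_region_remove(struct platform_device *pdev)
{
	struct of_pmem_private *priv = platform_get_drvdata(pdev);

	nvdimm_bus_unregister(priv->bus);
	kfree(priv);
	return 0;
}

static const struct of_device_id of_pmem_region_match[] = {
	{ .compatible = "pmem-region" },
	{ },
};
MODULE_DEVICE_TABLE(of, of_pmem_region_match);

static struct platform_driver of_pmem_region_driver = {
	.probe = of_pmem_region_probe,
	.remove = of_pmem_region_remove,
	.driver = {
		.name = "of_pmem",
		.of_match_table = of_pmem_region_match,
	},
};
module_platform_driver(of_pmem_region_driver);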

[PATCH v3 1/4] libnvdimm: Add of_node to region and bus descriptors

2018-04-05 Thread Oliver O'Halloran
We want to be able to cross-reference the region and bus devices
with the device tree node that they were spawned from. libNVDIMM
handles creating the actual devices for these internally, so we
need to pass in a pointer to the relevant node in the descriptor.

Signed-off-by: Oliver O'Halloran 
Acked-by: Dan Williams 
---
 drivers/nvdimm/bus.c | 1 +
 drivers/nvdimm/region_devs.c | 1 +
 include/linux/libnvdimm.h| 3 +++
 3 files changed, 5 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 78eabc3a1ab1..c6106914f396 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -358,6 +358,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
nvdimm_bus->dev.release = nvdimm_bus_release;
nvdimm_bus->dev.groups = nd_desc->attr_groups;
nvdimm_bus->dev.bus = &nvdimm_bus_type;
+   nvdimm_bus->dev.of_node = nd_desc->of_node;
dev_set_name(&nvdimm_bus->dev, "ndbus%d", nvdimm_bus->id);
rc = device_register(&nvdimm_bus->dev);
if (rc) {
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 1593e1806b16..30d5dc8b9bb2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1014,6 +1014,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
dev->parent = &nvdimm_bus->dev;
dev->type = dev_type;
dev->groups = ndr_desc->attr_groups;
+   dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
nd_device_register(dev);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index ff855ed965fb..f61cb5050297 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -76,12 +76,14 @@ typedef int (*ndctl_fn)(struct nvdimm_bus_descriptor *nd_desc,
struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc);
 
+struct device_node;
 struct nvdimm_bus_descriptor {
const struct attribute_group **attr_groups;
unsigned long bus_dsm_mask;
unsigned long cmd_mask;
struct module *module;
char *provider_name;
+   struct device_node *of_node;
ndctl_fn ndctl;
int (*flush_probe)(struct nvdimm_bus_descriptor *nd_desc);
int (*clear_to_send)(struct nvdimm_bus_descriptor *nd_desc,
@@ -123,6 +125,7 @@ struct nd_region_desc {
int num_lanes;
int numa_node;
unsigned long flags;
+   struct device_node *of_node;
 };
 
 struct device;
-- 
2.9.5



Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Benjamin Herrenschmidt
On Thu, 2018-04-05 at 19:25 -0700, Dan Williams wrote:
> > > Please also include my niggly nit picky trivial annoying bike shed
> > > color for the driver name to *not* use the "nd_region" suffix for a
> > > driver registering "nvdimm_bus" objects. "of_pmem_range" or
> > > "of_pmem_bus" or almost anything else would be fine.
> > 
> > Oh sure, would using of_pmem_region to match the compatible be ok?
> 
> That works for me.

The prefix "of" is not generally used in matching properties,...

my own pot of paint :)

Cheers,
Ben.



Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Anshuman Khandual
On 04/06/2018 02:48 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
>>> In this specific case, because that would make qemu expect an iommu,
>>> and there isn't one.
>>
>>
>> I think that you can set iommu_platform in qemu without an iommu.
> 
> No I mean the platform has one but it's not desirable for it to be used
> due to the performance hit.

Also, the only requirement is to bounce the I/O buffers through SWIOTLB,
which is implemented behind the DMA API that the virtio core already
understands. There is no need for an IOMMU to be involved in the device
representation in this case IMHO.
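
A rough shape of what the RFC seems to describe, with the arch hook name
assumed (the quirk test is the existing one in vring_use_dma_api()):

/* hypothetical arch hook; a weak default would return false */
bool arch_virtio_wants_dma_ops(struct virtio_device *vdev);

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (!virtio_has_iommu_quirk(vdev))
		return true;

	/*
	 * Let the platform opt in to the DMA API (and hence SWIOTLB
	 * bouncing) even though no IOMMU is exposed for the device.
	 */
	if (arch_virtio_wants_dma_ops(vdev))
		return true;

	return false;
}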



Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Dan Williams
On Thu, Apr 5, 2018 at 7:14 PM, Oliver  wrote:
>
> On Fri, Apr 6, 2018 at 12:43 AM, Dan Williams  
> wrote:
> > On Thu, Apr 5, 2018 at 5:43 AM, Oliver  wrote:
> >> On Thu, Apr 5, 2018 at 10:11 PM, Michael Ellerman  
> >> wrote:
> >>> Oliver  writes:
> >>> ...
> 
>  For context Balbir is working with me on some of the pmem stuff. You
>  probably want an Ack from Rob rather than one of us.
> >>>
> >>> I'll ack it if you make all the niggly nit picky trivial annoying
> >>> changes I asked for :D
> >>
> >> *groan*
> >>
> >> Fine, I'll respin it tomorrow. If anyone else has comments now would
> >> be the time to make them.
> >
> > Please also include my niggly nit picky trivial annoying bike shed
> > color for the driver name to *not* use the "nd_region" suffix for a
> > driver registering "nvdimm_bus" objects. "of_pmem_range" or
> > "of_pmem_bus" or almost anything else would be fine.
>
> Oh sure, would using of_pmem_region to match the compatible be ok?

That works for me.


Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Oliver
On Fri, Apr 6, 2018 at 12:43 AM, Dan Williams  wrote:
> On Thu, Apr 5, 2018 at 5:43 AM, Oliver  wrote:
>> On Thu, Apr 5, 2018 at 10:11 PM, Michael Ellerman  
>> wrote:
>>> Oliver  writes:
>>> ...

 For context Balbir is working with me on some of the pmem stuff. You
 probably want an Ack from Rob rather than one of us.
>>>
>>> I'll ack it if you make all the niggly nit picky trivial annoying
>>> changes I asked for :D
>>
>> *groan*
>>
>> Fine, I'll respin it tomorrow. If anyone else has comments now would
>> be the time to make them.
>
> Please also include my niggly nit picky trivial annoying bike shed
> color for the driver name to *not* use the "nd_region" suffix for a
> driver registering "nvdimm_bus" objects. "of_pmem_range" or
> "of_pmem_bus" or almost anything else would be fine.

Oh sure, would using of_pmem_region to match the compatible be ok?


Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Nicholas Piggin
On Thu, 05 Apr 2018 16:40:26 -0400
Jeff Moyer  wrote:

> Nicholas Piggin  writes:
> 
> > On Thu, 5 Apr 2018 15:53:07 +1000
> > Balbir Singh  wrote:  
> >> I'm thinking about it. I wonder what "bytes remaining" means for pmem
> >> in the context of a machine check exception. Also, do we want to be byte
> >> accurate or cache-line accurate for the bytes remaining? The former is much
> >> easier than the latter :)  
> >
> > The ideal would be a linear measure of how much of your copy reached
> > (or can reach) non-volatile storage with nothing further copied. You
> > may have to allow for some relaxing of the semantics depending on
> > what the architecture can support.  
> 
> I think you've got that backwards.  memcpy_mcsafe is used to copy *from*
> persistent memory.  The idea is to catch errors when reading pmem, not
> writing to it.
> 
> > What's the problem with just counting bytes copied like usercopy --
> > why is that harder than cacheline accuracy?  
> 
> He said the former (i.e. bytes) is easier.  So, I think you're on the
> same page.  :)

Oh well that makes a lot more sense in my mind now, thanks :)


[PATCH 5/5] powerpc: Remove core support for Marvell mv64x60 hostbridges

2018-04-05 Thread Mark Greer
There are no longer any platforms that use Marvell's mv64x60
hostbridges so remove the supporting kernel code.

CC: Dale Farnsworth 
Signed-off-by: Mark Greer 
---
 Documentation/devicetree/bindings/marvell.txt | 516 -
 arch/powerpc/sysdev/Makefile  |   3 -
 arch/powerpc/sysdev/mv64x60.h |  13 -
 arch/powerpc/sysdev/mv64x60_dev.c | 535 --
 arch/powerpc/sysdev/mv64x60_pci.c | 171 
 arch/powerpc/sysdev/mv64x60_pic.c | 297 --
 arch/powerpc/sysdev/mv64x60_udbg.c| 152 
 7 files changed, 1687 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/marvell.txt
 delete mode 100644 arch/powerpc/sysdev/mv64x60.h
 delete mode 100644 arch/powerpc/sysdev/mv64x60_dev.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_pci.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_pic.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_udbg.c

diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt
deleted file mode 100644
index 7f722316458a..
--- a/Documentation/devicetree/bindings/marvell.txt
+++ /dev/null
@@ -1,516 +0,0 @@
-Marvell Discovery mv64[345]6x System Controller chips
-===
-
-The Marvell mv64[345]60 series of system controller chips contain
-many of the peripherals needed to implement a complete computer
-system.  In this section, we define device tree nodes to describe
-the system controller chip itself and each of the peripherals
-which it contains.  Compatible string values for each node are
-prefixed with the string "marvell,", for Marvell Technology Group Ltd.
-
-1) The /system-controller node
-
-  This node is used to represent the system-controller and must be
-  present when the system uses a system controller chip. The top-level
-  system-controller node contains information that is global to all
-  devices within the system controller chip. The node name begins
-  with "system-controller" followed by the unit address, which is
-  the base address of the memory-mapped register set for the system
-  controller chip.
-
-  Required properties:
-
-- ranges : Describes the translation of system controller addresses
-  for memory mapped registers.
-- clock-frequency: Contains the main clock frequency for the system
-  controller chip.
-- reg : This property defines the address and size of the
-  memory-mapped registers contained within the system controller
-  chip.  The address specified in the "reg" property should match
-  the unit address of the system-controller node.
-- #address-cells : Address representation for system controller
-  devices.  This field represents the number of cells needed to
-  represent the address of the memory-mapped registers of devices
-  within the system controller chip.
-- #size-cells : Size representation for the memory-mapped
-  registers within the system controller chip.
-- #interrupt-cells : Defines the width of cells used to represent
-  interrupts.
-
-  Optional properties:
-
-- model : The specific model of the system controller chip.  Such
-  as, "mv64360", "mv64460", or "mv64560".
-- compatible : A string identifying the compatibility identifiers
-  of the system controller chip.
-
-  The system-controller node contains child nodes for each system
-  controller device that the platform uses.  Nodes should not be created
-  for devices which exist on the system controller chip but are not used
-
-  Example Marvell Discovery mv64360 system-controller node:
-
-system-controller@f100 { /* Marvell Discovery mv64360 */
-   #address-cells = <1>;
-   #size-cells = <1>;
-   model = "mv64360";  /* Default */
-   compatible = "marvell,mv64360";
-   clock-frequency = <1>;
-   reg = <0xf100 0x1>;
-   virtual-reg = <0xf100>;
-   ranges = <0x8800 0x8800 0x100 /* PCI 0 I/O Space */
-   0x8000 0x8000 0x800 /* PCI 0 MEM Space */
-   0xa000 0xa000 0x400 /* User FLASH */
-   0x 0xf100 0x001 /* Bridge's regs */
-   0xf200 0xf200 0x004>;/* Integrated SRAM */
-
-   [ child node definitions... ]
-}
-
-2) Child nodes of /system-controller
-
-   a) Marvell Discovery MDIO bus
-
-   The MDIO is a bus to which the PHY devices are connected.  For each
-   device that exists on this bus, a child node should be created.  See
-   the definition of the PHY node below for an example of how to define
-   a PHY.
-
-   Required properties:
- - #address-cells : Should be <1>
- - #size-cells : Should be <0>
- - compatible : Should be 

[PATCH 0/5] powerpc: Remove support for Marvell mv64x60 hostbridges

2018-04-05 Thread Mark Greer
Hello.

As far as I can tell, the c2k platform is abandoned so it should be
removed.  Once it is removed, there are no more platforms that use
the mv64x60 hostbridge so remove that code too (and related drivers).

If and when this series of patches is accepted, I will submit patches
to the appropriate maintainers to remove drivers/tty/serial/mpsc.c and
drivers/watchdog/mv64x60_wdt.c.  The i2c and ethernet drivers are used
by ARM SoCs/platforms so they will be left alone.

Based on git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux,
'merge' branch which is currently 6c5c24035003 (Automatic merge of
branches 'master', 'next' and 'fixes' into merge)

Thanks,

Mark
--

Mark Greer (5):
  powerpc/embedded6xx: Remove C2K board support
  powerpc/boot: Remove support for Marvell MPSC serial controller
  powerpc/boot: Remove support for Marvell mv64x60 i2c controller
  powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
  powerpc: Remove core support for Marvell mv64x60 hostbridges

 Documentation/devicetree/bindings/marvell.txt | 516 ---
 arch/powerpc/boot/Makefile|   7 +-
 arch/powerpc/boot/cuboot-c2k.c| 189 -
 arch/powerpc/boot/dts/c2k.dts | 366 
 arch/powerpc/boot/mpsc.c  | 169 
 arch/powerpc/boot/mv64x60.c   | 581 --
 arch/powerpc/boot/mv64x60.h   |  70 
 arch/powerpc/boot/mv64x60_i2c.c   | 204 -
 arch/powerpc/boot/ops.h   |   1 -
 arch/powerpc/boot/serial.c|   4 -
 arch/powerpc/configs/c2k_defconfig| 390 -
 arch/powerpc/platforms/embedded6xx/Kconfig|  10 -
 arch/powerpc/platforms/embedded6xx/Makefile   |   1 -
 arch/powerpc/platforms/embedded6xx/c2k.c  | 148 ---
 arch/powerpc/sysdev/Makefile  |   3 -
 arch/powerpc/sysdev/mv64x60.h |  13 -
 arch/powerpc/sysdev/mv64x60_dev.c | 535 
 arch/powerpc/sysdev/mv64x60_pci.c | 171 
 arch/powerpc/sysdev/mv64x60_pic.c | 297 -
 arch/powerpc/sysdev/mv64x60_udbg.c| 152 ---
 20 files changed, 3 insertions(+), 3824 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/marvell.txt
 delete mode 100644 arch/powerpc/boot/cuboot-c2k.c
 delete mode 100644 arch/powerpc/boot/dts/c2k.dts
 delete mode 100644 arch/powerpc/boot/mpsc.c
 delete mode 100644 arch/powerpc/boot/mv64x60.c
 delete mode 100644 arch/powerpc/boot/mv64x60.h
 delete mode 100644 arch/powerpc/boot/mv64x60_i2c.c
 delete mode 100644 arch/powerpc/configs/c2k_defconfig
 delete mode 100644 arch/powerpc/platforms/embedded6xx/c2k.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60.h
 delete mode 100644 arch/powerpc/sysdev/mv64x60_dev.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_pci.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_pic.c
 delete mode 100644 arch/powerpc/sysdev/mv64x60_udbg.c

-- 
2.16.2



[PATCH 1/5] powerpc/embedded6xx: Remove C2K board support

2018-04-05 Thread Mark Greer
The C2K platform appears to be orphaned, so remove the code supporting it.

CC: Remi Machet 
Signed-off-by: Mark Greer 
---
 arch/powerpc/boot/Makefile  |   5 +-
 arch/powerpc/boot/cuboot-c2k.c  | 189 --
 arch/powerpc/boot/dts/c2k.dts   | 366 --
 arch/powerpc/configs/c2k_defconfig  | 390 
 arch/powerpc/platforms/embedded6xx/Kconfig  |  10 -
 arch/powerpc/platforms/embedded6xx/Makefile |   1 -
 arch/powerpc/platforms/embedded6xx/c2k.c| 148 ---
 7 files changed, 2 insertions(+), 1107 deletions(-)
 delete mode 100644 arch/powerpc/boot/cuboot-c2k.c
 delete mode 100644 arch/powerpc/boot/dts/c2k.dts
 delete mode 100644 arch/powerpc/configs/c2k_defconfig
 delete mode 100644 arch/powerpc/platforms/embedded6xx/c2k.c

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 26d5d2a5b8e9..70bf9b409fae 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -143,8 +143,8 @@ src-plat-$(CONFIG_PPC_82xx) += cuboot-pq2.c fixed-head.S ep8248e.c cuboot-824x.c
 src-plat-$(CONFIG_PPC_83xx) += cuboot-83xx.c fixed-head.S redboot-83xx.c
 src-plat-$(CONFIG_FSL_SOC_BOOKE) += cuboot-85xx.c cuboot-85xx-cpm2.c
 src-plat-$(CONFIG_EMBEDDED6xx) += cuboot-pq2.c cuboot-mpc7448hpc2.c \
-   cuboot-c2k.c gamecube-head.S \
-   gamecube.c wii-head.S wii.c holly.c \
+   gamecube-head.S gamecube.c \
+   wii-head.S wii.c holly.c \
fixed-head.S mvme5100.c
 src-plat-$(CONFIG_AMIGAONE) += cuboot-amigaone.c
 src-plat-$(CONFIG_PPC_PS3) += ps3-head.S ps3-hvcall.S ps3.c
@@ -339,7 +339,6 @@ image-$(CONFIG_MVME7100)+= dtbImage.mvme7100
 # Board ports in arch/powerpc/platform/embedded6xx/Kconfig
 image-$(CONFIG_STORCENTER) += cuImage.storcenter
 image-$(CONFIG_MPC7448HPC2)+= cuImage.mpc7448hpc2
-image-$(CONFIG_PPC_C2K)+= cuImage.c2k
 image-$(CONFIG_GAMECUBE)   += dtbImage.gamecube
 image-$(CONFIG_WII)+= dtbImage.wii
 image-$(CONFIG_MVME5100)   += dtbImage.mvme5100
diff --git a/arch/powerpc/boot/cuboot-c2k.c b/arch/powerpc/boot/cuboot-c2k.c
deleted file mode 100644
index 9309c51f1d65..
--- a/arch/powerpc/boot/cuboot-c2k.c
+++ /dev/null
@@ -1,189 +0,0 @@
-/*
- * GEFanuc C2K platform code.
- *
- * Author: Remi Machet 
- *
- * Originated from prpmc2800.c
- *
- * 2008 (c) Stanford University
- * 2007 (c) MontaVista, Software, Inc.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
- */
-
-#include "types.h"
-#include "stdio.h"
-#include "io.h"
-#include "ops.h"
-#include "elf.h"
-#include "mv64x60.h"
-#include "cuboot.h"
-#include "ppcboot.h"
-
-static u8 *bridge_base;
-
-static void c2k_bridge_setup(u32 mem_size)
-{
-   u32 i, v[30], enables, acc_bits;
-   u32 pci_base_hi, pci_base_lo, size, buf[2];
-   unsigned long cpu_base;
-   int rc;
-   void *devp, *mv64x60_devp;
-   u8 *bridge_pbase, is_coherent;
-   struct mv64x60_cpu2pci_win *tbl;
-   int bus;
-
-   bridge_pbase = mv64x60_get_bridge_pbase();
-   is_coherent = mv64x60_is_coherent();
-
-   if (is_coherent)
-   acc_bits = MV64x60_PCI_ACC_CNTL_SNOOP_WB
-   | MV64x60_PCI_ACC_CNTL_SWAP_NONE
-   | MV64x60_PCI_ACC_CNTL_MBURST_32_BYTES
-   | MV64x60_PCI_ACC_CNTL_RDSIZE_32_BYTES;
-   else
-   acc_bits = MV64x60_PCI_ACC_CNTL_SNOOP_NONE
-   | MV64x60_PCI_ACC_CNTL_SWAP_NONE
-   | MV64x60_PCI_ACC_CNTL_MBURST_128_BYTES
-   | MV64x60_PCI_ACC_CNTL_RDSIZE_256_BYTES;
-
-   mv64x60_config_ctlr_windows(bridge_base, bridge_pbase, is_coherent);
-   mv64x60_devp = find_node_by_compatible(NULL, "marvell,mv64360");
-   if (mv64x60_devp == NULL)
-   fatal("Error: Missing marvell,mv64360 device tree node\n\r");
-
-   enables = in_le32((u32 *)(bridge_base + MV64x60_CPU_BAR_ENABLE));
-   enables |= 0x007ffe00; /* Disable all cpu->pci windows */
-   out_le32((u32 *)(bridge_base + MV64x60_CPU_BAR_ENABLE), enables);
-
-   /* Get the cpu -> pci i/o & mem mappings from the device tree */
-   devp = NULL;
-   for (bus = 0; ; bus++) {
-   char name[] = "pci ";
-
-   name[strlen(name)-1] = bus+'0';
-
-   devp = find_node_by_alias(name);
-   if (devp == NULL)
-   break;
-
-   if (bus >= 2)
-   fatal("Error: Only 2 PCI controllers are 

[PATCH 2/5] powerpc/boot: Remove support for Marvell MPSC serial controller

2018-04-05 Thread Mark Greer
There are no longer any platforms that use Marvell's MPSC serial
controller so remove its driver.

Signed-off-by: Mark Greer 
---
 arch/powerpc/boot/Makefile |   2 +-
 arch/powerpc/boot/mpsc.c   | 169 -
 arch/powerpc/boot/ops.h|   1 -
 arch/powerpc/boot/serial.c |   4 --
 4 files changed, 1 insertion(+), 175 deletions(-)
 delete mode 100644 arch/powerpc/boot/mpsc.c

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 70bf9b409fae..58f2dbfba275 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -120,7 +120,7 @@ src-wlib-$(CONFIG_40x) += 4xx.c planetcore.c
 src-wlib-$(CONFIG_44x) += 4xx.c ebony.c bamboo.c
 src-wlib-$(CONFIG_PPC_8xx) += mpc8xx.c planetcore.c fsl-soc.c
 src-wlib-$(CONFIG_PPC_82xx) += pq2.c fsl-soc.c planetcore.c
-src-wlib-$(CONFIG_EMBEDDED6xx) += mpsc.c mv64x60.c mv64x60_i2c.c ugecon.c fsl-soc.c
+src-wlib-$(CONFIG_EMBEDDED6xx) += mv64x60.c mv64x60_i2c.c ugecon.c fsl-soc.c
 src-wlib-$(CONFIG_XILINX_VIRTEX) += uartlite.c
 src-wlib-$(CONFIG_CPM) += cpm-serial.c
 
diff --git a/arch/powerpc/boot/mpsc.c b/arch/powerpc/boot/mpsc.c
deleted file mode 100644
index 425ad88cce8d..
--- a/arch/powerpc/boot/mpsc.c
+++ /dev/null
@@ -1,169 +0,0 @@
-/*
- * MPSC/UART driver for the Marvell mv64360, mv64460, ...
- *
- * Author: Mark A. Greer 
- *
- * 2007 (c) MontaVista Software, Inc. This file is licensed under
- * the terms of the GNU General Public License version 2. This program
- * is licensed "as is" without any warranty of any kind, whether express
- * or implied.
- */
-
-#include 
-#include 
-#include "types.h"
-#include "string.h"
-#include "stdio.h"
-#include "io.h"
-#include "ops.h"
-
-
-#define MPSC_CHR_1 0x000c
-
-#define MPSC_CHR_2 0x0010
-#define MPSC_CHR_2_TA  (1<<7)
-#define MPSC_CHR_2_TCS (1<<9)
-#define MPSC_CHR_2_RA  (1<<23)
-#define MPSC_CHR_2_CRD (1<<25)
-#define MPSC_CHR_2_EH  (1<<31)
-
-#define MPSC_CHR_4 0x0018
-#define MPSC_CHR_4_Z   (1<<29)
-
-#define MPSC_CHR_5 0x001c
-#define MPSC_CHR_5_CTL1_INTR   (1<<12)
-#define MPSC_CHR_5_CTL1_VALID  (1<<15)
-
-#define MPSC_CHR_100x0030
-
-#define MPSC_INTR_CAUSE0x
-#define MPSC_INTR_CAUSE_RCC(1<<6)
-#define MPSC_INTR_MASK 0x0080
-
-#define SDMA_SDCM  0x0008
-#define SDMA_SDCM_AR   (1<<15)
-#define SDMA_SDCM_AT   (1<<31)
-
-static volatile char *mpsc_base;
-static volatile char *mpscintr_base;
-static u32 chr1, chr2;
-
-static int mpsc_open(void)
-{
-   chr1 = in_le32((u32 *)(mpsc_base + MPSC_CHR_1)) & 0x00ff;
-   chr2 = in_le32((u32 *)(mpsc_base + MPSC_CHR_2)) & ~(MPSC_CHR_2_TA
-   | MPSC_CHR_2_TCS | MPSC_CHR_2_RA | MPSC_CHR_2_CRD
-   | MPSC_CHR_2_EH);
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_4), MPSC_CHR_4_Z);
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_5),
-   MPSC_CHR_5_CTL1_INTR | MPSC_CHR_5_CTL1_VALID);
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_2), chr2 | MPSC_CHR_2_EH);
-   return 0;
-}
-
-static void mpsc_putc(unsigned char c)
-{
-   while (in_le32((u32 *)(mpsc_base + MPSC_CHR_2)) & MPSC_CHR_2_TCS);
-
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_1), chr1 | c);
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_2), chr2 | MPSC_CHR_2_TCS);
-}
-
-static unsigned char mpsc_getc(void)
-{
-   u32 cause = 0;
-   unsigned char c;
-
-   while (!(cause & MPSC_INTR_CAUSE_RCC))
-   cause = in_le32((u32 *)(mpscintr_base + MPSC_INTR_CAUSE));
-
-   c = in_8((u8 *)(mpsc_base + MPSC_CHR_10 + 2));
-   out_8((u8 *)(mpsc_base + MPSC_CHR_10 + 2), c);
-   out_le32((u32 *)(mpscintr_base + MPSC_INTR_CAUSE),
-   cause & ~MPSC_INTR_CAUSE_RCC);
-
-   return c;
-}
-
-static u8 mpsc_tstc(void)
-{
-   return (u8)((in_le32((u32 *)(mpscintr_base + MPSC_INTR_CAUSE))
-   & MPSC_INTR_CAUSE_RCC) != 0);
-}
-
-static void mpsc_stop_dma(volatile char *sdma_base)
-{
-   out_le32((u32 *)(mpsc_base + MPSC_CHR_2),MPSC_CHR_2_TA | MPSC_CHR_2_RA);
-   out_le32((u32 *)(sdma_base + SDMA_SDCM), SDMA_SDCM_AR | SDMA_SDCM_AT);
-
-   while ((in_le32((u32 *)(sdma_base + SDMA_SDCM))
-   & (SDMA_SDCM_AR | SDMA_SDCM_AT)) != 0)
-   udelay(100);
-}
-
-static volatile char *mpsc_get_virtreg_of_phandle(void *devp, char *prop)
-{
-   void *v;
-   int n;
-
-   n = getprop(devp, prop, &v, sizeof(v));
-   if (n != sizeof(v))
-   goto err_out;
-
-   devp = find_node_by_linuxphandle((u32)v);
-   if (devp == NULL)
-   goto err_out;
-
-   n = getprop(devp, "virtual-reg", , sizeof(v));
-   if (n == sizeof(v))
-   return v;
-
-err_out:
-   return NULL;
-}
-
-int mpsc_console_init(void *devp, struct 

[PATCH 4/5] powerpc/boot: Remove core support for Marvell mv64x60 hostbridges

2018-04-05 Thread Mark Greer
There are no longer any platforms that use Marvell's mv64x60
hostbridges so remove the supporting boot code.

Signed-off-by: Mark Greer 
---
 arch/powerpc/boot/Makefile  |   2 +-
 arch/powerpc/boot/mv64x60.c | 581 
 arch/powerpc/boot/mv64x60.h |  70 --
 3 files changed, 1 insertion(+), 652 deletions(-)
 delete mode 100644 arch/powerpc/boot/mv64x60.c
 delete mode 100644 arch/powerpc/boot/mv64x60.h

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index bf6a46055ba7..fa16626849f4 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -120,7 +120,7 @@ src-wlib-$(CONFIG_40x) += 4xx.c planetcore.c
 src-wlib-$(CONFIG_44x) += 4xx.c ebony.c bamboo.c
 src-wlib-$(CONFIG_PPC_8xx) += mpc8xx.c planetcore.c fsl-soc.c
 src-wlib-$(CONFIG_PPC_82xx) += pq2.c fsl-soc.c planetcore.c
-src-wlib-$(CONFIG_EMBEDDED6xx) += mv64x60.c ugecon.c fsl-soc.c
+src-wlib-$(CONFIG_EMBEDDED6xx) += ugecon.c fsl-soc.c
 src-wlib-$(CONFIG_XILINX_VIRTEX) += uartlite.c
 src-wlib-$(CONFIG_CPM) += cpm-serial.c
 
diff --git a/arch/powerpc/boot/mv64x60.c b/arch/powerpc/boot/mv64x60.c
deleted file mode 100644
index d9bb302b91d2..
--- a/arch/powerpc/boot/mv64x60.c
+++ /dev/null
@@ -1,581 +0,0 @@
-/*
- * Marvell hostbridge routines
- *
- * Author: Mark A. Greer 
- *
- * 2004, 2005, 2007 (c) MontaVista Software, Inc. This file is licensed under
- * the terms of the GNU General Public License version 2. This program
- * is licensed "as is" without any warranty of any kind, whether express
- * or implied.
- */
-
-#include 
-#include 
-#include "types.h"
-#include "elf.h"
-#include "page.h"
-#include "string.h"
-#include "stdio.h"
-#include "io.h"
-#include "ops.h"
-#include "mv64x60.h"
-
-#define PCI_DEVFN(slot,func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
-
-#define MV64x60_CPU2MEM_WINDOWS4
-#define MV64x60_CPU2MEM_0_BASE 0x0008
-#define MV64x60_CPU2MEM_0_SIZE 0x0010
-#define MV64x60_CPU2MEM_1_BASE 0x0208
-#define MV64x60_CPU2MEM_1_SIZE 0x0210
-#define MV64x60_CPU2MEM_2_BASE 0x0018
-#define MV64x60_CPU2MEM_2_SIZE 0x0020
-#define MV64x60_CPU2MEM_3_BASE 0x0218
-#define MV64x60_CPU2MEM_3_SIZE 0x0220
-
-#define MV64x60_ENET2MEM_BAR_ENABLE0x2290
-#define MV64x60_ENET2MEM_0_BASE0x2200
-#define MV64x60_ENET2MEM_0_SIZE0x2204
-#define MV64x60_ENET2MEM_1_BASE0x2208
-#define MV64x60_ENET2MEM_1_SIZE0x220c
-#define MV64x60_ENET2MEM_2_BASE0x2210
-#define MV64x60_ENET2MEM_2_SIZE0x2214
-#define MV64x60_ENET2MEM_3_BASE0x2218
-#define MV64x60_ENET2MEM_3_SIZE0x221c
-#define MV64x60_ENET2MEM_4_BASE0x2220
-#define MV64x60_ENET2MEM_4_SIZE0x2224
-#define MV64x60_ENET2MEM_5_BASE0x2228
-#define MV64x60_ENET2MEM_5_SIZE0x222c
-#define MV64x60_ENET2MEM_ACC_PROT_00x2294
-#define MV64x60_ENET2MEM_ACC_PROT_10x2298
-#define MV64x60_ENET2MEM_ACC_PROT_20x229c
-
-#define MV64x60_MPSC2MEM_BAR_ENABLE0xf250
-#define MV64x60_MPSC2MEM_0_BASE0xf200
-#define MV64x60_MPSC2MEM_0_SIZE0xf204
-#define MV64x60_MPSC2MEM_1_BASE0xf208
-#define MV64x60_MPSC2MEM_1_SIZE0xf20c
-#define MV64x60_MPSC2MEM_2_BASE0xf210
-#define MV64x60_MPSC2MEM_2_SIZE0xf214
-#define MV64x60_MPSC2MEM_3_BASE0xf218
-#define MV64x60_MPSC2MEM_3_SIZE0xf21c
-#define MV64x60_MPSC_0_REMAP   0xf240
-#define MV64x60_MPSC_1_REMAP   0xf244
-#define MV64x60_MPSC2MEM_ACC_PROT_00xf254
-#define MV64x60_MPSC2MEM_ACC_PROT_10xf258
-#define MV64x60_MPSC2REGS_BASE 0xf25c
-
-#define MV64x60_IDMA2MEM_BAR_ENABLE0x0a80
-#define MV64x60_IDMA2MEM_0_BASE0x0a00
-#define MV64x60_IDMA2MEM_0_SIZE0x0a04
-#define MV64x60_IDMA2MEM_1_BASE0x0a08
-#define MV64x60_IDMA2MEM_1_SIZE0x0a0c
-#define MV64x60_IDMA2MEM_2_BASE0x0a10
-#define MV64x60_IDMA2MEM_2_SIZE0x0a14
-#define MV64x60_IDMA2MEM_3_BASE0x0a18
-#define MV64x60_IDMA2MEM_3_SIZE0x0a1c
-#define MV64x60_IDMA2MEM_4_BASE0x0a20
-#define MV64x60_IDMA2MEM_4_SIZE0x0a24
-#define MV64x60_IDMA2MEM_5_BASE0x0a28
-#define MV64x60_IDMA2MEM_5_SIZE 

[PATCH 3/5] powerpc/boot: Remove support for Marvell mv64x60 i2c controller

2018-04-05 Thread Mark Greer
There are no longer any platforms that use Marvell's mv64x60's i2c
controller so remove its driver.

Signed-off-by: Mark Greer 
---
 arch/powerpc/boot/Makefile  |   2 +-
 arch/powerpc/boot/mv64x60_i2c.c | 204 
 2 files changed, 1 insertion(+), 205 deletions(-)
 delete mode 100644 arch/powerpc/boot/mv64x60_i2c.c

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 58f2dbfba275..bf6a46055ba7 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -120,7 +120,7 @@ src-wlib-$(CONFIG_40x) += 4xx.c planetcore.c
 src-wlib-$(CONFIG_44x) += 4xx.c ebony.c bamboo.c
 src-wlib-$(CONFIG_PPC_8xx) += mpc8xx.c planetcore.c fsl-soc.c
 src-wlib-$(CONFIG_PPC_82xx) += pq2.c fsl-soc.c planetcore.c
-src-wlib-$(CONFIG_EMBEDDED6xx) += mv64x60.c mv64x60_i2c.c ugecon.c fsl-soc.c
+src-wlib-$(CONFIG_EMBEDDED6xx) += mv64x60.c ugecon.c fsl-soc.c
 src-wlib-$(CONFIG_XILINX_VIRTEX) += uartlite.c
 src-wlib-$(CONFIG_CPM) += cpm-serial.c
 
diff --git a/arch/powerpc/boot/mv64x60_i2c.c b/arch/powerpc/boot/mv64x60_i2c.c
deleted file mode 100644
index 52a3212b6638..
--- a/arch/powerpc/boot/mv64x60_i2c.c
+++ /dev/null
@@ -1,204 +0,0 @@
-/*
- * Bootloader version of the i2c driver for the MV64x60.
- *
- * Author: Dale Farnsworth 
- * Maintained by: Mark A. Greer 
- *
- * 2003, 2007 (c) MontaVista, Software, Inc.  This file is licensed under
- * the terms of the GNU General Public License version 2.  This program is
- * licensed "as is" without any warranty of any kind, whether express or
- * implied.
- */
-
-#include 
-#include 
-#include "types.h"
-#include "elf.h"
-#include "page.h"
-#include "string.h"
-#include "stdio.h"
-#include "io.h"
-#include "ops.h"
-#include "mv64x60.h"
-
-/* Register defines */
-#define MV64x60_I2C_REG_SLAVE_ADDR 0x00
-#define MV64x60_I2C_REG_DATA   0x04
-#define MV64x60_I2C_REG_CONTROL0x08
-#define MV64x60_I2C_REG_STATUS 0x0c
-#define MV64x60_I2C_REG_BAUD   0x0c
-#define MV64x60_I2C_REG_EXT_SLAVE_ADDR 0x10
-#define MV64x60_I2C_REG_SOFT_RESET 0x1c
-
-#define MV64x60_I2C_CONTROL_ACK0x04
-#define MV64x60_I2C_CONTROL_IFLG   0x08
-#define MV64x60_I2C_CONTROL_STOP   0x10
-#define MV64x60_I2C_CONTROL_START  0x20
-#define MV64x60_I2C_CONTROL_TWSIEN 0x40
-#define MV64x60_I2C_CONTROL_INTEN  0x80
-
-#define MV64x60_I2C_STATUS_BUS_ERR 0x00
-#define MV64x60_I2C_STATUS_MAST_START  0x08
-#define MV64x60_I2C_STATUS_MAST_REPEAT_START   0x10
-#define MV64x60_I2C_STATUS_MAST_WR_ADDR_ACK0x18
-#define MV64x60_I2C_STATUS_MAST_WR_ADDR_NO_ACK 0x20
-#define MV64x60_I2C_STATUS_MAST_WR_ACK 0x28
-#define MV64x60_I2C_STATUS_MAST_WR_NO_ACK  0x30
-#define MV64x60_I2C_STATUS_MAST_LOST_ARB   0x38
-#define MV64x60_I2C_STATUS_MAST_RD_ADDR_ACK0x40
-#define MV64x60_I2C_STATUS_MAST_RD_ADDR_NO_ACK 0x48
-#define MV64x60_I2C_STATUS_MAST_RD_DATA_ACK0x50
-#define MV64x60_I2C_STATUS_MAST_RD_DATA_NO_ACK 0x58
-#define MV64x60_I2C_STATUS_MAST_WR_ADDR_2_ACK  0xd0
-#define MV64x60_I2C_STATUS_MAST_WR_ADDR_2_NO_ACK   0xd8
-#define MV64x60_I2C_STATUS_MAST_RD_ADDR_2_ACK  0xe0
-#define MV64x60_I2C_STATUS_MAST_RD_ADDR_2_NO_ACK   0xe8
-#define MV64x60_I2C_STATUS_NO_STATUS   0xf8
-
-static u8 *ctlr_base;
-
-static int mv64x60_i2c_wait_for_status(int wanted)
-{
-   int i;
-   int status;
-
-   for (i=0; i<1000; i++) {
-   udelay(10);
-   status = in_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_STATUS))
-   & 0xff;
-   if (status == wanted)
-   return status;
-   }
-   return -status;
-}
-
-static int mv64x60_i2c_control(int control, int status)
-{
-   out_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_CONTROL), control & 0xff);
-   return mv64x60_i2c_wait_for_status(status);
-}
-
-static int mv64x60_i2c_read_byte(int control, int status)
-{
-   out_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_CONTROL), control & 0xff);
-   if (mv64x60_i2c_wait_for_status(status) < 0)
-   return -1;
-   return in_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_DATA)) & 0xff;
-}
-
-static int mv64x60_i2c_write_byte(int data, int control, int status)
-{
-   out_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_DATA), data & 0xff);
-   out_le32((u32 *)(ctlr_base + MV64x60_I2C_REG_CONTROL), control & 0xff);
-   return mv64x60_i2c_wait_for_status(status);
-}
-
-int mv64x60_i2c_read(u32 devaddr, u8 *buf, u32 offset, u32 offset_size,
-u32 count)

Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Benjamin Herrenschmidt
On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
> > In this specific case, because that would make qemu expect an iommu,
> > and there isn't one.
> 
> 
> I think that you can set iommu_platform in qemu without an iommu.

No I mean the platform has one but it's not desirable for it to be used
due to the performance hit.

Cheers,
Ben.
> 
> > Anshuman, you need to provide more background here. I don't have time
> > right now it's late, but explain about the fact that this is for a
> > specific type of secure VM which has only a limited pool of (insecure)
> > memory that can be shared with qemu, so all IOs need to bounce via that
> > pool, which can be achieved by using swiotlb.
> > 
> > Note: this isn't urgent, we can discuss alternative approaches, this is
> > just to start the conversation.
> > 
> > Cheers,
> > Ben.


Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Jeff Moyer
Nicholas Piggin  writes:

> On Thu, 5 Apr 2018 15:53:07 +1000
> Balbir Singh  wrote:
>> I'm thinking about it. I wonder what "bytes remaining" means for pmem
>> in the context of a machine check exception. Also, do we want to be byte
>> accurate or cache-line accurate for the bytes remaining? The former is much
>> easier than the latter :)
>
> The ideal would be a linear measure of how much of your copy reached
> (or can reach) non-volatile storage with nothing further copied. You
> may have to allow for some relaxing of the semantics depending on
> what the architecture can support.

I think you've got that backwards.  memcpy_mcsafe is used to copy *from*
persistent memory.  The idea is to catch errors when reading pmem, not
writing to it.

> What's the problem with just counting bytes copied like usercopy --
> why is that harder than cacheline accuracy?

He said the former (i.e. bytes) is easier.  So, I think you're on the
same page.  :)

Cheers,
Jeff
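
For reference, the semantics being discussed look roughly like this (the
return convention is assumed here, not final):

/*
 * Sketch: copy cnt bytes from persistent memory at src, tolerating
 * machine checks on the reads. Returns 0 on success, or the number
 * of bytes NOT copied when an MCE interrupted the copy, so callers
 * can report a short read.
 */
unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt);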


[PATCH v4 03/19] powerpc: Mark variable `l` as unused, remove `path`

2018-04-05 Thread Mathieu Malaterre
Mark the `l` variable with the GCC `unused` attribute and replace the
`path` variable with direct use of prom_scratch. This fixes warnings
treated as errors with W=1:

  arch/powerpc/kernel/prom_init.c:607:6: error: variable ‘l’ set but not used [-Werror=unused-but-set-variable]
  arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable]

Suggested-by: Michael Ellerman 
Signed-off-by: Mathieu Malaterre 
---
v4: redo v3 since path variable can be avoided
v3: really move path within ifdef DEBUG_PROM
v2: move path within ifdef DEBUG_PROM

 arch/powerpc/kernel/prom_init.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f8a9a50ff9b5..4b223a9470be 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -604,7 +604,7 @@ static void __init early_cmdline_parse(void)
const char *opt;
 
char *p;
-   int l = 0;
+   int l __maybe_unused = 0;
 
prom_cmd_line[0] = 0;
p = prom_cmd_line;
@@ -1386,7 +1386,7 @@ static void __init reserve_mem(u64 base, u64 size)
 static void __init prom_init_mem(void)
 {
phandle node;
-   char *path, type[64];
+   char type[64];
unsigned int plen;
cell_t *p, *endp;
__be32 val;
@@ -1407,7 +1407,6 @@ static void __init prom_init_mem(void)
prom_debug("root_size_cells: %x\n", rsc);
 
prom_debug("scanning memory:\n");
-   path = prom_scratch;
 
for (node = 0; prom_next_node(); ) {
type[0] = 0;
@@ -1432,9 +1431,9 @@ static void __init prom_init_mem(void)
endp = p + (plen / sizeof(cell_t));
 
 #ifdef DEBUG_PROM
-   memset(path, 0, PROM_SCRATCH_SIZE);
-   call_prom("package-to-path", 3, 1, node, path, 
PROM_SCRATCH_SIZE-1);
-   prom_debug("  node %s :\n", path);
+   memset(prom_scratch, 0, PROM_SCRATCH_SIZE);
+   call_prom("package-to-path", 3, 1, node, prom_scratch, 
PROM_SCRATCH_SIZE - 1);
+   prom_debug("  node %s :\n", prom_scratch);
 #endif /* DEBUG_PROM */
 
while ((endp - p) >= (rac + rsc)) {
-- 
2.11.0



Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Michael S. Tsirkin
On Fri, Apr 06, 2018 at 01:09:43AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2018-04-05 at 17:54 +0300, Michael S. Tsirkin wrote:
> > On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> > > On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > > > There are certain platforms which would like to use SWIOTLB based DMA 
> > > > API
> > > > for bouncing purpose without actually requiring an IOMMU back end. But 
> > > > the
> > > > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > > > selected for devices which have an IOMMU and then the QEMU/host back end
> > > > will process all incoming SG buffer addresses as IOVA instead of simple
> > > > GPA which is the case for simple bounce buffers after being processed 
> > > > with
> > > > SWIOTLB API. To enable this usage, it introduces an architecture 
> > > > specific
> > > > function which will just make virtio core front end select DMA 
> > > > operations
> > > > structure.
> > > > 
> > > > Signed-off-by: Anshuman Khandual 
> > > 
> > > + "Michael S. Tsirkin" 
> > 
> > I'm confused by this.
> > 
> > static bool vring_use_dma_api(struct virtio_device *vdev)
> > {
> > if (!virtio_has_iommu_quirk(vdev))
> > return true;
> > 
> > 
> > Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > hypervisor side sufficient?
> 
> In this specific case, because that would make qemu expect an iommu,
> and there isn't one.


I think that you can set iommu_platform in qemu without an iommu.


> Anshuman, you need to provide more background here. I don't have time
> right now it's late, but explain about the fact that this is for a
> specific type of secure VM which has only a limited pool of (insecure)
> memory that can be shared with qemu, so all IOs need to bounce via that
> pool, which can be achieved by using swiotlb.
> 
> Note: this isn't urgent, we can discuss alternative approaches, this is
> just to start the conversation.
> 
> Cheers,
> Ben.


[PATCH 2/2] KVM: PPC: Book3S HV: lockless tlbie for HPT hcalls

2018-04-05 Thread Nicholas Piggin
tlbies to an LPAR do not have to be serialised since POWER4,
MMU_FTR_LOCKLESS_TLBIE can be used to avoid the spin lock in
do_tlbies.

Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
in HPT mode. 32 instances of the powerpc fork benchmark from selftests
were run with --fork, and the results measured.

Without this patch, total throughput was about 13.5K/sec, and this is
the top of the host profile:

   74.52%  [k] do_tlbies
2.95%  [k] kvmppc_book3s_hv_page_fault
1.80%  [k] calc_checksum
1.80%  [k] kvmppc_vcpu_run_hv
1.49%  [k] kvmppc_run_core

After this patch, throughput was about 51K/sec, with this profile:

   21.28%  [k] do_tlbies
5.26%  [k] kvmppc_run_core
4.88%  [k] kvmppc_book3s_hv_page_fault
3.30%  [k] _raw_spin_lock_irqsave
3.25%  [k] gup_pgd_range

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 78e6a392330f..0221a0f74f07 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -439,6 +439,9 @@ static inline int try_lock_tlbie(unsigned int *lock)
unsigned int tmp, old;
unsigned int token = LOCK_TOKEN;
 
+   if (mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
+   return 1;
+
asm volatile("1:lwarx   %1,0,%2\n"
 "  cmpwi   cr0,%1,0\n"
 "  bne 2f\n"
@@ -452,6 +455,12 @@ static inline int try_lock_tlbie(unsigned int *lock)
return old == 0;
 }
 
+static inline void unlock_tlbie_after_sync(unsigned int *lock)
+{
+   if (mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
+   return;
+}
+
 static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
  long npages, int global, bool need_sync)
 {
@@ -483,7 +492,7 @@ static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
}
 
asm volatile("eieio; tlbsync; ptesync" : : : "memory");
-   kvm->arch.tlbie_lock = 0;
+   unlock_tlbie_after_sync(&kvm->arch.tlbie_lock);
} else {
if (need_sync)
asm volatile("ptesync" : : : "memory");
-- 
2.16.3



[PATCH 1/2] KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode

2018-04-05 Thread Nicholas Piggin
This crashes with a "Bad real address for load" when attempting to load
from the vmalloc region in realmode (the faulting address is in DAR).

  Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 4.16.0-01530-g43d1859f0994
  NIP:  c00155ac LR: c00c2430 CTR: c0015580
  REGS: c00fff76dd80 TRAP: 0200   Not tainted  (4.16.0-01530-g43d1859f0994)
  MSR:  90201003   CR: 4808  XER: 
  CFAR: 000102900ef0 DAR: d00017fffd941a28 DSISR: 0040 SOFTE: 3
  NIP [c00155ac] perf_trace_tlbie+0x2c/0x1a0
  LR [c00c2430] do_tlbies+0x230/0x2f0

I suspect the reason is that the per-cpu data is not in the linear chunk.
The tracepoints could be restored if that were fixed, but for now,
just remove them.

Fixes: 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index e1c083fbe434..78e6a392330f 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -470,8 +470,6 @@ static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
for (i = 0; i < npages; ++i) {
asm volatile(PPC_TLBIE_5(%0,%1,0,0,0) : :
 "r" (rbvalues[i]), "r" (kvm->arch.lpid));
-   trace_tlbie(kvm->arch.lpid, 0, rbvalues[i],
-   kvm->arch.lpid, 0, 0, 0);
}
 
if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG)) {
@@ -492,8 +490,6 @@ static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
for (i = 0; i < npages; ++i) {
asm volatile(PPC_TLBIEL(%0,%1,0,0,0) : :
 "r" (rbvalues[i]), "r" (0));
-   trace_tlbie(kvm->arch.lpid, 1, rbvalues[i],
-   0, 0, 0, 0);
}
asm volatile("ptesync" : : : "memory");
}
-- 
2.16.3



[PATCH 0/2] KVM powerpc tlbie scalability improvement

2018-04-05 Thread Nicholas Piggin
Any reason we still need to take the tlbie lock on modern
processors? 

Nicholas Piggin (2):
  KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
  KVM: PPC: Book3S HV: lockless tlbie for HPT hcalls

 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

-- 
2.16.3



[PATCH 5/5] smp: Lazy synchronization for EQS CPUs in kick_all_cpus_sync()

2018-04-05 Thread Yury Norov
kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast
IPI.  If a CPU is in an extended quiescent state (idle task or nohz_full
userspace), this work may be done at the exit of that state. Delaying
the synchronization helps to save power if the CPU is idle and decreases
latency for real-time tasks.

This patch introduces rcu_get_eqs_cpus() and uses it in
kick_all_cpus_sync() to delay synchronization.

For task isolation (https://lkml.org/lkml/2017/11/3/589), an IPI to a CPU
running an isolated task is fatal, as it breaks isolation. The approach with
lazy synchronization helps to maintain the isolated state.

I've tested it with the test from the task isolation series on ThunderX2 for
more than 10 hours (10k giga-ticks) without breaking isolation.

Signed-off-by: Yury Norov 
---
 include/linux/rcutiny.h |  2 ++
 include/linux/rcutree.h |  1 +
 kernel/rcu/tiny.c   |  9 +
 kernel/rcu/tree.c   | 23 +++
 kernel/smp.c| 21 +
 5 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index ce9beec35e34..dc7e2ea731fa 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -36,6 +36,8 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 /* Never flag non-existent other CPUs! */
 static inline bool rcu_eqs_special_set(int cpu) { return false; }
 
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs);
+
 static inline unsigned long get_state_synchronize_rcu(void)
 {
return 0;
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index fd996cdf1833..7a34eb8c0df3 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -74,6 +74,7 @@ static inline void synchronize_rcu_bh_expedited(void)
 void rcu_barrier(void);
 void rcu_barrier_bh(void);
 void rcu_barrier_sched(void);
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs);
 unsigned long get_state_synchronize_rcu(void);
 void cond_synchronize_rcu(unsigned long oldstate);
 unsigned long get_state_synchronize_sched(void);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a64eee0db39e..d4e94e1b0570 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -128,6 +128,15 @@ void rcu_check_callbacks(int user)
rcu_note_voluntary_context_switch(current);
 }
 
+/*
+ * For tiny RCU, all CPUs are active (non-EQS).
+ */
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs)
+{
+   if (!choose_eqs)
+   cpumask_copy(cpus, cpu_online_mask);
+}
+
 /*
  * Invoke the RCU callbacks on the specified rcu_ctrlblk structure
  * whose grace period has elapsed.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 363f91776b66..cb0d3afe7ea8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -419,6 +419,29 @@ bool rcu_eqs_special_set(int cpu)
return true;
 }
 
+/*
+ * Get the set of EQS CPUs. If @choose_eqs is 0, the set of active
+ * (non-EQS) CPUs is returned instead.
+ *
+ * Call with preemption disabled, and make sure @cpus is cleared.
+ */
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs)
+{
+   int cpu, in_eqs;
+   struct rcu_dynticks *rdtp;
+
+   for_each_online_cpu(cpu) {
+   rdtp = &per_cpu(rcu_dynticks, cpu);
+   in_eqs = rcu_dynticks_in_eqs(atomic_read(&rdtp->dynticks));
+
+   if (in_eqs && choose_eqs)
+   cpumask_set_cpu(cpu, cpus);
+
+   if (!in_eqs && !choose_eqs)
+   cpumask_set_cpu(cpu, cpus);
+   }
+}
+
 /*
  * Let the RCU core know that this CPU has gone through the scheduler,
  * which is a quiescent state.  This is called when the need for a
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..5e6cfb57da22 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -708,19 +708,24 @@ static void do_nothing(void *unused)
 /**
  * kick_all_cpus_sync - Force all cpus out of idle
  *
- * Used to synchronize the update of pm_idle function pointer. It's
- * called after the pointer is updated and returns after the dummy
- * callback function has been executed on all cpus. The execution of
- * the function can only happen on the remote cpus after they have
- * left the idle function which had been called via pm_idle function
- * pointer. So it's guaranteed that nothing uses the previous pointer
- * anymore.
+ * - on current CPU call smp_mb() explicitly;
+ * - on CPUs in extended quiescent state (idle or nohz_full userspace), memory
+ *   is synchronized at the exit of that mode, so do nothing (it's safe
+ *   to delay synchronization because EQS CPUs don't run kernel code);
+ * - on other CPUs fire IPI for synchronization, which implies barrier.
  */
 void kick_all_cpus_sync(void)
 {
+   struct cpumask active_cpus;
+
/* Make sure the change is visible before we kick the cpus */
smp_mb();
-   smp_call_function(do_nothing, NULL, 1);
+
+   cpumask_clear(&active_cpus);
+   preempt_disable();
+   
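
The archived message is truncated here; based on the changelog above (mask
the EQS CPUs, then IPI only the active ones), the remainder of this hunk
plausibly looks like the following sketch -- not the author's verbatim code:

+   rcu_get_eqs_cpus(&active_cpus, 0);  /* collect active (non-EQS) CPUs */
+   smp_call_function_many(&active_cpus, do_nothing, NULL, 1);
+   preempt_enable();
 }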

[PATCH 4/5] rcu: arm64: add rcu_dynticks_eqs_exit_sync()

2018-04-05 Thread Yury Norov
The following patch of the series enables delaying kernel memory
synchronization for CPUs running in an extended quiescent state (EQS)
until the exit from that state.

In the previous patch an ISB was added to the EQS exit path to ensure
that any change made by the kernel patching framework is visible. But
after that isb(), EQS is still enabled for a while, and there's a chance
that some other core will modify text in parallel, and the EQS core
will not be notified about it, as EQS masks the IPI:

CPU0CPU1

ISB
patch_some_text()
kick_all_active_cpus_sync()
exit EQS

// not synchronized!
use_of_patched_text()

This patch introduces the rcu_dynticks_eqs_exit_sync() function and uses
it in arm64 code to issue an isb() after the exit from the quiescent state.

Suggested-by: Mark Rutland 
Signed-off-by: Yury Norov 
---
 arch/arm64/kernel/Makefile | 2 ++
 arch/arm64/kernel/rcu.c| 8 
 kernel/rcu/tree.c  | 4 
 3 files changed, 14 insertions(+)
 create mode 100644 arch/arm64/kernel/rcu.c

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 9b55a3f24be7..c87a203524ab 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -54,6 +54,8 @@ arm64-obj-$(CONFIG_ARM64_RELOC_TEST)  += arm64-reloc-test.o
 arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
 arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 arm64-obj-$(CONFIG_ARM_SDE_INTERFACE)  += sdei.o
+arm64-obj-$(CONFIG_TREE_RCU)   += rcu.o
+arm64-obj-$(CONFIG_PREEMPT_RCU)+= rcu.o
 
 arm64-obj-$(CONFIG_KVM_INDIRECT_VECTORS)+= bpi.o
 
diff --git a/arch/arm64/kernel/rcu.c b/arch/arm64/kernel/rcu.c
new file mode 100644
index ..67fe33c0ea03
--- /dev/null
+++ b/arch/arm64/kernel/rcu.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+
+void rcu_dynticks_eqs_exit_sync(void)
+{
+   isb();
+};
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a734692a581..363f91776b66 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -264,6 +264,8 @@ void rcu_bh_qs(void)
 #define rcu_eqs_special_exit() do { } while (0)
 #endif
 
+void __weak rcu_dynticks_eqs_exit_sync(void) {};
+
 static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
.dynticks_nesting = 1,
.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
@@ -308,6 +310,8 @@ static void rcu_dynticks_eqs_exit(void)
 * critical section.
 */
seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
+   rcu_dynticks_eqs_exit_sync();
+
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 !(seq & RCU_DYNTICK_CTRL_CTR));
if (seq & RCU_DYNTICK_CTRL_MASK) {
-- 
2.14.1



[PATCH 3/5] arm64: early ISB at exit from extended quiescent state

2018-04-05 Thread Yury Norov
This series enables delaying kernel memory synchronization for CPUs
running in an extended quiescent state (EQS) until the exit from that
state.

ARM64 uses the IPI mechanism to notify all cores in an SMP system that
kernel text has changed, and the IPI handler calls isb() to synchronize.

If we no longer deliver the IPI to EQS CPUs, we should add an ISB early
in the EQS exit path.

There are two such paths. One starts in the do_idle() loop, and the other
at the el0_svc entry. For do_idle(), isb() is added in the
arch_cpu_idle_exit() hook; for the SVC handler, isb is called in
el0_svc_naked.

Suggested-by: Will Deacon 
Signed-off-by: Yury Norov 
---
 arch/arm64/kernel/entry.S   | 16 +++-
 arch/arm64/kernel/process.c |  7 +++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c8d9ec363ddd..b1e1c19b4432 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -48,7 +48,7 @@
.endm
 
.macro el0_svc_restore_syscall_args
-#if defined(CONFIG_CONTEXT_TRACKING)
+#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_CONTEXT_TRACKING)
restore_syscall_args
 #endif
.endm
@@ -483,6 +483,19 @@ __bad_stack:
ASM_BUG()
.endm
 
+/*
+ * If the CPU is in an extended quiescent state we need an isb to ensure
+ * that any change of kernel text is visible to the core.
+ */
+   .macro  isb_if_eqs
+#ifndef CONFIG_TINY_RCU
+   bl  rcu_is_watching
+   cbnz    x0, 1f
+   isb                     // pairs with aarch64_insn_patch_text
+1:
+#endif
+   .endm
+
 el0_sync_invalid:
inv_entry 0, BAD_SYNC
 ENDPROC(el0_sync_invalid)
@@ -949,6 +962,7 @@ alternative_else_nop_endif
 
 el0_svc_naked: // compat entry point
+   stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
+   isb_if_eqs
enable_daif
ct_user_exit
el0_svc_restore_syscall_args
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index f08a2ed9db0d..74cad496b07b 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -88,6 +88,13 @@ void arch_cpu_idle(void)
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
+void arch_cpu_idle_exit(void)
+{
+   /* Pairs with aarch64_insn_patch_text() for EQS CPUs. */
+   if (!rcu_is_watching())
+   isb();
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 void arch_cpu_idle_dead(void)
 {
-- 
2.14.1



[PATCH 2/5] arm64: entry: introduce restore_syscall_args macro

2018-04-05 Thread Yury Norov
Syscall arguments are passed in registers x0..x7. If assembler
code has to call C functions before passing control to the syscall
handler, it should restore the original state of those registers
after the call.

Currently, restoring the syscall arguments is open-coded in el0_svc_naked
and __sys_trace. This patch introduces a restore_syscall_args macro for
use there.

Also, the parameter 'syscall = 0' is removed from ct_user_exit to make
el0_svc_naked call restore_syscall_args explicitly. This is needed
because the following patch of the series adds another call to a C
function in el0_svc_naked, so restoring the syscall args is no longer
solely a matter for ct_user_exit.

Signed-off-by: Yury Norov 
---
 arch/arm64/kernel/entry.S | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 9c06b4b80060..c8d9ec363ddd 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -37,22 +37,29 @@
 #include 
 
 /*
- * Context tracking subsystem.  Used to instrument transitions
- * between user and kernel mode.
+ * Save/restore needed during syscalls.  Restore syscall arguments from
+ * the values already saved on stack during kernel_entry.
  */
-   .macro ct_user_exit, syscall = 0
-#ifdef CONFIG_CONTEXT_TRACKING
-   bl  context_tracking_user_exit
-   .if \syscall == 1
-   /*
-* Save/restore needed during syscalls.  Restore syscall arguments from
-* the values already saved on stack during kernel_entry.
-*/
+   .macro restore_syscall_args
ldp x0, x1, [sp]
ldp x2, x3, [sp, #S_X2]
ldp x4, x5, [sp, #S_X4]
ldp x6, x7, [sp, #S_X6]
-   .endif
+   .endm
+
+   .macro el0_svc_restore_syscall_args
+#if defined(CONFIG_CONTEXT_TRACKING)
+   restore_syscall_args
+#endif
+   .endm
+
+/*
+ * Context tracking subsystem.  Used to instrument transitions
+ * between user and kernel mode.
+ */
+   .macro ct_user_exit
+#ifdef CONFIG_CONTEXT_TRACKING
+   bl  context_tracking_user_exit
 #endif
.endm
 
@@ -943,7 +950,8 @@ alternative_else_nop_endif
 el0_svc_naked: // compat entry point
stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
enable_daif
-   ct_user_exit 1
+   ct_user_exit
+   el0_svc_restore_syscall_args
 
tst x16, #_TIF_SYSCALL_WORK // check for syscall hooks
b.ne__sys_trace
@@ -976,10 +984,7 @@ __sys_trace:
mov x1, sp  // pointer to regs
cmp wscno, wsc_nr   // check upper syscall limit
b.hs__ni_sys_trace
-   ldp x0, x1, [sp]// restore the syscall args
-   ldp x2, x3, [sp, #S_X2]
-   ldp x4, x5, [sp, #S_X4]
-   ldp x6, x7, [sp, #S_X6]
+   restore_syscall_args
ldr x16, [stbl, xscno, lsl #3]  // address in the syscall table
blr x16 // call sys_* routine
 
-- 
2.14.1



[PATCH 1/5] arm64: entry: isb in el1_irq

2018-04-05 Thread Yury Norov
The kernel text patching framework relies on an IPI to ensure that other
SMP cores observe the change. The target core calls isb() in the IPI
handler path, but not at the beginning of the el1_irq entry. There's a
chance that a modified instruction is fetched prior to that isb(), and
so the change will not be observed.

This patch inserts an isb early in the el1_irq entry to close that window.

Signed-off-by: Yury Norov 
---
 arch/arm64/kernel/entry.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ec2ee720e33e..9c06b4b80060 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -593,6 +593,7 @@ ENDPROC(el1_sync)
 
.align  6
 el1_irq:
+   isb                     // pairs with aarch64_insn_patch_text
kernel_entry 1
enable_da_f
 #ifdef CONFIG_TRACE_IRQFLAGS
-- 
2.14.1



[PATCH v2 0/2] smp: don't kick CPUs running idle or nohz_full tasks

2018-04-05 Thread Yury Norov
kick_all_cpus_sync() is used to broadcast IPIs to all online CPUs to force
them to synchronize caches, TLBs etc. It is called in only three places -
from mm/slab, arm64 and powerpc code.

We can delay synchronization work for CPUs in extended quiescent state
(idle or nohz_full userspace). 

As Paul E. McKenney wrote: 

--

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change.  This works, but 
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings.  It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usermode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

--

v2 is big rework to address comments in v1:
 - rcu_eqs_special() declaration in public header is dropped, it is not
   used in new implementation. Though, I hope Paul will pick it in his
   tree;
 - for arm64, few isb() added to ensure kernel text synchronization
   (patches 1-4);
 - rcu_get_eqs_cpus() introduced and used to mask EQS CPUs before 
   generating broadcast IPIs;
 - RCU_DYNTICK_CTRL_MASK is not touched because memory barrier is
   implicitly issued in EQS exit path;
 - powerpc is not an exception anymore. I think it's safe to delay
   synchronization for it as well, and I didn't get comments from ppc
   community.
v1:
  https://lkml.org/lkml/2018/3/25/109

Based on next-20180405

Yury Norov (5):
  arm64: entry: isb in el1_irq
  arm64: entry: introduce restore_syscall_args macro
  arm64: ISB early at exit from extended quiescent state
  rcu: arm64: add rcu_dynticks_eqs_exit_sync()
  smp: Lazy synchronization for EQS CPUs in kick_all_cpus_sync()

 arch/arm64/kernel/Makefile  |  2 ++
 arch/arm64/kernel/entry.S   | 52 +++--
 arch/arm64/kernel/process.c |  7 ++
 arch/arm64/kernel/rcu.c |  8 +++
 include/linux/rcutiny.h |  2 ++
 include/linux/rcutree.h |  1 +
 kernel/rcu/tiny.c   |  9 
 kernel/rcu/tree.c   | 27 +++
 kernel/smp.c| 21 +++---
 9 files changed, 105 insertions(+), 24 deletions(-)
 create mode 100644 arch/arm64/kernel/rcu.c

-- 
2.14.1



Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-05 Thread Laurent Dufour
On 04/04/2018 23:53, David Rientjes wrote:
> On Wed, 4 Apr 2018, Laurent Dufour wrote:
> 
>>> I also think the following is needed:
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
>>> vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
>>> vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>>> INIT_LIST_HEAD(&vma->anon_vma_chain);
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +   seqcount_init(&vma->vm_sequence);
>>> +   atomic_set(&vma->vm_ref_count, 0);
>>> +#endif
>>>
>>> err = insert_vm_struct(mm, vma);
>>> if (err)
>>
>> No, this is not needed because the vma is allocated with kmem_cache_zalloc() so
>> vm_ref_count is 0, and insert_vm_struct() will later call
>> __vma_link_rb() which will call seqcount_init().
>>
>> Furthermore, in case of error, the vma structure is freed without calling
>> get_vma() so there is no risk of a lockdep warning.
>>
> 
> Perhaps you're working from a different tree than I am, or you fixed the 
> lockdep warning differently when adding to dup_mmap() and mmap_region().
> 
> I got the following two lockdep errors.
> 
> I fixed it locally by doing the seqcount_init() and atomic_set() 
> everywhere a vma could be initialized.

That's weird, I don't get that on my side with lockdep activated.

There is a call to seqcount_init() in dup_mmap(), in mmap_region() and in
__vma_link_rb(), and that's enough to cover all the cases.

That being said, it would be better to call seqcount_init() each time a
vma structure is allocated. For the vm_ref_count value, as the vma is
zero-allocated most of the time, I don't think this is needed.
I just have to check where new_vma = *old_vma is done, but that often just
follows a vma allocation.
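
A minimal sketch of that generalization (the helper name is made up, not
part of the series):

static inline void vma_init_spf(struct vm_area_struct *vma)
{
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
        seqcount_init(&vma->vm_sequence);
        atomic_set(&vma->vm_ref_count, 0);
#endif
}

Calling this right after every kmem_cache_alloc()/kmem_cache_zalloc() of a
vma would make the initialization independent of which path later links it.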
> 
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 12 PID: 1 Comm: init Not tainted
> Call Trace:
>  [] dump_stack+0x67/0x98
>  [] register_lock_class+0x1e6/0x4e0
>  [] __lock_acquire+0xb9/0x1710
>  [] lock_acquire+0xba/0x200
>  [] mprotect_fixup+0x10f/0x310
>  [] setup_arg_pages+0x12d/0x230
>  [] load_elf_binary+0x44a/0x1740
>  [] search_binary_handler+0x9b/0x1e0
>  [] load_script+0x206/0x270
>  [] search_binary_handler+0x9b/0x1e0
>  [] do_execveat_common.isra.32+0x6b5/0x9d0
>  [] do_execve+0x2c/0x30
>  [] run_init_process+0x2b/0x30
>  [] kernel_init+0x54/0x110
>  [] ret_from_fork+0x3a/0x50
> 
> and
> 
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 21 PID: 1926 Comm: mkdir Not tainted
> Call Trace:
>  [] dump_stack+0x67/0x98
>  [] register_lock_class+0x1e6/0x4e0
>  [] __lock_acquire+0xb9/0x1710
>  [] lock_acquire+0xba/0x200
>  [] unmap_page_range+0x89/0xaa0
>  [] unmap_single_vma+0x8f/0x100
>  [] unmap_vmas+0x4b/0x90
>  [] exit_mmap+0xa3/0x1c0
>  [] mmput+0x73/0x120
>  [] do_exit+0x2bd/0xd60
>  [] SyS_exit+0x17/0x20
>  [] do_syscall_64+0x6d/0x1a0
>  [] entry_SYSCALL_64_after_hwframe+0x26/0x9b
> 
> I think it would just be better to generalize vma allocation to initialize 
> certain fields and init both spf fields properly for 
> CONFIG_SPECULATIVE_PAGE_FAULT.  It's obviously too delicate as is.
> 



Re: [RFC 1/2] powerpc/swiotlb: Dont free up allocated SWIOTLB slab on POWER

2018-04-05 Thread Ram Pai
On Wed, Apr 04, 2018 at 10:48:31PM +1000, Michael Ellerman wrote:
> Anshuman Khandual  writes:
> > Even though SWIOTLB slab gets allocated and initialized on powerpc with
> > swiotlb_init() called during mem_init(), it gets released away again on
> > POWER platform because 'ppc_swiotlb_enable' never gets set. The function
> > swiotlb_detect_4g() checks for 4GB memory and then sets the variable
> > 'ppc_swiotlb_enable' which prevents freeing up the SWIOTLB slab. Lets
> > make POWER platform call swiotlb_detect_4g() during setup_arch() which
> > will keep the SWIOTLB slab through out the runtime.
> >
> > A previous commit cf5621032f ("powerpc/64: Limit ZONE_DMA32 to 4GiB in
> > swiotlb_detect_4g()") enforced 4GB limit on ZONE_DMA32 which is is not
> > applicable on POWER (CONFIG_PPC_BOOK3S_64) platform. Lets remove this
> > unnecessary restriction.
> 
> You're using "POWER" to mean Book3S 64-bit, but "POWER" is something
> else (the ISA for POWER1/POWER2).
> 
> So please just say "Book3S 64-bit" or talk about a specific CPU, eg.
> Power9.
> 
> > After the patch, SWIOTLB slab does not get released.
> >
> > [0.410992] software IO TLB [mem 0xfbff-0x] (64MB) mapped
> > at [767f6cb3-4a10114f]
> 
> But we don't want SWIOTLB on any existing Book3S 64-bit platforms, so
> leaving it enabled is just wasting memory.
> 
> > diff --git a/arch/powerpc/kernel/setup-common.c 
> > b/arch/powerpc/kernel/setup-common.c
> > index d73ec518ef80..c4db844e0b0d 100644
> > --- a/arch/powerpc/kernel/setup-common.c
> > +++ b/arch/powerpc/kernel/setup-common.c
> > @@ -944,6 +944,7 @@ void __init setup_arch(char **cmdline_p)
> > /* Initialize the MMU context management stuff. */
> > mmu_context_init();
> >  
> > +   swiotlb_detect_4g();
> 
> You shouldn't be calling this, your use case has nothing to do with 4GB.
> 
> Instead when you detect that you're running under the ultravisor then
> you should set ppc_swiotlb_enable = 1.

Please assume that you will have an interface to detect if you are
running in a protected environment (AKA secure environment).

So maybe you can rename swiotlb_detect_4g() to swiotlb_detect()
and modify that function to enable or disable ppc_swiotlb_enable
depending on the presence of more than 4GB of memory or of a
protected environment?

RP
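
A rough sketch of what Ram is suggesting; is_protected_environment() is a
hypothetical probe that does not exist at this point, and the 4GB check
mirrors the existing swiotlb_detect_4g():

int __init swiotlb_detect(void)
{
        /* Keep the SWIOTLB slab if there is memory above 4GB... */
        if ((memblock_end_of_DRAM() - 1) > 0xffffffffUL)
                ppc_swiotlb_enable = 1;

        /* ...or if we are running in a protected (secure) environment. */
        if (is_protected_environment())
                ppc_swiotlb_enable = 1;

        return 0;
}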



[PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols

2018-04-05 Thread Mauricio Faria de Oliveira
Some of the boot code located at the start of kernel text is "init"
class, in that it only runs at boot time. However, marking it as normal
init code is problematic because that puts it into a different section
located at the very end of kernel text.

e.g., in case the TOC is not set up, we may not be able to tolerate a
branch trampoline to reach the init function.

Credits: code and message are based on a 2016 patch by Nicholas Piggin,
slightly modified so as not to rename the powerpc code/symbol names.

Subject: [PATCH] powerpc/64: quieten section mismatch warnings
From: Nicholas Piggin 
Date: Fri Dec 23 00:14:19 AEDT 2016

This resolves the following section mismatch warnings:

WARNING: vmlinux.o(.text+0x2fa8): Section mismatch in reference from the 
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x3238): Section mismatch in reference from the 
variable start_here_multiplatform to the function .init.text:early_setup()
The function start_here_multiplatform() references
the function __init early_setup().
This is often because start_here_multiplatform lacks a __init
annotation or the annotation of early_setup is wrong.

WARNING: vmlinux.o(.text+0x326c): Section mismatch in reference from the 
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

Signed-off-by: Mauricio Faria de Oliveira 
---
v3: reword some comments and include errors in commit message
v2: fix build error due to missing parenthesis
 scripts/mod/modpost.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 4ff08a0..d10c9d8 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -1173,8 +1173,15 @@ static const struct sectioncheck *section_mismatch(
  *   fromsec = text section
  *   refsymname = *.constprop.*
  *
+ * Pattern 6:
+ *   powerpc64 has boot functions that reference init, but must remain in text.
+ *   This pattern is identified by
+ *   tosec   = init section
+ *   fromsym = __boot_from_prom, start_here_common, start_here_multiplatform
+ *
  **/
-static int secref_whitelist(const struct sectioncheck *mismatch,
+static int secref_whitelist(const struct elf_info *elf,
+   const struct sectioncheck *mismatch,
const char *fromsec, const char *fromsym,
const char *tosec, const char *tosym)
 {
@@ -1211,6 +1218,17 @@ static int secref_whitelist(const struct sectioncheck 
*mismatch,
match(fromsym, optim_symbols))
return 0;
 
+   /* Check for pattern 6 */
+   if (elf->hdr->e_machine == EM_PPC64)
+   if (match(tosec, init_sections) &&
+   (!strncmp(fromsym, "__boot_from_prom",
+   strlen("__boot_from_prom")) ||
+!strncmp(fromsym, "start_here_common",
+   strlen("start_here_common")) ||
+!strncmp(fromsym, "start_here_multiplatform",
+   strlen("start_here_multiplatform"))))
+   return 0;
+
return 1;
 }
 
@@ -1551,7 +1569,7 @@ static void default_mismatch_handler(const char *modname, 
struct elf_info *elf,
tosym = sym_name(elf, to);
 
/* check whitelist - we may ignore it */
-   if (secref_whitelist(mismatch,
+   if (secref_whitelist(elf, mismatch,
 fromsec, fromsym, tosec, tosym)) {
report_sec_mismatch(modname, mismatch,
fromsec, r->r_offset, fromsym,
-- 
1.8.3.1



Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Benjamin Herrenschmidt
On Thu, 2018-04-05 at 17:54 +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> > On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > > There are certain platforms which would like to use SWIOTLB based DMA API
> > > for bouncing purpose without actually requiring an IOMMU back end. But the
> > > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > > selected for devices which have an IOMMU and then the QEMU/host back end
> > > will process all incoming SG buffer addresses as IOVA instead of simple
> > > GPA which is the case for simple bounce buffers after being processed with
> > > SWIOTLB API. To enable this usage, it introduces an architecture specific
> > > function which will just make virtio core front end select DMA operations
> > > structure.
> > > 
> > > Signed-off-by: Anshuman Khandual 
> > 
> > + "Michael S. Tsirkin" 
> 
> I'm confused by this.
> 
> static bool vring_use_dma_api(struct virtio_device *vdev)
> {
> if (!virtio_has_iommu_quirk(vdev))
> return true;
> 
> 
> Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> hypervisor side sufficient?

In this specific case, because that would make qemu expect an iommu,
and there isn't one.

Anshuman, you need to provide more background here. I don't have time
right now, it's late, but explain the fact that this is for a
specific type of secure VM which has only a limited pool of (insecure)
memory that can be shared with qemu, so all IOs need to bounce via that
pool, which can be achieved by using swiotlb.
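
In other words, the guest-side flow ends up roughly like this conceptual
sketch (assuming swiotlb provides the dma_map_ops; not code from the RFC):

/* Each virtio buffer is bounced through the shared pool. */
static dma_addr_t map_for_host(struct device *dev, void *buf, size_t len)
{
        /* swiotlb copies buf into the (insecure) pool and returns the
         * pool address -- the only memory the host is allowed to see. */
        return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
}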

Note: this isn't urgent, we can discuss alternative approaches, this is
just to start the conversation.

Cheers,
Ben.



Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Dan Williams
On Wed, Apr 4, 2018 at 11:45 PM, Nicholas Piggin  wrote:
> On Thu, 5 Apr 2018 15:53:07 +1000
> Balbir Singh  wrote:
>
>> On Thu, 5 Apr 2018 15:04:05 +1000
>> Nicholas Piggin  wrote:
>>
>> > On Wed, 4 Apr 2018 20:00:52 -0700
>> > Dan Williams  wrote:
>> >
>> > > [ adding Matthew, Christoph, and Tony  ]
>> > >
>> > > On Wed, Apr 4, 2018 at 4:57 PM, Nicholas Piggin  
>> > > wrote:
>> > > > On Thu,  5 Apr 2018 09:19:42 +1000
>> > > > Balbir Singh  wrote:
>> > > >
>> > > >> The pmem infrastructure uses memcpy_mcsafe in the pmem
>> > > >> layer so as to convert machine check exceptions into
>> > > >> a return value on failure in case a machine check
>> > > >> exception is encountered during the memcpy.
>> > > >>
>> > > >> This patch largely borrows from the copyuser_power7
>> > > >> logic and does not add the VMX optimizations, largely
>> > > >> to keep the patch simple. If needed those optimizations
>> > > >> can be folded in.
>> > > >
>> > > > So memcpy_mcsafe doesn't return number of bytes copied?
>> > > > Huh, well that makes it simple.
>> > >
>> > > Well, not in current kernels, but we need to add that support or
>> > > remove the direct call to copy_to_iter() in fs/dax.c. I'm looking
>> > > right now to add "bytes remaining" support to the x86 memcpy_mcsafe(),
>> > > but for copy_to_user we also need to handle bytes remaining for write
>> > > faults. That fix is hopefully something that can land in an early
>> > > 4.17-rc, but it won't be ready for -rc1.
>> >
>> > I wonder if the powerpc implementation should just go straight to
>> > counting bytes. Backporting to this interface would be trivial, but
>> > it would just mean there's only one variant of the code to support.
>> > That's up to Balbir though.
>> >
>>
>> I'm thinking about it, I wonder what "bytes remaining" means for pmem
>> in the context of a machine check exception.
>> accurate or cache-line accurate for the bytes remaining? The former is much
>> easier than the latter :)
>
> The ideal would be a linear measure of how much of your copy reached
> (or can reach) non-volatile storage with nothing further copied. You
> may have to allow for some relaxing of the semantics depending on
> what the architecture can support.
>
> What's the problem with just counting bytes copied like usercopy --
> why is that harder than cacheline accuracy?
>
>> I'd rather implement the existing interface and port/support the new 
>> interface
>> as it becomes available
>
> Fair enough.

I have patches already in progress to change the interface. My
preference is to hold off on adding a new implementation that will
need to be immediately reworked. When I say "immediate" I mean that I
should be able to post what I have for review within the next few
days.

Whether this is all too late for 4.17 is another question...
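
For reference, the two interface styles under discussion look roughly like
this (a sketch, not Dan's actual patch; the second name is hypothetical):

/* Current style: all-or-nothing. Returns 0 on success, or -EFAULT if a
 * machine check hit anywhere during the copy. */
int memcpy_mcsafe(void *dst, const void *src, size_t cnt);

/* "Bytes remaining" style: returns 0 on full success, otherwise the
 * number of bytes NOT copied, letting fs/dax.c complete a partial
 * copy_to_iter() instead of failing the whole read. */
unsigned long memcpy_mcsafe_remaining(void *dst, const void *src, size_t cnt);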


Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Michael S. Tsirkin
On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > There are certain platforms which would like to use SWIOTLB based DMA API
> > for bouncing purpose without actually requiring an IOMMU back end. But the
> > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > selected for devices which have an IOMMU and then the QEMU/host back end
> > will process all incoming SG buffer addresses as IOVA instead of simple
> > GPA which is the case for simple bounce buffers after being processed with
> > SWIOTLB API. To enable this usage, it introduces an architecture specific
> > function which will just make virtio core front end select DMA operations
> > structure.
> > 
> > Signed-off-by: Anshuman Khandual 
> 
> + "Michael S. Tsirkin" 

I'm confused by this.

static bool vring_use_dma_api(struct virtio_device *vdev)
{
if (!virtio_has_iommu_quirk(vdev))
return true;


Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
hypervisor side sufficient?




[PATCH] powerpc/powernv: do not process OPAL events from hard interrupt context

2018-04-05 Thread Nicholas Piggin
Using irq_work for processing batches of OPAL events can cause latency
spikes. irq_work is typically used just to schedule work from NMI
context or for work that specifically needs to execute in interrupt
context. If we are to remain with this approach to OPAL events,
softirqs would be a better fit.

But they are not required either. OPAL events are not particularly
performance or latency critical, and we already have kopald to poll
and run events, so have kopald run them all. Rather than scheduling
them as irq_work, just run them directly from kopald. Enable and
disable interrupts between processing each event.

Event handlers themselves should still use threaded handlers,
workqueues, etc. as necessary to avoid high interrupts-off latencies
within any single interrupt.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal-irqchip.c | 85 ---
 arch/powerpc/platforms/powernv/opal.c | 23 
 arch/powerpc/platforms/powernv/powernv.h  |  3 +-
 3 files changed, 51 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c 
b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 9d1b8c0aaf93..646bfac8e3f5 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -38,37 +37,47 @@ struct opal_event_irqchip {
unsigned long mask;
 };
 static struct opal_event_irqchip opal_event_irqchip;
-
+static u64 last_outstanding_events;
 static unsigned int opal_irq_count;
 static unsigned int *opal_irqs;
 
-static void opal_handle_irq_work(struct irq_work *work);
-static u64 last_outstanding_events;
-static struct irq_work opal_event_irq_work = {
-   .func = opal_handle_irq_work,
-};
-
-void opal_handle_events(uint64_t events)
+void opal_handle_events(void)
 {
-   int virq, hwirq = 0;
-   u64 mask = opal_event_irqchip.mask;
+   __be64 events = 0;
+   u64 e;
+
+   e = last_outstanding_events & opal_event_irqchip.mask;
+again:
+   while (e) {
+   int virq, hwirq;
+
+   hwirq = fls64(e) - 1;
+   e &= ~BIT_ULL(hwirq);
+
+   local_irq_disable();
+   virq = irq_find_mapping(opal_event_irqchip.domain, hwirq);
+   if (virq) {
+   irq_enter();
+   generic_handle_irq(virq);
+   irq_exit();
+   }
+   local_irq_enable();
 
-   if (!in_irq() && (events & mask)) {
-   last_outstanding_events = events;
-   irq_work_queue(&opal_event_irq_work);
-   return;
+   cond_resched();
}
+   last_outstanding_events = 0;
+   if (opal_poll_events(&events) != OPAL_SUCCESS)
+   return;
+   e = be64_to_cpu(events) & opal_event_irqchip.mask;
+   if (e)
+   goto again;
+}
 
-   while (events & mask) {
-   hwirq = fls64(events) - 1;
-   if (BIT_ULL(hwirq) & mask) {
-   virq = irq_find_mapping(opal_event_irqchip.domain,
-   hwirq);
-   if (virq)
-   generic_handle_irq(virq);
-   }
-   events &= ~BIT_ULL(hwirq);
-   }
+bool opal_recheck_events(void)
+{
+   if (last_outstanding_events & opal_event_irqchip.mask)
+   return true;
+   return false;
 }
 
 static void opal_event_mask(struct irq_data *d)
@@ -78,24 +87,9 @@ static void opal_event_mask(struct irq_data *d)
 
 static void opal_event_unmask(struct irq_data *d)
 {
-   __be64 events;
-
set_bit(d->hwirq, &opal_event_irqchip.mask);
-
-   opal_poll_events(&events);
-   last_outstanding_events = be64_to_cpu(events);
-
-   /*
-* We can't just handle the events now with opal_handle_events().
-* If we did we would deadlock when opal_event_unmask() is called from
-* handle_level_irq() with the irq descriptor lock held, because
-* calling opal_handle_events() would call generic_handle_irq() and
-* then handle_level_irq() which would try to take the descriptor lock
-* again. Instead queue the events for later.
-*/
if (last_outstanding_events & opal_event_irqchip.mask)
-   /* Need to retrigger the interrupt */
-   irq_work_queue(&opal_event_irq_work);
+   opal_wake_poller();
 }
 
 static int opal_event_set_type(struct irq_data *d, unsigned int flow_type)
@@ -136,16 +130,13 @@ static irqreturn_t opal_interrupt(int irq, void *data)
__be64 events;
 
opal_handle_interrupt(virq_to_hw(irq), &events);
-   opal_handle_events(be64_to_cpu(events));
+   last_outstanding_events = be64_to_cpu(events);
+   if (last_outstanding_events & opal_event_irqchip.mask)
+   opal_wake_poller();
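
The archived message is cut off before the opal.c and powernv.h hunks.
Based on the changelog -- kopald polls and runs all events directly -- the
kopald loop presumably ends up looking something like this sketch (shape
assumed from the description, not the verbatim patch; opal_heartbeat is
the existing poll interval):

static int kopald(void *unused)
{
        set_freezable();
        do {
                try_to_freeze();

                opal_handle_events();   /* run pending events, irqs on */

                set_current_state(TASK_INTERRUPTIBLE);
                if (opal_recheck_events())      /* more arrived meanwhile? */
                        __set_current_state(TASK_RUNNING);
                else
                        schedule_timeout(msecs_to_jiffies(opal_heartbeat));
        } while (!kthread_should_stop());

        return 0;
}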
 

Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Dan Williams
On Thu, Apr 5, 2018 at 5:43 AM, Oliver  wrote:
> On Thu, Apr 5, 2018 at 10:11 PM, Michael Ellerman  wrote:
>> Oliver  writes:
>> ...
>>>
>>> For context Balbir is working with me on some of the pmem stuff. You
>>> probably want an Ack from Rob rather than one of us.
>>
>> I'll ack it if you make all the niggly nit picky trivial annoying
>> changes I asked for :D
>
> *groan*
>
> Fine, I'll respin it tomorrow. If anyone else has comments now would
> be the time to make them.

Please also include my niggly nit picky trivial annoying bike shed
color for the driver name to *not* use the "nd_region" suffix for a
driver registering "nvdimm_bus" objects. "of_pmem_range" or
"of_pmem_bus" or almost anything else would be fine.


Re: powerpc/64s/idle: POWER9 restore AMOR after deep sleep

2018-04-05 Thread Michael Ellerman
On Thu, 2018-04-05 at 06:10:00 UTC, Nicholas Piggin wrote:
> POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
> because it does not go through the subcore restore.
> 
> Have POWER9 restore it in core restore.
> 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c1b25a17d24925b0961c319cfc3fd7

cheers


Re: [1/2] powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit

2018-04-05 Thread Michael Ellerman
On Thu, 2018-04-05 at 05:57:54 UTC, Nicholas Piggin wrote:
> The pkey code added a CPU_FTR_PKEY bit, but did not add it to the
> dt_cpu_ftrs feature set. Although the capability is supported by all
> processors in the base dt_cpu_ftrs set for 64s, it's a significant
> and sufficiently well defined feature to make it optional. So add
> it as a quirk for now, which can be versioned out then controlled
> by the firmware (once dt_cpu_ftrs gains versioning support).
> 
> Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem ")
> Cc: Ram Pai 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c130153e453cba0f37ad10fa18a1aa

cheers


Re: powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits

2018-04-05 Thread Michael Ellerman
On Thu, 2018-04-05 at 05:50:49 UTC, Nicholas Piggin wrote:
> Presently the dt_cpu_ftrs restore_cpu will only add bits to the LPCR
> for secondaries, but some bits must be removed (e.g., UPRT for HPT).
> Not clearing these bits on secondaries causes checkstops when booting
> with disable_radix.
> 
> restore_cpu can not just set LPCR, because it is also called by the
> idle wakeup code which relies on opal_slw_set_reg to restore the value
> of LPCR, at least on P8 which does not save LPCR to stack in the idle
> code.
> 
> Fix this by including a mask of bits to clear from LPCR as well, which
> is used by restore_cpu.
> 
> This is a little messy now, but it's a minimal fix that can be
> backported.  Longer term, the idle SPR save/restore code can be
> reworked to completely avoid calls to restore_cpu, then restore_cpu
> would be able to unconditionally set LPCR to match boot processor
> environment.
> 
> Fixes: 5a61ef74f269f ("powerpc/64s: Support new device tree binding for 
> discovering CPU features")
> Cc: sta...@vger.kernel.org # v4.12+
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a57ac411832384eb93df4bfed2bf64

cheers


Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Anshuman Khandual
On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> There are certain platforms which would like to use SWIOTLB based DMA API
> for bouncing purpose without actually requiring an IOMMU back end. But the
> virtio core does not allow such mechanism. Right now DMA MAP API is only
> selected for devices which have an IOMMU and then the QEMU/host back end
> will process all incoming SG buffer addresses as IOVA instead of simple
> GPA which is the case for simple bounce buffers after being processed with
> SWIOTLB API. To enable this usage, it introduces an architecture specific
> function which will just make virtio core front end select DMA operations
> structure.
> 
> Signed-off-by: Anshuman Khandual 

+ "Michael S. Tsirkin" 



[PATCH] powerpc/64: irq_work avoid immediate interrupt when raised with hard irqs enabled

2018-04-05 Thread Nicholas Piggin
irq_work_raise should not schedule the hardware decrementer interrupt
unless it is called from NMI context. Doing so often just results in an
immediate masked decrementer interrupt:

   <...>-550    90d...    4us : update_curr_rt <-dequeue_task_rt
   <...>-550    90d...    5us : dbs_update_util_handler <-update_curr_rt
   <...>-550    90d...    6us : arch_irq_work_raise <-irq_work_queue
   <...>-550    90d...    7us : soft_nmi_interrupt <-soft_nmi_common
   <...>-550    90d...    7us : printk_nmi_enter <-soft_nmi_interrupt
   <...>-550    90d.Z.    8us : rcu_nmi_enter <-soft_nmi_interrupt
   <...>-550    90d.Z.    9us : rcu_nmi_exit <-soft_nmi_interrupt
   <...>-550    90d...    9us : printk_nmi_exit <-soft_nmi_interrupt
   <...>-550    90d...   10us : cpuacct_charge <-update_curr_rt

Set the decrementer pending in the irq_happened mask directly, rather
than having the masked decrementer handler do it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/time.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index a32823dcd9a4..9d1cc183c974 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -510,6 +510,35 @@ static inline void clear_irq_work_pending(void)
"i" (offsetof(struct paca_struct, irq_work_pending)));
 }
 
+void arch_irq_work_raise(void)
+{
+   WARN_ON(!irqs_disabled());
+
+   preempt_disable();
+   set_irq_work_pending_flag();
+   /*
+* Regular interrupts will check pending irq_happened as they return,
+* or process context when it next enables interrupts, so the
+* decrementer can be scheduled there.
+*
+* NMI interrupts do not, so setting the decrementer hardware
+* interrupt to fire ensures the work runs upon RI (if it's to a
+* MSR[EE]=1 context). We do not want to do this in other contexts
+* because if interrupts are hard enabled, the decrementer will
+* fire immediately here and just go to the masked handler to be
+* recorded in irq_happened.
+*
+* BookE does not support this yet; it must first be audited that
+* all NMI interrupt handlers call nmi_enter().
+*/
+   if (IS_ENABLED(CONFIG_BOOKE) || in_nmi()) {
+   set_dec(1);
+   } else {
+   local_paca->irq_happened |= PACA_IRQ_DEC;
+   }
+   preempt_enable();
+}
+
 #else /* 32-bit */
 
 DEFINE_PER_CPU(u8, irq_work_pending);
@@ -518,16 +547,18 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()
__this_cpu_read(irq_work_pending)
 #define clear_irq_work_pending()   __this_cpu_write(irq_work_pending, 0)
 
-#endif /* 32 vs 64 bit */
-
 void arch_irq_work_raise(void)
 {
+   WARN_ON(!irqs_disabled());
+
preempt_disable();
set_irq_work_pending_flag();
set_dec(1);
preempt_enable();
 }
 
+#endif /* 32 vs 64 bit */
+
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()0
-- 
2.16.3



Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-05 Thread Laurent Dufour
On 04/04/2018 23:59, Jerome Glisse wrote:
> On Wed, Apr 04, 2018 at 06:26:44PM +0200, Laurent Dufour wrote:
>>
>>
>> On 03/04/2018 21:39, Jerome Glisse wrote:
>>> On Tue, Mar 13, 2018 at 06:59:45PM +0100, Laurent Dufour wrote:
 When dealing with the speculative fault path we should use the VMA's field
 cached value stored in the vm_fault structure.

 Currently vm_normal_page() is using the pointer to the VMA to fetch the
 vm_flags value. This patch provides a new __vm_normal_page() which is
 receiving the vm_flags flags value as parameter.

 Note: The speculative path is turned on for architecture providing support
 for special PTE flag. So only the first block of vm_normal_page is used
 during the speculative path.
>>>
>>> Might be a good idea to explicitly have SPECULATIVE Kconfig option depends
>>> on ARCH_PTE_SPECIAL and a comment for !HAVE_PTE_SPECIAL in the function
>>> explaining that speculative page fault should never reach that point.
>>
>> Unfortunately there is no ARCH_PTE_SPECIAL in the config file, it is
>> defined in the per-architecture header files.
>> So I can't do anything in the Kconfig file.
> 
> Maybe adding a new Kconfig symbol for ARCH_PTE_SPECIAL very much like
> others ARCH_HAS_
> 
>>
>> However, I can check that at build time, and doing such a check in
>> __vm_normal_page sounds to be a good place, like that:
>>
>> @@ -869,6 +870,14 @@ struct page *__vm_normal_page(struct vm_area_struct 
>> *vma,
>> unsigned long addr,
>>
>> /* !HAVE_PTE_SPECIAL case follows: */
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +   /* This part should never get called when the speculative page fault
>> +* handler is turned on. This is mainly because we can't rely on
>> +* vm_start.
>> +*/
>> +#error CONFIG_SPECULATIVE_PAGE_FAULT requires HAVE_PTE_SPECIAL
>> +#endif
>> +
>> if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>> if (vma_flags & VM_MIXEDMAP) {
>> if (!pfn_valid(pfn))
>>
> 
> I am not a fan of #if/#else/#endif in code. But that's a taste thing.
> I honestly think that adding a Kconfig for special pte is the cleanest
> solution.

I do agree, but this should be done in a separate series.

I'll see how this could be done, but there are some arches (like powerpc)
where this is a bit obfuscated, for unknown reasons.

For the time being, I'll remove the check and just leave the comment in place.
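
For illustration, the Kconfig symbol Jerome suggests would look something
like this sketch (ARCH_HAS_PTE_SPECIAL does not exist at this point; each
architecture that provides pte_special() would select it):

config ARCH_HAS_PTE_SPECIAL
        bool

config SPECULATIVE_PAGE_FAULT
        bool "Speculative page faults"
        depends on ARCH_HAS_PTE_SPECIAL && SMP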



[PATCH] powerpc/64s: Fix section mismatch warnings from setup_rfi_flush()

2018-04-05 Thread Michael Ellerman
The recent LPM changes to setup_rfi_flush() are causing some section
mismatch warnings because we removed the __init annotation on
setup_rfi_flush():

  The function setup_rfi_flush() references
  the function __init ppc64_bolted_size().
  the function __init memblock_alloc_base().

The references are actually in init_fallback_flush(), but that is
inlined into setup_rfi_flush().

These references are safe because:
 - only pseries calls setup_rfi_flush() at runtime
 - pseries always passes L1D_FLUSH_FALLBACK at boot
 - so the fallback flush area will always be allocated
 - so the check in init_fallback_flush() will always return early:
   /* Only allocate the fallback flush area once (at boot time). */
   if (l1d_flush_fallback_area)
return;

 - and therefore we won't actually call the freed init routines.

We should rework the code to make it safer by default rather than
relying on the above, but for now as a quick-fix just add a __ref
annotation to squash the warning.

Fixes: abf110f3e1ce ("powerpc/rfi-flush: Make it possible to call 
setup_rfi_flush() again")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/setup_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 66f2b6299c40..44c30dd38067 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -880,7 +880,7 @@ void rfi_flush_enable(bool enable)
rfi_flush = enable;
 }
 
-static void init_fallback_flush(void)
+static void __ref init_fallback_flush(void)
 {
u64 l1d_size, limit;
int cpu;
-- 
2.14.1



Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Oliver
On Thu, Apr 5, 2018 at 10:11 PM, Michael Ellerman  wrote:
> Oliver  writes:
> ...
>>
>> For context Balbir is working with me on some of the pmem stuff. You
>> probably want an Ack from Rob rather than one of us.
>
> I'll ack it if you make all the niggly nit picky trivial annoying
> changes I asked for :D

*groan*

Fine, I'll respin it tomorrow. If anyone else has comments now would
be the time to make them.

>
> cheers


Re: [PATCH v2 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Balbir Singh
On Thu, Apr 5, 2018 at 9:26 PM, Oliver  wrote:
> On Thu, Apr 5, 2018 at 5:14 PM, Balbir Singh  wrote:
>> The pmem infrastructure uses memcpy_mcsafe in the pmem
>> layer so as to convert machine check exceptions into
>> a return value on failure in case a machine check
>> exception is encountered during the memcpy.
>>
>
> Would it be possible to move the bulk of the copyuser code into a
> seperate file which can be #included once the these err macros are
> defined? Anton's memcpy is pretty hairy and I don't think anyone wants
> to have multiple copies of it in the tree, even in a cut down form.
>

I've split it out for now, in the future that might be a good thing to do.
The copy_tofrom_user_power7 code falls back on __copy_tofrom_user_base
to track exactly how much is left over. Adding these changes there would
create larger churn and need way more testing. I've taken this short-cut
for now with a promise to fix that as the semantics of memcpy_mcsafe()
change to do more accurate tracking of how much was copied over.

Balbir Singh.


Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Michael Ellerman
Oliver  writes:
...
>
> For context Balbir is working with me on some of the pmem stuff. You
> probably want an Ack from Rob rather than one of us.

I'll ack it if you make all the niggly nit picky trivial annoying
changes I asked for :D

cheers


Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-05 Thread Oliver
On Thu, Apr 5, 2018 at 12:21 AM, Dan Williams  wrote:
> On Wed, Apr 4, 2018 at 7:04 AM, Oliver  wrote:
>> On Wed, Apr 4, 2018 at 10:07 PM, Balbir Singh  wrote:
>>> On Tue, 3 Apr 2018 10:37:51 -0700
>>> Dan Williams  wrote:
>>>
 On Tue, Apr 3, 2018 at 7:24 AM, Oliver O'Halloran  wrote:
 > Add device-tree binding documentation for the nvdimm region driver.
 >
 > Cc: devicet...@vger.kernel.org
 > Signed-off-by: Oliver O'Halloran 
 > ---
 > v2: Changed name from nvdimm-region to pmem-region.
 > Cleaned up the example binding and fixed the overlapping regions.
 > Added support for multiple regions in a single reg.
 > ---
 >  .../devicetree/bindings/pmem/pmem-region.txt   | 80 
 > ++
 >  MAINTAINERS|  1 +
 >  2 files changed, 81 insertions(+)
 >  create mode 100644 
 > Documentation/devicetree/bindings/pmem/pmem-region.txt

 Device-tree folks, does this look, ok?

 Oliver, is there any concept of a management interface to the
 device(s) backing these regions? libnvdimm calls these "nmem" devices
 and support operations like health status and namespace label
 management.
>>
>> It's something I'm planning on implementing as soon as someone gives
>> me some hardware that isn't hacked up lab crap. I'm posting this
>> version with just regions since people have been asking for something
>> in upstream even if it's not fully featured.
>>
>> Grumbling aside, the plan is to have separate drivers for the DIMM
>> type. Discovering DIMM devices happens via the normal discovery
>> mechanisms (e.g. an NVDIMM supporting the JEDEC interface is an I2C
>> device) and when binding to a specific DIMM device it registers a DIMM
>> descriptor structure and a ndctl implementation for that DIMM type
>> with of_pmem. When of_pmem binds to a region it can plug everything
>> into the region specific bus. There's a few details to work out, but I
>> think it's a reasonable approach.
>
> Yeah, that sounds reasonable. It would mean that your management
> interface would need to understand that nmems on different buses could
> potentially move to another bus after a reconfiguration, but that's
> not too much different than the ACPI case where nmems can join and
> leave regions after a reset / reconfig.
>
>>> We would need a way to have nmem and pmem-regions find each other. Since we
>>> don't have the ACPI abstractions, the nmem region would need to add the
>>> ability for a driver to have a phandle to the interleaving and nmem 
>>> properties.
>>>
>>> I guess that would be a separate driver, that would manage the nmem devices
>>> and there would be a way to relate the pmem and nmems. Oliver?
>>
>> Yes, that's the plan.
>
> So Balbir, is that enough for an Acked-by for this device-tree proposal?

For context Balbir is working with me on some of the pmem stuff. You
probably want an Ack from Rob rather than one of us.


Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Anshuman Khandual
On 04/05/2018 04:44 PM, Balbir Singh wrote:
> On Thu, Apr 5, 2018 at 8:56 PM, Anshuman Khandual
>  wrote:
>> There are certain platforms which would like to use SWIOTLB based DMA API
>> for bouncing purpose without actually requiring an IOMMU back end. But the
>> virtio core does not allow such mechanism. Right now DMA MAP API is only
>> selected for devices which have an IOMMU and then the QEMU/host back end
>> will process all incoming SG buffer addresses as IOVA instead of simple
>> GPA which is the case for simple bounce buffers after being processed with
>> SWIOTLB API. To enable this usage, it introduces an architecture specific
>> function which will just make virtio core front end select DMA operations
>> structure.
>>
>> Signed-off-by: Anshuman Khandual 
>> ---
>> This RFC is just to get some feedback. Please ignore the function call
>> back into the architecture. It can be worked out properly later on. But
>> the question is can we have virtio devices in the guest which would like
>> to use SWIOTLB based (or any custom DMA API based) bounce buffering with
>> out actually being an IOMMU devices emulated by QEMU/host as been with
>> the current VIRTIO_F_IOMMU_PLATFORM virtio flag ?
>>
>>  arch/powerpc/platforms/pseries/iommu.c | 6 ++
>>  drivers/virtio/virtio_ring.c   | 4 
>>  include/linux/virtio.h | 2 ++
>>  3 files changed, 12 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 06f02960b439..dd15fbddbe89 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
>>  __setup("multitce=", disable_multitce);
>>
>>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
>> +
>> +bool is_virtio_dma_platform(void)
>> +{
>> +   return true;
>> +}
>> +EXPORT_SYMBOL(is_virtio_dma_platform);
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 71458f493cf8..9f205a79d378 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -144,6 +144,10 @@ struct vring_virtqueue {
>>
>>  static bool vring_use_dma_api(struct virtio_device *vdev)
>>  {
>> +   /* Use DMA API even for virtio devices without an IOMMU */
>> +   if (is_virtio_dma_platform())
>> +   return true;
>> +
>> if (!virtio_has_iommu_quirk(vdev))
>> return true;
>>
>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>> index 988c7355bc22..d8bb83d753ea 100644
>> --- a/include/linux/virtio.h
>> +++ b/include/linux/virtio.h
>> @@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct 
>> device_driver *drv)
>>  int register_virtio_driver(struct virtio_driver *drv);
>>  void unregister_virtio_driver(struct virtio_driver *drv);
>>
>> +extern bool is_virtio_dma_platform(void);
>> +
> 
> Where is the default implementation for non-pseries platforms? Will they 
> compile
> after these changes?

No, they won't. This is just an RFC asking for suggestions/feedback on a
particular direction; I will clean up the code later on once we agree
on this.
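
For reference, the conventional way to keep other platforms building is a
weak default that pseries then overrides (a sketch, not part of the RFC
as posted):

bool __weak is_virtio_dma_platform(void)
{
        return false;   /* default: don't force DMA ops for virtio */
}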



Re: [PATCH v2 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Oliver
On Thu, Apr 5, 2018 at 5:14 PM, Balbir Singh  wrote:
> The pmem infrastructure uses memcpy_mcsafe in the pmem
> layer so as to convert machine check exceptions into
> a return value on failure in case a machine check
> exception is encountered during the memcpy.
>
> This patch largely borrows from the copyuser_power7
> logic and does not add the VMX optimizations, largely
> to keep the patch simple. If needed those optimizations
> can be folded in.
>
> Signed-off-by: Balbir Singh 
> Acked-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/string.h   |   2 +
>  arch/powerpc/lib/Makefile   |   2 +-
>  arch/powerpc/lib/memcpy_mcsafe_64.S | 212 
> 
>  3 files changed, 215 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
>
> diff --git a/arch/powerpc/include/asm/string.h 
> b/arch/powerpc/include/asm/string.h
> index 9b8cedf618f4..b7e872a64726 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -30,7 +30,9 @@ extern void * memcpy_flushcache(void *,const void 
> *,__kernel_size_t);
>  #ifdef CONFIG_PPC64
>  #define __HAVE_ARCH_MEMSET32
>  #define __HAVE_ARCH_MEMSET64
> +#define __HAVE_ARCH_MEMCPY_MCSAFE
>
> +extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
>  extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
>  extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
>  extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
> diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
> index 3c29c9009bbf..048afee9f518 100644
> --- a/arch/powerpc/lib/Makefile
> +++ b/arch/powerpc/lib/Makefile
> @@ -24,7 +24,7 @@ endif
>
>  obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
>copyuser_power7.o string_64.o copypage_power7.o memcpy_power7.o \
> -  memcpy_64.o memcmp_64.o pmem.o
> +  memcpy_64.o memcmp_64.o pmem.o memcpy_mcsafe_64.o
>
>  obj64-$(CONFIG_SMP)+= locks.o
>  obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
> diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
> b/arch/powerpc/lib/memcpy_mcsafe_64.S
> new file mode 100644
> index ..e7eaa9b6cded
> --- /dev/null
> +++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
> @@ -0,0 +1,212 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) IBM Corporation, 2011
> + * Derived from copyuser_power7.s by Anton Blanchard 
> + * Author - Balbir Singh 
> + */
> +#include 
> +#include 
> +
> +   .macro err1
> +100:
> +   EX_TABLE(100b,.Ldo_err1)
> +   .endm
> +
> +   .macro err2
> +200:
> +   EX_TABLE(200b,.Ldo_err2)
> +   .endm

Would it be possible to move the bulk of the copyuser code into a
separate file which can be #included once these err macros are
defined? Anton's memcpy is pretty hairy and I don't think anyone wants
to have multiple copies of it in the tree, even in a cut-down form.

> +
> +.Ldo_err2:
> +   ld  r22,STK_REG(R22)(r1)
> +   ld  r21,STK_REG(R21)(r1)
> +   ld  r20,STK_REG(R20)(r1)
> +   ld  r19,STK_REG(R19)(r1)
> +   ld  r18,STK_REG(R18)(r1)
> +   ld  r17,STK_REG(R17)(r1)
> +   ld  r16,STK_REG(R16)(r1)
> +   ld  r15,STK_REG(R15)(r1)
> +   ld  r14,STK_REG(R14)(r1)
> +   addir1,r1,STACKFRAMESIZE
> +.Ldo_err1:
> +   li  r3,-EFAULT
> +   blr
> +
> +
> +_GLOBAL(memcpy_mcsafe)
> +   cmpldi  r5,16
> +   blt .Lshort_copy
> +
> +.Lcopy:
> +   /* Get the source 8B aligned */
> +   neg r6,r4
> +   mtocrf  0x01,r6
> +   clrldi  r6,r6,(64-3)
> +
> +   bf  cr7*4+3,1f
> +err1;  lbz r0,0(r4)
> +   addir4,r4,1
> +err1;  stb r0,0(r3)
> +   addir3,r3,1
> +
> +1: bf  cr7*4+2,2f
> +err1;  lhz r0,0(r4)
> +   addir4,r4,2
> +err1;  sth r0,0(r3)
> +   addir3,r3,2
> +
> +2: bf  cr7*4+1,3f
> +err1;  lwz r0,0(r4)
> +   addir4,r4,4
> +err1;  stw r0,0(r3)
> +   addir3,r3,4
> +
> +3: sub r5,r5,r6
> +   cmpldi  r5,128
> +   blt 5f
> +
> +   mflrr0
> +   stdur1,-STACKFRAMESIZE(r1)
> +   std r14,STK_REG(R14)(r1)
> +   std r15,STK_REG(R15)(r1)
> +   std r16,STK_REG(R16)(r1)
> +   std r17,STK_REG(R17)(r1)
> +   std r18,STK_REG(R18)(r1)
> +   std r19,STK_REG(R19)(r1)
> +   std r20,STK_REG(R20)(r1)
> +   std r21,STK_REG(R21)(r1)
> +   std r22,STK_REG(R22)(r1)
> +   std r0,STACKFRAMESIZE+16(r1)
> +
> +   srdir6,r5,7
> +   mtctr   r6
> +
> +   /* Now do cacheline (128B) sized loads and stores. */
> +   .align  5
> +4:
> +err2;  ld  r0,0(r4)
> +err2;  ld  r6,8(r4)
> +err2;  ld  r7,16(r4)
> +err2;  ld  r8,24(r4)
> +err2;  ld  

Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Balbir Singh
On Thu, Apr 5, 2018 at 8:56 PM, Anshuman Khandual
 wrote:
> There are certain platforms which would like to use a SWIOTLB based DMA API
> for bouncing purposes without actually requiring an IOMMU back end. But the
> virtio core does not allow such a mechanism. Right now the DMA MAP API is
> only selected for devices which have an IOMMU, and then the QEMU/host back
> end will process all incoming SG buffer addresses as IOVA instead of simple
> GPA, which is the case for simple bounce buffers after being processed with
> the SWIOTLB API. To enable this usage, this patch introduces an architecture
> specific function which makes the virtio core front end select the DMA
> operations structure.
>
> Signed-off-by: Anshuman Khandual 
> ---
> This RFC is just to get some feedback. Please ignore the function call
> back into the architecture. It can be worked out properly later on. But
> the question is: can we have virtio devices in the guest which would like
> to use SWIOTLB based (or any custom DMA API based) bounce buffering without
> actually being IOMMU devices emulated by QEMU/host, as has been the case
> with the current VIRTIO_F_IOMMU_PLATFORM virtio flag?
>
>  arch/powerpc/platforms/pseries/iommu.c | 6 ++
>  drivers/virtio/virtio_ring.c   | 4 
>  include/linux/virtio.h | 2 ++
>  3 files changed, 12 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 06f02960b439..dd15fbddbe89 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
>  __setup("multitce=", disable_multitce);
>
>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> +
> +bool is_virtio_dma_platform(void)
> +{
> +   return true;
> +}
> +EXPORT_SYMBOL(is_virtio_dma_platform);
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71458f493cf8..9f205a79d378 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -144,6 +144,10 @@ struct vring_virtqueue {
>
>  static bool vring_use_dma_api(struct virtio_device *vdev)
>  {
> +   /* Use DMA API even for virtio devices without an IOMMU */
> +   if (is_virtio_dma_platform())
> +   return true;
> +
> if (!virtio_has_iommu_quirk(vdev))
> return true;
>
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 988c7355bc22..d8bb83d753ea 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct 
> device_driver *drv)
>  int register_virtio_driver(struct virtio_driver *drv);
>  void unregister_virtio_driver(struct virtio_driver *drv);
>
> +extern bool is_virtio_dma_platform(void);
> +

Where is the default implementation for non-pseries platforms? Will they compile
after these changes?

Balbir


[RFC 2/2] powerpc/mm/memtrace: Let the arch hotunplug code flush cache

2018-04-05 Thread Balbir Singh
Don't do this via custom code; instead, now that we have support
in the arch hotplug/hotunplug code, rely on those routines
to do the right thing.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/platforms/powernv/memtrace.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index de470caf0784..fc222a0c2ac4 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -82,19 +82,6 @@ static const struct file_operations memtrace_fops = {
.open   = simple_open,
 };
 
-static void flush_memory_region(u64 base, u64 size)
-{
-   unsigned long line_size = ppc64_caches.l1d.size;
-   u64 end = base + size;
-   u64 addr;
-
-   base = round_down(base, line_size);
-   end = round_up(end, line_size);
-
-   for (addr = base; addr < end; addr += line_size)
-   asm volatile("dcbf 0,%0" : "=r" (addr) :: "memory");
-}
-
 static int check_memblock_online(struct memory_block *mem, void *arg)
 {
if (mem->state != MEM_ONLINE)
@@ -132,10 +119,6 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, 
u64 nr_pages)
walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
  change_memblock_state);
 
-   /* RCU grace period? */
-   flush_memory_region((u64)__va(start_pfn << PAGE_SHIFT),
-   nr_pages << PAGE_SHIFT);
-
lock_device_hotplug();
remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
unlock_device_hotplug();
-- 
2.13.6



[RFC 1/2] powerpc/mm: Flush cache on memory hot(un)plug

2018-04-05 Thread Balbir Singh
This patch adds support for flushing potentially dirty cache lines when
memory is hot-plugged/hot-unplugged.  The support is currently limited
to 64 bit systems.

The bug was exposed when mappings for a coherent memory device were
actually hot-unplugged and plugged back in later.  A similar issue was
observed during the development of memtrace, but memtrace does its own
flushing of the region via a custom routine.

These patches flush on both hotplug and unplug to clear any stale data
in the cache w.r.t. mappings; there is a small race window where a clean
cache line may be created again just prior to tearing down the mapping.
Flushing on establishment of new mappings helps start from a clean
state.

The patches were tested by disabling the flush routines in memtrace and
doing I/O on the trace file. The system immediately checkstops (quite
reliably, provided that prior to the hot-unplug of the memtrace region
we memset the regions about to be hot-unplugged). After these patches no
custom flushing is needed in the memtrace code.

Signed-off-by: Balbir Singh 
---
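For context, a minimal sketch (an assumption; the real routine is
implemented in assembly) of what flush_inval_dcache_range() does per
cache line:

/* Sketch only: write back and invalidate each L1 data cache line in
 * [start, stop) with dcbf, then order with a sync. Line size comes
 * from ppc64_caches; the asm implementation is the authority. */
static inline void flush_inval_dcache_range_sketch(unsigned long start,
						   unsigned long stop)
{
	unsigned long line = ppc64_caches.l1d.line_size;
	unsigned long addr;

	for (addr = start & ~(line - 1); addr < stop; addr += line)
		asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
	asm volatile("sync" : : : "memory");
}
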
 arch/powerpc/mm/mem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 85245ef97e72..0a8959b15b39 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -143,6 +143,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, 
struct vmem_altmap *
start, start + size, rc);
return -EFAULT;
}
+   flush_inval_dcache_range(start, start + size);
 
return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
 }
@@ -169,6 +170,7 @@ int __meminit arch_remove_memory(u64 start, u64 size, 
struct vmem_altmap *altmap
 
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
+   flush_inval_dcache_range(start, start + size);
ret = remove_section_mapping(start, start + size);
 
/* Ensure all vmalloc mappings are flushed in case they also
-- 
2.13.6



[RFC] virtio: Use DMA MAP API for devices without an IOMMU

2018-04-05 Thread Anshuman Khandual
There are certain platforms which would like to use a SWIOTLB based DMA API
for bouncing purposes without actually requiring an IOMMU back end. But the
virtio core does not allow such a mechanism. Right now the DMA MAP API is
only selected for devices which have an IOMMU, and then the QEMU/host back
end will process all incoming SG buffer addresses as IOVA instead of simple
GPA, which is the case for simple bounce buffers after being processed with
the SWIOTLB API. To enable this usage, this patch introduces an architecture
specific function which makes the virtio core front end select the DMA
operations structure.

Signed-off-by: Anshuman Khandual 
---
This RFC is just to get some feedback. Please ignore the function call
back into the architecture. It can be worked out properly later on. But
the question is: can we have virtio devices in the guest which would like
to use SWIOTLB based (or any custom DMA API based) bounce buffering without
actually being IOMMU devices emulated by QEMU/host, as has been the case
with the current VIRTIO_F_IOMMU_PLATFORM virtio flag?

 arch/powerpc/platforms/pseries/iommu.c | 6 ++
 drivers/virtio/virtio_ring.c   | 4 
 include/linux/virtio.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 06f02960b439..dd15fbddbe89 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
 __setup("multitce=", disable_multitce);
 
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+bool is_virtio_dma_platform(void)
+{
+   return true;
+}
+EXPORT_SYMBOL(is_virtio_dma_platform);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..9f205a79d378 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -144,6 +144,10 @@ struct vring_virtqueue {
 
 static bool vring_use_dma_api(struct virtio_device *vdev)
 {
+   /* Use DMA API even for virtio devices without an IOMMU */
+   if (is_virtio_dma_platform())
+   return true;
+
if (!virtio_has_iommu_quirk(vdev))
return true;
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 988c7355bc22..d8bb83d753ea 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct 
device_driver *drv)
 int register_virtio_driver(struct virtio_driver *drv);
 void unregister_virtio_driver(struct virtio_driver *drv);
 
+extern bool is_virtio_dma_platform(void);
+
 /* module_virtio_driver() - Helper macro for drivers that don't do
  * anything special in module init/exit.  This eliminates a lot of
  * boilerplate.  Each module may only use this macro once, and
-- 
2.14.1



[PATCH 6/6] powerpc/xive: standardise OPAL_BUSY delays

2018-04-05 Thread Nicholas Piggin
Convert to using the standard delay poll/delay form.

The XIVE driver:

- Did not previously loop on the OPAL_BUSY_EVENT case.
- Used a 1ms sleep.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/sysdev/xive/native.c | 193 ++
 1 file changed, 111 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index d22aeb0b69e1..682f79dabb4a 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -103,14 +103,18 @@ EXPORT_SYMBOL_GPL(xive_native_populate_irq_data);
 
 int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq)
 {
-   s64 rc;
+   s64 rc = OPAL_BUSY;
 
-   for (;;) {
+   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_xive_set_irq_config(hw_irq, target, prio, sw_irq);
-   if (rc != OPAL_BUSY)
-   break;
-   msleep(1);
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
+
return rc == 0 ? 0 : -ENXIO;
 }
 EXPORT_SYMBOL_GPL(xive_native_configure_irq);
@@ -159,12 +163,17 @@ int xive_native_configure_queue(u32 vp_id, struct xive_q 
*q, u8 prio,
}
 
/* Configure and enable the queue in HW */
-   for (;;) {
+   rc = OPAL_BUSY;
+   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_xive_set_queue_info(vp_id, prio, qpage_phys, order, 
flags);
-   if (rc != OPAL_BUSY)
-   break;
-   msleep(1);
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
+
if (rc) {
pr_err("Error %lld setting queue for prio %d\n", rc, prio);
rc = -EIO;
@@ -183,14 +192,17 @@ EXPORT_SYMBOL_GPL(xive_native_configure_queue);
 
 static void __xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio)
 {
-   s64 rc;
+   s64 rc = OPAL_BUSY;
 
/* Disable the queue in HW */
-   for (;;) {
+   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_xive_set_queue_info(vp_id, prio, 0, 0, 0);
-   if (rc != OPAL_BUSY)
-   break;
-   msleep(1);
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
if (rc)
pr_err("Error %lld disabling queue for prio %d\n", rc, prio);
@@ -240,7 +252,7 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 {
struct device_node *np;
unsigned int chip_id;
-   s64 irq;
+   s64 rc = OPAL_BUSY;
 
/* Find the chip ID */
np = of_get_cpu_node(cpu, NULL);
@@ -250,33 +262,39 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
}
 
/* Allocate an IPI and populate info about it */
-   for (;;) {
-   irq = opal_xive_allocate_irq(chip_id);
-   if (irq == OPAL_BUSY) {
-   msleep(1);
-   continue;
-   }
-   if (irq < 0) {
-   pr_err("Failed to allocate IPI on CPU %d\n", cpu);
-   return -ENXIO;
+   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
+   rc = opal_xive_allocate_irq(chip_id);
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
}
-   xc->hw_ipi = irq;
-   break;
}
+   if (rc < 0) {
+   pr_err("Failed to allocate IPI on CPU %d\n", cpu);
+   return -ENXIO;
+   }
+   xc->hw_ipi = rc;
+
return 0;
 }
 #endif /* CONFIG_SMP */
 
 u32 xive_native_alloc_irq(void)
 {
-   s64 rc;
+   s64 rc = OPAL_BUSY;
 
-   for (;;) {
+   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_xive_allocate_irq(OPAL_XIVE_ANY_CHIP);
-   if (rc != OPAL_BUSY)
-   break;
-   msleep(1);
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   

[PATCH 5/6] powerpc/powernv: OPAL dump support standardise OPAL_BUSY delays

2018-04-05 Thread Nicholas Piggin
Convert to using the standard delay poll/delay form.

The dump code:

- Did not previously delay or sleep in the OPAL_BUSY case.
- Used a 20ms sleep.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal-dump.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-dump.c 
b/arch/powerpc/platforms/powernv/opal-dump.c
index 0dc8fa4e0af2..603c4ffdb45c 100644
--- a/arch/powerpc/platforms/powernv/opal-dump.c
+++ b/arch/powerpc/platforms/powernv/opal-dump.c
@@ -264,8 +264,10 @@ static int64_t dump_read_data(struct dump_obj *dump)
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_dump_read(dump->id, addr);
if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   msleep(20);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
}
}
 
-- 
2.16.3



[PATCH 4/6] powerpc/powernv: OPAL NVRAM driver standardise OPAL_BUSY delays

2018-04-05 Thread Nicholas Piggin
Convert to using the standard delay poll/delay form.

The NVRAM driver:

- Did not previously delay or sleep in its OPAL_BUSY loop.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal-nvram.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-nvram.c 
b/arch/powerpc/platforms/powernv/opal-nvram.c
index 9db4398ded5d..732732bddc28 100644
--- a/arch/powerpc/platforms/powernv/opal-nvram.c
+++ b/arch/powerpc/platforms/powernv/opal-nvram.c
@@ -11,6 +11,7 @@
 
 #define DEBUG
 
+#include <linux/delay.h>
 #include 
 #include 
 #include 
@@ -56,8 +57,12 @@ static ssize_t opal_nvram_write(char *buf, size_t count, 
loff_t *index)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_write_nvram(__pa(buf), count, off);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
*index += count;
return count;
-- 
2.16.3



[PATCH 3/6] powerpc/powernv: OPAL platform standardise OPAL_BUSY loops

2018-04-05 Thread Nicholas Piggin
Convert to using the standard delay poll/delay form.

The platform code:

- Used delay when called from a schedule()able context.
- Did not previously delay or sleep in the OPAL_BUSY_EVENT case.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal.c  |  8 +---
 arch/powerpc/platforms/powernv/setup.c | 16 ++--
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index c15182765ff5..fb13bcabe609 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -896,10 +896,12 @@ void opal_shutdown(void)
 */
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_sync_host_reboot();
-   if (rc == OPAL_BUSY)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else
-   mdelay(10);
+   } else {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
 
/* Unregister memory dump region */
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 092715b9674b..6ea79d906784 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -187,10 +187,12 @@ static void  __noreturn pnv_restart(char *cmd)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_cec_reboot();
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else
-   mdelay(10);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
for (;;)
opal_poll_events(NULL);
@@ -204,10 +206,12 @@ static void __noreturn pnv_power_off(void)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_cec_power_down(0);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else
-   mdelay(10);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
for (;;)
opal_poll_events(NULL);
-- 
2.16.3



[PATCH 2/6] powerpc/powernv: OPAL RTC driver standardise OPAL_BUSY loops

2018-04-05 Thread Nicholas Piggin
Convert to using the standard delay poll/delay form.

The OPAL RTC driver:

- Did not previously delay or sleep in the OPAL_BUSY_EVENT case.
  There have been scheduling delays of up to 50 seconds observed here
  (BMC reboot can do it), which this should fix.

Cc: linux-...@vger.kernel.org
Signed-off-by: Nicholas Piggin 
---
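The shape of the retry handling after this patch, distilled from the
hunks below (a sketch rather than the literal diff): transient
OPAL_HARDWARE/OPAL_INTERNAL_ERROR results are retried a bounded number
of times by mapping them back to OPAL_BUSY:

s64 rc = OPAL_BUSY;
int retries = 10;

while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
	rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
	if (rc == OPAL_BUSY_EVENT) {
		msleep(OPAL_BUSY_DELAY_MS);
		opal_poll_events(NULL);
	} else if (rc == OPAL_BUSY) {
		msleep(10);
	} else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
		if (retries--) {
			msleep(10);	/* wait 10ms before retry */
			rc = OPAL_BUSY;	/* go around again */
		}
	}
}
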
 arch/powerpc/platforms/powernv/opal-rtc.c |  6 --
 drivers/rtc/rtc-opal.c| 33 ---
 2 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-rtc.c 
b/arch/powerpc/platforms/powernv/opal-rtc.c
index f8868864f373..f530cf62594d 100644
--- a/arch/powerpc/platforms/powernv/opal-rtc.c
+++ b/arch/powerpc/platforms/powernv/opal-rtc.c
@@ -48,10 +48,12 @@ unsigned long __init opal_get_boot_time(void)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   mdelay(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (rc == OPAL_BUSY)
+   } else if (rc == OPAL_BUSY) {
mdelay(10);
+   }
}
if (rc != OPAL_SUCCESS)
return 0;
diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
index 304e891e35fc..cddcc4749d39 100644
--- a/drivers/rtc/rtc-opal.c
+++ b/drivers/rtc/rtc-opal.c
@@ -57,7 +57,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 
*h_m_s_ms)
 
 static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm)
 {
-   long rc = OPAL_BUSY;
+   s64 rc = OPAL_BUSY;
int retries = 10;
u32 y_m_d;
u64 h_m_s_ms;
@@ -66,13 +66,17 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (retries-- && (rc == OPAL_HARDWARE
-  || rc == OPAL_INTERNAL_ERROR))
+   } else if (rc == OPAL_BUSY) {
msleep(10);
-   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
-   break;
+   } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
+   if (retries--) {
+   msleep(10); /* Wait 10ms before retry */
+   rc = OPAL_BUSY; /* go around again */
+   }
+   }
}
 
if (rc != OPAL_SUCCESS)
@@ -87,21 +91,26 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
 
 static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm)
 {
-   long rc = OPAL_BUSY;
+   s64 rc = OPAL_BUSY;
int retries = 10;
u32 y_m_d = 0;
u64 h_m_s_ms = 0;
 
	tm_to_opal(tm, &y_m_d, &h_m_s_ms);
+
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_write(y_m_d, h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (retries-- && (rc == OPAL_HARDWARE
-  || rc == OPAL_INTERNAL_ERROR))
+   } else if (rc == OPAL_BUSY) {
msleep(10);
-   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
-   break;
+   } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
+   if (retries--) {
+   msleep(10); /* Wait 10ms before retry */
+   rc = OPAL_BUSY; /* go around again */
+   }
+   }
}
 
return rc == OPAL_SUCCESS ? 0 : -EIO;
-- 
2.16.3



[PATCH 1/6] powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops

2018-04-05 Thread Nicholas Piggin
This is the start of an effort to tidy up and standardise all the
delays. Existing loops have a range of delay/sleep periods from 1ms
to 20ms, and some have no delay. They all loop forever except rtc,
which times out after 10 retries, and that uses 10ms delays. So use
10ms as our standard delay. The OPAL maintainer agrees 10ms is a
reasonable starting point.

The idea is to use the same recipe everywhere, once this is proven to
work then it will be documented as an OPAL API standard. Then both
firmware and OS can agree, and if a particular call needs something
else, then that can be documented with reasoning.

This is not the end-all of this effort, it's just a relatively easy
change that fixes some existing high latency delays. There should be
provision for standardising timeouts and/or interruptible loops where
possible, so non-fatal firmware errors don't cause hangs.

Signed-off-by: Nicholas Piggin 
---
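The recipe this series converges on looks like the following (a sketch;
opal_xxx_call() is a placeholder for whichever OPAL API is being
retried, not a real function):

s64 rc = OPAL_BUSY;

while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
	rc = opal_xxx_call();	/* placeholder for the real OPAL call */
	if (rc == OPAL_BUSY_EVENT) {
		msleep(OPAL_BUSY_DELAY_MS);
		opal_poll_events(NULL);
	} else if (rc == OPAL_BUSY) {
		msleep(OPAL_BUSY_DELAY_MS);
	}
}
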
 arch/powerpc/include/asm/opal.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 12e70fb58700..fcf3ed5b8b18 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -21,6 +21,9 @@
 /* We calculate number of sg entries based on PAGE_SIZE */
 #define SG_ENTRIES_PER_NODE ((PAGE_SIZE - 16) / sizeof(struct opal_sg_entry))
 
+/* Default time to sleep or delay between OPAL_BUSY/OPAL_BUSY_EVENT loops */
+#define OPAL_BUSY_DELAY_MS 10
+
 /* /sys/firmware/opal */
 extern struct kobject *opal_kobj;
 
-- 
2.16.3



[PATCH 0/6] first step of standardising OPAL_BUSY handling

2018-04-05 Thread Nicholas Piggin
Patch 1 explains most of the reasoning.
Patch 1+2 and possibly 4 (just that we've seen a bug caused by the
RTC driver but not yet one caused by NVRAM) could be backported as
bugfixes, in most other cases the changes are inconsequential or
unlikely to be a problem.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/powernv: define a standard delay for OPAL_BUSY type retry
loops
  powerpc/powernv: OPAL RTC driver standardise OPAL_BUSY loops
  powerpc/powernv: OPAL platform standardise OPAL_BUSY loops
  powerpc/powernv: OPAL NVRAM driver standardise OPAL_BUSY delays
  powerpc/powernv: OPAL dump support standardise OPAL_BUSY delays
  powerpc/xive: standardise OPAL_BUSY delays

 arch/powerpc/include/asm/opal.h |   3 +
 arch/powerpc/platforms/powernv/opal-dump.c  |   4 +-
 arch/powerpc/platforms/powernv/opal-nvram.c |   7 +-
 arch/powerpc/platforms/powernv/opal-rtc.c   |   6 +-
 arch/powerpc/platforms/powernv/opal.c   |   8 +-
 arch/powerpc/platforms/powernv/setup.c  |  16 ++-
 arch/powerpc/sysdev/xive/native.c   | 193 
 drivers/rtc/rtc-opal.c  |  33 +++--
 8 files changed, 163 insertions(+), 107 deletions(-)

-- 
2.16.3



[PATCH v2 3/3] powerpc/mce: Handle memcpy_mcsafe

2018-04-05 Thread Balbir Singh
Add a blocking notifier callback to be called in real mode
on machine check exceptions for UE (ld/st) errors only.
The patch registers a callback on boot to be notified
of machine check exceptions and returns NOTIFY_STOP when
a page of interest is seen as the source of the machine
check exception. For now this page of interest must be a
ZONE_DEVICE page; hence for memcpy_mcsafe to work, the page
needs to belong to ZONE_DEVICE and memcpy_mcsafe should be
used to access the memory.

The patch also modifies the NIP of the exception context
to go back to the fixup handler (in memcpy_mcsafe) and does
not print any error message, as the error is reported via
a return value and handled.

Signed-off-by: Balbir Singh 
---
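A hypothetical consumer of the notifier interface might look like this
(the names here are illustrative, not from the patch):

/* Decide whether a UE at physical address paddr hits a page we can
 * recover from; NOTIFY_STOP tells the MCE code to use the fixup path. */
static int example_mce_cb(struct notifier_block *nb, unsigned long paddr,
			  void *data)
{
	return example_page_recoverable(paddr >> PAGE_SHIFT) ?
		NOTIFY_STOP : NOTIFY_DONE;
}

static struct notifier_block example_mce_nb = {
	.notifier_call = example_mce_cb,
};

/* at init time: */
register_mce_notifier(&example_mce_nb);
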
 arch/powerpc/include/asm/mce.h |  3 +-
 arch/powerpc/kernel/mce.c  | 77 --
 2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 3a1226e9b465..a76638e3e47e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -125,7 +125,8 @@ struct machine_check_event {
enum MCE_UeErrorType ue_error_type:8;
uint8_t effective_address_provided;
uint8_t physical_address_provided;
-   uint8_t reserved_1[5];
+   uint8_t error_return;
+   uint8_t reserved_1[4];
uint64_teffective_address;
uint64_tphysical_address;
uint8_t reserved_2[8];
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..b9e4881fa8c5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,7 +28,9 @@
 #include 
 #include 
 #include 
+#include 
 
+#include 
 #include 
 #include 
 
@@ -54,6 +56,52 @@ static struct irq_work mce_event_process_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int register_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_mce_notifier);
+
+int unregister_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mce_notifier);
+
+
+static int check_memcpy_mcsafe(struct notifier_block *nb,
+   unsigned long val, void *data)
+{
+   /*
+* val contains the physical_address of the bad address
+*/
+   unsigned long pfn = val >> PAGE_SHIFT;
+   struct page *page = realmode_pfn_to_page(pfn);
+   int rc = NOTIFY_DONE;
+
+   if (!page)
+   goto out;
+
+   if (is_zone_device_page(page))  /* for HMM and PMEM */
+   rc = NOTIFY_STOP;
+out:
+   return rc;
+}
+
+struct notifier_block memcpy_mcsafe_nb = {
+   .priority = 0,
+   .notifier_call = check_memcpy_mcsafe,
+};
+
+int  mce_mcsafe_register(void)
+{
+   register_mce_notifier(&memcpy_mcsafe_nb);
+   return 0;
+}
+arch_initcall(mce_mcsafe_register);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -151,9 +199,31 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->u.ue_error.effective_address_provided = true;
mce->u.ue_error.effective_address = addr;
if (phys_addr != ULONG_MAX) {
+   int rc;
+   const struct exception_table_entry *entry;
+
+   /*
+* Once we have the physical address, we check to
+* see if the current nip has a fixup entry.
+* Having a fixup entry plus the notifier stating
+* that it can handle the exception is an indication
+* that we should return to the fixup entry and
+* return an error from there
+*/
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
-   machine_check_ue_event(mce);
+
+   rc = blocking_notifier_call_chain(&mce_notifier_list,
+   phys_addr, NULL);
+   if (rc & NOTIFY_STOP_MASK) {
+   entry = search_exception_tables(regs->nip);
+   if (entry != NULL) {
+   mce->u.ue_error.error_return = 1;
+   regs->nip = extable_fixup(entry);
+   } else
+   

[PATCH v2 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Balbir Singh
The pmem infrastructure uses memcpy_mcsafe in the pmem
layer so as to convert machine check exceptions into
a return value on failure in case a machine check
exception is encountered during the memcpy.

This patch largely borrows from the copyuser_power7
logic and does not add the VMX optimizations, largely
to keep the patch simple. If needed those optimizations
can be folded in.

Signed-off-by: Balbir Singh 
Acked-by: Nicholas Piggin 
---
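A caller sketch (an assumption about intended use, mirroring how pmem
uses the x86 variant; not part of this patch):

/* memcpy_mcsafe() returns 0 on success, or -EFAULT if a machine
 * check is taken during the copy. */
if (memcpy_mcsafe(dst, src, len))
	return -EIO;	/* surface the media error instead of crashing */
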
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 212 
 3 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9b8cedf618f4..b7e872a64726 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -30,7 +30,9 @@ extern void * memcpy_flushcache(void *,const void 
*,__kernel_size_t);
 #ifdef CONFIG_PPC64
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
+#define __HAVE_ARCH_MEMCPY_MCSAFE
 
+extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 3c29c9009bbf..048afee9f518 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -24,7 +24,7 @@ endif
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
   copyuser_power7.o string_64.o copypage_power7.o memcpy_power7.o \
-  memcpy_64.o memcmp_64.o pmem.o
+  memcpy_64.o memcmp_64.o pmem.o memcpy_mcsafe_64.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
new file mode 100644
index ..e7eaa9b6cded
--- /dev/null
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard 
+ * Author - Balbir Singh 
+ */
+#include 
+#include 
+
+   .macro err1
+100:
+   EX_TABLE(100b,.Ldo_err1)
+   .endm
+
+   .macro err2
+200:
+   EX_TABLE(200b,.Ldo_err2)
+   .endm
+
+.Ldo_err2:
+   ld  r22,STK_REG(R22)(r1)
+   ld  r21,STK_REG(R21)(r1)
+   ld  r20,STK_REG(R20)(r1)
+   ld  r19,STK_REG(R19)(r1)
+   ld  r18,STK_REG(R18)(r1)
+   ld  r17,STK_REG(R17)(r1)
+   ld  r16,STK_REG(R16)(r1)
+   ld  r15,STK_REG(R15)(r1)
+   ld  r14,STK_REG(R14)(r1)
+   addir1,r1,STACKFRAMESIZE
+.Ldo_err1:
+   li  r3,-EFAULT
+   blr
+
+
+_GLOBAL(memcpy_mcsafe)
+   cmpldi  r5,16
+   blt .Lshort_copy
+
+.Lcopy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+err1;  lbz r0,0(r4)
+   addir4,r4,1
+err1;  stb r0,0(r3)
+   addir3,r3,1
+
+1: bf  cr7*4+2,2f
+err1;  lhz r0,0(r4)
+   addir4,r4,2
+err1;  sth r0,0(r3)
+   addir3,r3,2
+
+2: bf  cr7*4+1,3f
+err1;  lwz r0,0(r4)
+   addir4,r4,4
+err1;  stw r0,0(r3)
+   addir3,r3,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflrr0
+   stdur1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(R14)(r1)
+   std r15,STK_REG(R15)(r1)
+   std r16,STK_REG(R16)(r1)
+   std r17,STK_REG(R17)(r1)
+   std r18,STK_REG(R18)(r1)
+   std r19,STK_REG(R19)(r1)
+   std r20,STK_REG(R20)(r1)
+   std r21,STK_REG(R21)(r1)
+   std r22,STK_REG(R22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdir6,r5,7
+   mtctr   r6
+
+   /* Now do cacheline (128B) sized loads and stores. */
+   .align  5
+4:
+err2;  ld  r0,0(r4)
+err2;  ld  r6,8(r4)
+err2;  ld  r7,16(r4)
+err2;  ld  r8,24(r4)
+err2;  ld  r9,32(r4)
+err2;  ld  r10,40(r4)
+err2;  ld  r11,48(r4)
+err2;  ld  r12,56(r4)
+err2;  ld  r14,64(r4)
+err2;  ld  r15,72(r4)
+err2;  ld  r16,80(r4)
+err2;  ld  r17,88(r4)
+err2;  ld  r18,96(r4)
+err2;  ld  r19,104(r4)
+err2;  ld  r20,112(r4)
+err2;  ld  r21,120(r4)
+   addir4,r4,128
+err2;  std r0,0(r3)
+err2;  std r6,8(r3)
+err2;  std r7,16(r3)
+err2;  std r8,24(r3)
+err2;  std r9,32(r3)
+err2;  std r10,40(r3)
+err2;  std r11,48(r3)
+err2;  std r12,56(r3)
+err2;  std r14,64(r3)
+err2;  std r15,72(r3)
+err2;  std r16,80(r3)
+err2;  std 

[PATCH v2 1/3] powerpc/mce: Bug fixes for MCE handling in kernel space

2018-04-05 Thread Balbir Singh
The code currently assumes PAGE_SHIFT as the shift value of
the pfn. This works correctly (mostly) for user space pages,
but the correct thing to do is:

1. Extract the shift value returned via the pte-walk APIs.
2. Use that shift value to access the instruction address.

Note that the final physical address still uses PAGE_SHIFT for
computation. handle_ierror() is not modified, and handle_derror()
is modified just to extract the correct instruction
address.

This is largely due to __find_linux_pte() returning pfns
shifted by pdshift. The code is now more generic and can
handle any shift value returned.

Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")

Signed-off-by: Balbir Singh 
---
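A worked example of the shift arithmetic (hypothetical numbers):

/*
 * With a 16MB hugepage mapping, __find_linux_pte() returns shift = 24.
 * For nip = 0x01234567 and pfn = 0x10 (in 16MB units):
 *
 *   instr_addr = (pfn << shift) + (nip & ((1 << shift) - 1))
 *              = (0x10 << 24)   + (0x01234567 & 0x00ffffff)
 *              = 0x10000000     + 0x00234567
 *              = 0x10234567
 *
 * Using PAGE_SHIFT (12) here would combine a hugepage pfn with only
 * the low 12 offset bits and compute the wrong instruction address.
 */
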
 arch/powerpc/kernel/mce_power.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index fe6fc63251fe..bd9754def479 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -36,7 +36,8 @@
  * Convert an address related to an mm to a PFN. NOTE: we are in real
  * mode, we could potentially race with page table updates.
  */
-static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
+static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr,
+   unsigned int *shift)
 {
pte_t *ptep;
unsigned long flags;
@@ -49,13 +50,15 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, 
unsigned long addr)
 
local_irq_save(flags);
if (mm == current->mm)
-   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+   ptep = find_current_mm_pte(mm->pgd, addr, NULL, shift);
else
-   ptep = find_init_mm_pte(addr, NULL);
+   ptep = find_init_mm_pte(addr, shift);
local_irq_restore(flags);
if (!ptep || pte_special(*ptep))
return ULONG_MAX;
-   return pte_pfn(*ptep);
+   if (!*shift)
+   *shift = PAGE_SHIFT;
+   return (pte_val(*ptep) & PTE_RPN_MASK) >> *shift;
 }
 
 /* flush SLBs and reload */
@@ -353,15 +356,16 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs 
*regs, uint64_t *addr,
unsigned long pfn, instr_addr;
struct instruction_op op;
struct pt_regs tmp = *regs;
+   unsigned int shift;
 
-   pfn = addr_to_pfn(regs, regs->nip);
+   pfn = addr_to_pfn(regs, regs->nip, &shift);
if (pfn != ULONG_MAX) {
-   instr_addr = (pfn << PAGE_SHIFT) + (regs->nip & ~PAGE_MASK);
+   instr_addr = (pfn << shift) + (regs->nip & ((1 << shift) - 1));
instr = *(unsigned int *)(instr_addr);
		if (!analyse_instr(&op, &tmp, instr)) {
-   pfn = addr_to_pfn(regs, op.ea);
+   pfn = addr_to_pfn(regs, op.ea, &shift);
*addr = op.ea;
-   *phys_addr = (pfn << PAGE_SHIFT);
+   *phys_addr = (pfn << shift);
return 0;
}
/*
@@ -435,12 +439,14 @@ static int mce_handle_ierror(struct pt_regs *regs,
if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
table[i].error_type == MCE_ERROR_TYPE_UE) {
unsigned long pfn;
+   unsigned int shift;
 
if (get_paca()->in_mce < MAX_MCE_DEPTH) {
-   pfn = addr_to_pfn(regs, regs->nip);
+   pfn = addr_to_pfn(regs, regs->nip,
+   &shift);
if (pfn != ULONG_MAX) {
*phys_addr =
-   (pfn << PAGE_SHIFT);
+   (pfn << shift);
handled = 1;
}
}
-- 
2.13.6



[PATCH v2 0/3] Add support for memcpy_mcsafe

2018-04-05 Thread Balbir Singh
memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
errors encountered while doing a memcpy (machine check exceptions) into a
return value. This patchset consists of three patches:

1. The first patch is a bug fix to handle machine check errors correctly
while walking the page tables in kernel mode, due to huge pmd/pud sizes
2. The second patch adds memcpy_mcsafe() support, this is largely derived
from existing code
3. The third patch registers for callbacks on machine check exceptions and
in them uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.

Testing

A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.

Changelog v2
 - Fix the logic of shifting in addr_to_pfn
 - Use shift consistently instead of PAGE_SHIFT
 - Fix a typo in patch1

Balbir Singh (3):
  powerpc/mce: Bug fixes for MCE handling in kernel space
  powerpc/memcpy: Add memcpy_mcsafe for pmem
  powerpc/mce: Handle memcpy_mcsafe

 arch/powerpc/include/asm/mce.h  |   3 +-
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/kernel/mce.c   |  77 -
 arch/powerpc/kernel/mce_power.c |  26 +++--
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 212 
 6 files changed, 308 insertions(+), 14 deletions(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

-- 
2.13.6



Re: [PATCH v2 03/19] powerpc: Mark variables as unused

2018-04-05 Thread LEROY Christophe

Michael Ellerman  a écrit :


LEROY Christophe  writes:


Mathieu Malaterre  a écrit :


Add gcc attribute unused for two variables. Fix warnings treated as errors
with W=1:

  arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set
but not used [-Werror=unused-but-set-variable]

Suggested-by: Christophe Leroy 
Signed-off-by: Mathieu Malaterre 
---
v2: move path within ifdef DEBUG_PROM

 arch/powerpc/kernel/prom_init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c
b/arch/powerpc/kernel/prom_init.c
index acf4b2e0530c..4163b11abb6c 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -603,7 +603,7 @@ static void __init early_cmdline_parse(void)
const char *opt;

char *p;
-   int l = 0;
+   int l __maybe_unused = 0;

prom_cmd_line[0] = 0;
p = prom_cmd_line;
@@ -1385,7 +1385,7 @@ static void __init reserve_mem(u64 base, u64 size)
 static void __init prom_init_mem(void)
 {
phandle node;
-   char *path, type[64];
+   char *path __maybe_unused, type[64];


You should enclose that in an ifdef DEBUG_PROM instead of hiding the warning


I disagree, the result is horrible:

 static void __init prom_init_mem(void)
 {
phandle node;
-   char *path, type[64];
+#ifdef DEBUG_PROM
+   char *path;
+#endif
+   char type[64];
unsigned int plen;
cell_t *p, *endp;
__be32 val;


The right fix is to move the debug logic into a helper, and put the path
in there, eg. something like (not tested):

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f9d6befb55a6..b02fa2ccc70b 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1389,6 +1389,18 @@ static void __init reserve_mem(u64 base, u64 size)
mem_reserve_cnt = cnt + 1;
 }

+#ifdef DEBUG_PROM
+static void prom_debug_path(phandle node)
+{
+   char *path;
+   path = prom_scratch;
+   memset(path, 0, PROM_SCRATCH_SIZE);
+   call_prom("package-to-path", 3, 1, node, path, PROM_SCRATCH_SIZE-1);
+   prom_debug("  node %s :\n", path);
+}
+#else
+static void prom_debug_path(phandle node) { }


Or put the ifdef inside the function to avoid the double definition?


+#endif /* DEBUG_PROM */
 /*
  * Initialize memory allocation mechanism, parse "memory" nodes and
  * obtain that way the top of memory and RMO to setup out local allocator
@@ -1441,11 +1453,7 @@ static void __init prom_init_mem(void)
p = regbuf;
endp = p + (plen / sizeof(cell_t));

-#ifdef DEBUG_PROM
-   memset(path, 0, PROM_SCRATCH_SIZE);
-   call_prom("package-to-path", 3, 1, node, path, 
PROM_SCRATCH_SIZE-1);
-   prom_debug("  node %s :\n", path);
-#endif /* DEBUG_PROM */
+   prom_debug_path(node);

while ((endp - p) >= (rac + rsc)) {
unsigned long base, size;


Although that also begs the question of why the hell do we need path at
all, and not just use prom_scratch directly?


Wondering the same, why not use prom_scratch directly

Christophe



cheers





Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Nicholas Piggin
On Thu, 5 Apr 2018 15:53:07 +1000
Balbir Singh  wrote:

> On Thu, 5 Apr 2018 15:04:05 +1000
> Nicholas Piggin  wrote:
> 
> > On Wed, 4 Apr 2018 20:00:52 -0700
> > Dan Williams  wrote:
> >   
> > > [ adding Matthew, Christoph, and Tony  ]
> > > 
> > > On Wed, Apr 4, 2018 at 4:57 PM, Nicholas Piggin  
> > > wrote:
> > > > On Thu,  5 Apr 2018 09:19:42 +1000
> > > > Balbir Singh  wrote:
> > > >  
> > > >> The pmem infrastructure uses memcpy_mcsafe in the pmem
> > > >> layer so as to convert machine check exceptions into
> > > >> a return value on failure in case a machine check
> > > >> exception is encountered during the memcpy.
> > > >>
> > > >> This patch largely borrows from the copyuser_power7
> > > >> logic and does not add the VMX optimizations, largely
> > > >> to keep the patch simple. If needed those optimizations
> > > >> can be folded in.  
> > > >
> > > > So memcpy_mcsafe doesn't return number of bytes copied?
> > > > Huh, well that makes it simple.  
> > > 
> > > Well, not in current kernels, but we need to add that support or
> > > remove the direct call to copy_to_iter() in fs/dax.c. I'm looking
> > > right now to add "bytes remaining" support to the x86 memcpy_mcsafe(),
> > > but for copy_to_user we also need to handle bytes remaining for write
> > > faults. That fix is hopefully something that can land in an early
> > > 4.17-rc, but it won't be ready for -rc1.
> > 
> > I wonder if the powerpc implementation should just go straight to
> > counting bytes. Backporting to this interface would be trivial, but
> > it would just mean there's only one variant of the code to support.
> > That's up to Balbir though.
> >   
> 
> I'm thinking about it. I wonder what "bytes remaining" means for pmem
> in the context of a machine check exception. Also, do we want to be byte
> accurate or cache-line accurate for the bytes remaining? The former is much
> easier than the latter :)

The ideal would be a linear measure of how much of your copy reached
(or can reach) non-volatile storage with nothing further copied. You
may have to allow for some relaxing of the semantics depending on
what the architecture can support.

What's the problem with just counting bytes copied like usercopy --
why is that harder than cacheline accuracy?
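
For comparison, the "counting" variant under discussion would look
something like this (a hypothetical signature, not in any tree at this
point):

/* Returns 0 on success, or the number of bytes remaining (not
 * copied) when a machine check interrupts the copy. */
unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt);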

> I'd rather implement the existing interface and port/support the new interface
> as it becomes available

Fair enough.

Thanks,
Nick


[PATCH] powerpc/64s/idle: POWER9 restore AMOR after deep sleep

2018-04-05 Thread Nicholas Piggin
POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
because it does not go through the subcore restore.

Have POWER9 restore it in core restore.

Cc: Vaidyanathan Srinivasan 
Signed-off-by: Nicholas Piggin 
---

Do we need this guy after waking from deep sleep?

This code is a little messy at the moment; it can be a bit tricky to
see exactly what we've restored. I'm doing a bit of work to tidy it
up and make it clearer, but that's not going to make 4.17 or backports.

 arch/powerpc/kernel/idle_book3s.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index bc4e391d031e..e72e385a4973 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -857,6 +857,8 @@ BEGIN_FTR_SECTION
mtspr   SPRN_PTCR,r4
ld  r4,_RPR(r1)
mtspr   SPRN_RPR,r4
+   ld  r4,_AMOR(r1)
+   mtspr   SPRN_AMOR,r4
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 
ld  r4,_TSCR(r1)
-- 
2.16.3



[PATCH 2/2] powerpc/64s: Fix POWER9 DD2.2 and above in cputable features

2018-04-05 Thread Nicholas Piggin
The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
above (which is what the dt_cpu_ftrs setup does). Fix cputable for
DD2.2 to match.

This came about due to patches b5af4f279323 ("powerpc: Add CPU feature
bits for TM bug workarounds on POWER9 v2.2"), and 9e9626ed3a4a
("powerpc/64s: Fix POWER9 DD2.2 and above in DT CPU features") being
in-flight at once. The latter patch fixed dt_cpu_ftrs like this one
does. The former changed cputable to match dt_cpu_ftrs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cputable.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 4e332f3531c5..931dda8be87c 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -467,7 +467,8 @@ static inline void cpu_feature_keys_init(void) { }
 (~CPU_FTR_SAO))
 #define CPU_FTRS_POWER9_DD2_0 CPU_FTRS_POWER9
 #define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1)
-#define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_P9_TM_HV_ASSIST | \
+#define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD2_1 | \
+  CPU_FTR_P9_TM_HV_ASSIST | \
   CPU_FTR_P9_TM_XER_SO_BUG)
 #define CPU_FTRS_CELL  (CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-- 
2.16.3



[PATCH 1/2] powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit

2018-04-05 Thread Nicholas Piggin
The pkey code added a CPU_FTR_PKEY bit, but did not add it to the
dt_cpu_ftrs feature set. Although the capability is supported by all
processors in the base dt_cpu_ftrs set for 64s, it is a significant
and sufficiently well defined feature to make it optional. So add
it as a quirk for now, which can be versioned out and then controlled
by the firmware (once dt_cpu_ftrs gains versioning support).

Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: Ram Pai 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index ed7605d8fd2d..e88fbb1fdb8f 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -729,6 +729,13 @@ static __init void cpufeatures_cpu_quirks(void)
cur_cpu_spec->cpu_features &= ~(CPU_FTR_DAWR);
cur_cpu_spec->cpu_features |= CPU_FTR_P9_TLBIE_BUG;
}
+
+   /*
+* PKEY was not in the initial base or feature node
+* specification, but it should become optional in the next
+* cpu feature version sequence.
+*/
+   cur_cpu_spec->cpu_features |= CPU_FTR_PKEY;
 }
 
 static void __init cpufeatures_setup_finished(void)
-- 
2.16.3



[PATCH 0/2] a couple of cpu ftrs fixes

2018-04-05 Thread Nicholas Piggin
These are a couple of differences between cputable and dt_cpu_ftrs
I noticed with CPU_FTR bits. We have to be a bit careful now to keep
them in sync when we change one or the other.

Nicholas Piggin (2):
  powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
  powerpc/64s: Fix POWER9 DD2.2 and above in cputable features

 arch/powerpc/include/asm/cputable.h | 3 ++-
 arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

-- 
2.16.3