Re: [v2 PATCH 4/4] powernv: Recover correct PACA on wakeup from a stop on P9 DD1
On Tue, Mar 21, 2017 at 02:59:46AM +1000, Nicholas Piggin wrote:
> On Mon, 20 Mar 2017 21:24:18 +0530
> "Gautham R. Shenoy" wrote:
>
> > From: "Gautham R. Shenoy"
> >
> > POWER9 DD1.0 hardware has an issue due to which the SPRs of a thread
> > waking up from stop 0,1,2 with ESL=1 can end up being misplaced in the
> > core. Thus the HSPRG0 of a thread waking up from stop can contain the
> > paca pointer of its sibling.
> >
> > This patch implements a context recovery framework within threads of a
> > core, by provisioning space in paca_struct for saving every sibling
> > thread's paca pointer. Basically, we should be able to arrive at the
> > right paca pointer from any of the threads' existing paca pointers.
> >
> > At bootup, during powernv idle-init, we save the paca address of every
> > CPU in each of its siblings' paca_struct, in the slot corresponding to
> > this CPU's index in the core.
> >
> > On wakeup from a stop, the thread will determine its index in the core
> > from the lower 2 bits of the PIR register and recover its PACA pointer
> > by indexing into the correct slot in the provisioned space in the
> > current PACA.
> >
> > Furthermore, ensure that the NVGPRs are restored from the stack on the
> > way out by setting the NAPSTATELOST bit in the paca.
>
> Thanks for expanding on this, it makes the patch easier to follow :)
>
> As noted before, I think if we use PACA_EXNMI for system reset, then
> *hopefully* there should be minimal races with the initial use of the
> other thread's PACA at the start of the exception. So I'll work on
> getting that in, but it need not prevent this patch from being merged
> first IMO.
>
> > [Changelog written with inputs from sva...@linux.vnet.ibm.com]
> >
> > Signed-off-by: Gautham R. Shenoy
> > ---
> >  arch/powerpc/include/asm/paca.h       |  5 +
> >  arch/powerpc/kernel/asm-offsets.c     |  1 +
> >  arch/powerpc/kernel/idle_book3s.S     | 49 ++-
> >  arch/powerpc/platforms/powernv/idle.c | 22 +
> >  4 files changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> > index 708c3e5..4405630 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -172,6 +172,11 @@ struct paca_struct {
> >  	u8 thread_mask;
> >  	/* Mask to denote subcore sibling threads */
> >  	u8 subcore_sibling_mask;
> > +	/*
> > +	 * Pointer to an array which contains pointer
> > +	 * to the sibling threads' paca.
> > +	 */
> > +	struct paca_struct *thread_sibling_pacas[8];
>
> Is 8 the right number? I wonder if we have a define for it.

That's the maximum number of threads per core that we have had on POWER
so far. Perhaps I can make this

	struct paca_struct **thread_sibling_pacas;

and allocate threads_per_core number of slots in pnv_init_idle_states.
Sounds OK?
> >  #endif
> >
> >  #ifdef CONFIG_PPC_BOOK3S_64
> > diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> > index 4367e7d..6ec5016 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -727,6 +727,7 @@ int main(void)
> >  	OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
> >  	OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
> >  	OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
> > +	OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
> >  #endif
> >
> >  	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
> > index 9957287..c2f2cfb 100644
> > --- a/arch/powerpc/kernel/idle_book3s.S
> > +++ b/arch/powerpc/kernel/idle_book3s.S
> > @@ -375,6 +375,46 @@ _GLOBAL(power9_idle_stop)
> >  	li	r4,1
> >  	b	pnv_powersave_common
> >  	/* No return */
> > +
> > +
> > +/*
> > + * On waking up from stop 0,1,2 with ESL=1 on POWER9 DD1,
> > + * HSPRG0 will be set to the HSPRG0 value of one of the
> > + * threads in this core. Thus the value we have in r13
> > + * may not be this thread's paca pointer.
> > + *
> > + * Fortunately, the PIR remains invariant. Since this thread's
> > + * paca pointer is recorded in all its siblings' pacas, we can
> > + * correctly recover this thread's paca pointer if we
> > + * know the index of this thread in the core.
> > + *
> > + * This index can be obtained from the lower two bits of the PIR.
> > + *
> > + * i.e, thread's position in the core = PIR[62:63].
> > + * If this value is i, then this thread's paca is
> > + * paca->thread_sibling_pacas[i].
> > + */
> > +power9_dd1_recover_paca:
> > +	mfspr	r4, SPRN_PIR
> > +	clrldi	r4, r4, 62
>
> Does SPRN_TIR work?

I wasn't aware of SPRN_TIR! I can check this. If my reading of the ISA is
correct, TIR should contain the thread number, which is in the range [0..3].

> > +	/*
> > +	 * Since each entry in
Re: [v2 PATCH 3/4] powernv:idle: Don't override default/deepest directly in kernel
Hi,

On Tue, Mar 21, 2017 at 02:39:34AM +1000, Nicholas Piggin wrote:
> > @@ -241,8 +240,9 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
> >   * The default stop state that will be used by ppc_md.power_save
> >   * function on platforms that support stop instruction.
> >   */
> > -u64 pnv_default_stop_val;
> > -u64 pnv_default_stop_mask;
> > +static u64 pnv_default_stop_val;
> > +static u64 pnv_default_stop_mask;
> > +static bool default_stop_found;
> >
> >  /*
> >   * Used for ppc_md.power_save which needs a function with no parameters
> > @@ -262,8 +262,9 @@ static void power9_idle(void)
> >   * psscr value and mask of the deepest stop idle state.
> >   * Used when a cpu is offlined.
> >   */
> > -u64 pnv_deepest_stop_psscr_val;
> > -u64 pnv_deepest_stop_psscr_mask;
> > +static u64 pnv_deepest_stop_psscr_val;
> > +static u64 pnv_deepest_stop_psscr_mask;
> > +static bool deepest_stop_found;
>
> Aha, you have made them static. Nitpick withdrawn :)
>
> The log messages look good now.
>
> Reviewed-by: Nicholas Piggin

Thanks!

--
Thanks and Regards
gautham.
Re: [v2 PATCH 1/4] powernv: Move CPU-Offline idle state invocation from smp.c to idle.c
Hi Nick,

On Tue, Mar 21, 2017 at 02:35:17AM +1000, Nicholas Piggin wrote:
> On Mon, 20 Mar 2017 21:24:15 +0530
> "Gautham R. Shenoy" wrote:
>
> > From: "Gautham R. Shenoy"
> >
> > Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
> > transitions the CPU to the deepest available platform idle state to a
> > new function named pnv_cpu_offline() in powernv/idle.c. The rationale
> > behind this code movement is that the data required to determine the
> > deepest available platform state resides in powernv/idle.c.
> >
> > Signed-off-by: Gautham R. Shenoy
>
> Looks good. As a nitpick, possibly one or two variables in idle.c could
> become static (pnv_deepest_stop_psscr_*).

I changed that, but it ended up being in the next patch.

> Reviewed-by: Nicholas Piggin

Thanks!

--
Thanks and Regards
gautham.
Re: [v6] powerpc/powernv: add hdat attribute to sysfs
On 22/03/17 10:53, Matt Brown wrote:

The HDAT data area is consumed by skiboot and turned into a device-tree.
In some cases we would like to look directly at the HDAT, so this patch
adds a sysfs node to allow it to be viewed. This is not possible through
/dev/mem as it is reserved memory which is stopped by the /dev/mem filter.

Signed-off-by: Matt Brown
---
Changelog
v6
	- attribute names are stored locally, removing potential null
	  pointer errors
	- added of_node_put for the corresponding of_find_node
	- folded exports node creation into opal_export_attr()
	- fixed kzalloc flags to GFP_KERNEL
	- fixed struct array indexing
	- fixed error message
---
 arch/powerpc/platforms/powernv/opal.c | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..953537e 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -604,6 +604,87 @@ static void opal_export_symmap(void)
 		pr_warn("Error %d creating OPAL symbols file\n", rc);
 }

+static ssize_t export_attr_read(struct file *fp, struct kobject *kobj,
+				struct bin_attribute *bin_attr, char *buf,
+				loff_t off, size_t count)
+{
+	return memory_read_from_buffer(buf, count, &off, bin_attr->private,
+				       bin_attr->size);
+}
+
+static struct bin_attribute *exported_attrs;
+static char **attr_name;

Can these be moved inside opal_export_attrs()?

--
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited
[PATCH kernel v11 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM in book3s-64 and not to have IOMMU
bits for PCI pass-through support, as it costs little and allows VFIO to
function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those ifdefs
we could accelerate only userspace-emulated devices (but not VFIO), which
does not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
--
2.11.0
[PATCH kernel v11 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
So far iommu_table objects were only used in virtual mode and had a single
owner. We are going to change this by implementing in-kernel acceleration
of DMA mapping requests. The proposed acceleration will handle requests in
real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it. This
replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes the @node_name parameter
as it has never been really useful on powernv, and carrying it for the
pseries platform code to iommu_free_table() seems to be quite useless as
well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
Changes:
v10:
* iommu_tce_table_get() can fail now if a table is being destroyed,
  will be used in 10/10
* iommu_tce_table_put() returns what kref_put() returned
* iommu_tce_table_put() got WARN_ON(!tbl) as the callers already check
  for it and do not call _put() when tbl==NULL

v9:
* s/iommu_table_get/iommu_tce_table_get/ and
  s/iommu_table_put/iommu_tce_table_put/ -- so I removed r-b/a-b
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 27 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..d96142572e6d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref	it_kref;
 };

 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)

 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);

-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl);
+extern int iommu_tce_table_put(struct iommu_table *tbl);

 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..af915da5e03a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }

-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;

-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);

 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)

 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);

 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,24 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl)
+{
+	if (kref_get_unless_zero(&tbl->it_kref))
+		return tbl;
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_table_get);
+
+int iommu_tce_table_put(struct iommu_table *tbl)
+{
+	if (WARN_ON(!tbl))
+		return 0;
+
+	return kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_tce_table_put);

 /* Creates TCEs for a user provided buffer. The user buffer must be
  * contiguous real kernel storage (not vmalloc). The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5dae54cb11e3..ee4cdb5b893f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1424,7 +1424,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->tabl
[PATCH kernel v11 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
At the moment iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

As from now on iommu_free_table() calls it_ops->free(), we need to have
it_ops initialized before calling iommu_free_table(), so this moves this
initialization into pnv_pci_ioda2_create_table().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
Acked-by: Alex Williamson
---
Changes:
v5:
* moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
  the commit log
---
 arch/powerpc/kernel/iommu.c               |  4 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++--------
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9bace5df05d5..bc142d87130f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;

+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
 	}
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);

 /* Creates TCEs for a user provided buffer. The user buffer must be
  * contiguous real kernel storage (not vmalloc). The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 572e9c9f1ea0..5dae54cb11e3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1424,7 +1424,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	pnv_pci_ioda2_table_free_pages(tbl);
 	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }

@@ -2040,7 +2039,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
 	pnv_pci_ioda2_table_free_pages(tbl);
-	iommu_free_table(tbl, "pnv");
 }

 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2317,6 +2315,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 	if (!tbl)
 		return -ENOMEM;

+	tbl->it_ops = &pnv_ioda2_iommu_ops;
+
 	ret = pnv_pci_ioda2_table_alloc_pages(nid,
 			bus_offset, page_shift, window_size,
 			levels, tbl);
@@ -2325,8 +2325,6 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 		return ret;
 	}

-	tbl->it_ops = &pnv_ioda2_iommu_ops;
-
 	*ptbl = tbl;

 	return 0;
@@ -2367,7 +2365,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
 				rc);
-		pnv_ioda2_table_free(tbl);
+		iommu_free_table(tbl, "");
 		return rc;
 	}

@@ -2455,7 +2453,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-	pnv_ioda2_table_free(tbl);
+	iommu_free_table(tbl, "pnv");
 }

 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cf3de91fbfe7..fbec7348a7e5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;

 	tce_iommu_userspace_view_free(tbl, container->mm);
-	tbl->it_ops->free(tbl);
+	iommu_free_table(tbl, "");
 	decrement_locked_vm(container->mm, pages);
 }
--
2.11.0
[PATCH kernel v11 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
In real mode, TCE tables are invalidated using special cache-inhibited
store instructions which are not available in virtual mode.

This defines and implements the exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as from
now on pnv_pci_ioda2_tce_invalidate() can be called in real mode too.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
 arch/powerpc/include/asm/iommu.h          |  7 +++++++
 arch/powerpc/kernel/iommu.c               | 23 +++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..4554699aec02 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 		int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..9bace5df05d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1004,6 +1004,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);

+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret;
+
+	ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+		if (likely(pg)) {
+			SetPageDirty(pg);
+		} else {
+			tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+			ret = -EFAULT;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index e36738291c32..572e9c9f1ea0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1860,6 +1860,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 	return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+		unsigned long *hpa, enum dma_data_direction *direction)
+{
+	long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+	if (!ret)
+		pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+	return ret;
+}
 #endif

 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1874,6 +1885,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
 	.exchange = pnv_ioda1_tce_xchg,
+	.exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
 	.clear = pnv_ioda1_tce_free,
 	.get = pnv_tce_get,
@@ -1948,7 +1960,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
 	struct iommu_table_group_link *tgl;

-	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		struct pnv_phb *phb = pe->phb;
@@ -2004,6 +2016,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 	return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tb
[PATCH kernel v11 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
This is my current queue of patches to add acceleration of TCE updates in
KVM.

This is based on sha1 093b995e3b55 Huang Ying "mm, swap: Remove
WARN_ON_ONCE() in free_swap_slot()".

Please comment. Thanks.

Changes:
v11:
* added rb:David to 04/10
* fixed reference leak in 10/10

v10:
* fixed bugs in 10/10
* fixed 04/10 to avoid iommu_table get/put race in 10/10

v9:
* renamed few exported symbols in 04/10
* reworked various objects' reference counting in 10/10

v8:
* kept fixing oddities with error handling in 10/10

v7:
* added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c

v6:
* reworked the last patch in terms of error handling and parameters
checking

v5:
* replaced "KVM: PPC: Separate TCE validation from update" with
"KVM: PPC: iommu: Unify TCE checking"
* changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup
iommu_table disposal"
* reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
* more details in individual commit logs

v4:
* addressed comments from v3
* updated subject lines with correct component names
* regrouped the patchset in order:
	- powerpc fixes;
	- vfio_spapr_tce driver fixes;
	- KVM/PPC fixes;
	- KVM+PPC+VFIO;
* everything except last 2 patches have "Reviewed-by: David"

v3:
* there was no full repost, only last patch was posted

v2:
* 11/11 reworked to use new notifiers, it is rather RFC as it still has
an issue;
* got 09/11, 10/11 to use notifiers in 11/11;
* added rb: David to most of patches and added a comment in 05/11.
Alexey Kardashevskiy (10):
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  powerpc/powernv/iommu: Add real mode version of
    iommu_table_ops::exchange()
  powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Use preregistered memory API to access TCE list
  KVM: PPC: iommu: Unify TCE checking
  KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  18 +-
 arch/powerpc/include/asm/iommu.h           |  32 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |  12 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/uapi/linux/kvm.h                   |   7 +
 arch/powerpc/kernel/iommu.c                |  89 ++---
 arch/powerpc/kvm/book3s_64_vio.c           | 308 ++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 303 ++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 +++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  46 +++--
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            | 105 ++++++
 arch/powerpc/kvm/Kconfig                   |   1 +
 18 files changed, 875 insertions(+), 107 deletions(-)

--
2.11.0
[PATCH kernel v11 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and
H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO without
passing them to user space, which saves time on switching to user space
and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. KVM
tries to handle a TCE request in real mode; if that fails, it passes the
request to virtual mode to complete the operation. If the virtual mode
handler fails, the request is passed to user space; this is not expected
to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required to
pre-register the userspace memory. The very first TCE request will be
handled in the VFIO SPAPR TCE driver anyway as the userspace view of the
TCE table (iommu_table::it_userspace) is not allocated till the very first
mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table for an unexpected reason, we
just clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report possible failures to the guest.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to the VFIO
KVM device. It takes a VFIO group fd and SPAPR TCE table fd and associates
a physical IOMMU table with the SPAPR TCE table (which is a guest view of
the hardware IOMMU table). The iommu_table object is cached and referenced
so we do not have to look it up in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't disable it
unless a VFIO container is destroyed; this adds the necessary cleanup to
the KVM_DEV_VFIO_GROUP_DEL handler.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
This adds a real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we are testing what
vmalloc_to_phys() returns in the code, this also adds a check for an
already existing vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s to
750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v11:
* fixed vfio_group reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
handler

v10:
* fixed leaking references in virt/kvm/vfio.c
* moved code to helpers - kvm_vfio_group_get_iommu_group,
kvm_spapr_tce_release_vfio_group
* fixed possible race between referencing table and destroying it via
VFIO add/remove window ioctls()

v9:
* removed referencing a group in KVM, only referencing iommu_table's now
* fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
* fixed typo in vfio.txt
* removed @argsz and @flags from struct kvm_vfio_spapr_tce

v8:
* changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
to handle them
* changed vmalloc_to_phys() callers to return H_HARDWARE
* changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
and added a comment about this in the code
* changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
and do WARN_ON
* added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
have all vmalloc_to_phys() callsites covered

v7:
* added realmode-friendly WARN_ON_ONCE_RM

v6:
* changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
* moved kvmppc_gpa_to_ua() to TCE validation

v5:
* changed error codes in multiple places
* added bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not attached already to LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for an unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of the
same IOMMU table;
* instead of checking for if any memory was preregistered, this returns
H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows to check IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed
later
---
 Documentation/virtual/kvm/devices/vfio.txt | 18 +-
[PATCH kernel v11 09/10] KVM: PPC: iommu: Unify TCE checking
This reworks the helpers for checking TCE update parameters in a way that
they can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
Changes:
v6:
* s/tce/gpa/ as TCE without permission bits is a GPA and this is what is
passed everywhere
---
 arch/powerpc/include/asm/iommu.h    | 20 +++++++++----
 arch/powerpc/include/asm/kvm_ppc.h  |  6 ++--
 arch/powerpc/kernel/iommu.c         | 37 +++++++++-----------
 arch/powerpc/kvm/book3s_64_vio_hv.c | 31 +++++++++-----
 4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index d96142572e6d..8a8ce220d7d0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -296,11 +296,21 @@ static inline void iommu_restore(void)
 #endif

 /* The API to support IOMMU operations for VFIO */
-extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages);
-extern int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce);
+extern int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages);
+extern int iommu_tce_check_gpa(unsigned long page_shift,
+		unsigned long gpa);
+
+#define iommu_tce_clear_param_check(tbl, ioba, tce_value, npages) \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), (npages)) || (tce_value))
+#define iommu_tce_put_param_check(tbl, ioba, gpa)                 \
+		(iommu_tce_check_ioba((tbl)->it_page_shift,       \
+				(tbl)->it_offset, (tbl)->it_size, \
+				(ioba), 1) ||                     \
+		iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))

 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eba8988d8443..72c2a155641f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -169,8 +169,10 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
 		struct kvm *kvm, unsigned long liobn);
-extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-		unsigned long ioba, unsigned long npages);
+#define kvmppc_ioba_validate(stt, ioba, npages)                         \
+		(iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
+				(stt)->size, (ioba), (npages)) ?        \
+				H_PARAMETER : H_SUCCESS)
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
 		unsigned long tce);
 extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af915da5e03a..e73927352672 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -963,47 +963,36 @@ void iommu_flush_tce(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_flush_tce);

-int iommu_tce_clear_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce_value,
-		unsigned long npages)
+int iommu_tce_check_ioba(unsigned long page_shift,
+		unsigned long offset, unsigned long size,
+		unsigned long ioba, unsigned long npages)
 {
-	/* tbl->it_ops->clear() does not support any value but 0 */
-	if (tce_value)
-		return -EINVAL;
+	unsigned long mask = (1UL << page_shift) - 1;

-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
+	if (ioba & mask)
 		return -EINVAL;

-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
+	ioba >>= page_shift;
+	if (ioba < offset)
 		return -EINVAL;

-	if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+	if ((ioba + 1) > (offset + size))
 		return -EINVAL;

 	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_ioba);

-int iommu_tce_put_param_check(struct iommu_table *tbl,
-		unsigned long ioba, unsigned long tce)
+int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 {
-	if (tce & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	if (ioba & ~IOMMU_PAGE_MASK(tbl))
-		return -EINVAL;
-
-	ioba >>= tbl->it_page_shift;
-	if (ioba < tbl->it_offset)
[PATCH kernel v11 08/10] KVM: PPC: Use preregistered memory API to access TCE list
VFIO on sPAPR already implements guest memory pre-registration, where the entire guest RAM gets pinned. This can be used to translate the physical address of a guest page containing the TCE list from H_PUT_TCE_INDIRECT. This makes use of the pre-registered memory API to access TCE list pages in order to avoid unnecessary locking on the KVM memory reverse map: since all of guest memory is pinned, we have a flat array mapping GPA to HPA, and it is simpler and quicker to index into that array (even with looking up the kernel page tables in vmalloc_to_phys) than it is to find the memslot, lock the rmap entry, look up the user page tables, and unlock the rmap entry. Note that the rmap pointer is initialized to NULL where declared (not in this patch). If a requested chunk of memory has not been preregistered, this falls back to the non-preregistered case and locks the rmap. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v4: * removed oneline inlines * now falls back to locking rmap if TCE list is not in preregistered memory v2: * updated the commit log with David's comment --- arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++-- 1 file changed, 42 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index 918af76ab2b6..0f145fc7a3a5 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, long i, ret = H_SUCCESS; unsigned long tces, entry, ua = 0; unsigned long *rmap = NULL; + bool prereg = false; stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) @@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, if (ret != H_SUCCESS) return ret; - if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap)) - return H_TOO_HARD; + if (mm_iommu_preregistered(vcpu->kvm->mm)) { + /* +* We get here if guest memory was pre-registered which +* is normally VFIO case
and gpa->hpa translation does not +* depend on hpt. +*/ + struct mm_iommu_table_group_mem_t *mem; - rmap = (void *) vmalloc_to_phys(rmap); + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL)) + return H_TOO_HARD; - /* -* Synchronize with the MMU notifier callbacks in -* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). -* While we have the rmap lock, code running on other CPUs -* cannot finish unmapping the host real page that backs -* this guest real page, so we are OK to access the host -* real page. -*/ - lock_rmap(rmap); - if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) { - ret = H_TOO_HARD; - goto unlock_exit; + mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K); + if (mem) + prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0; + } + + if (!prereg) { + /* +* This is usually a case of a guest with emulated devices only +* when TCE list is not in preregistered memory. +* We do not require memory to be preregistered in this case +* so lock rmap and do __find_linux_pte_or_hugepte(). +*/ + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap)) + return H_TOO_HARD; + + rmap = (void *) vmalloc_to_phys(rmap); + + /* +* Synchronize with the MMU notifier callbacks in +* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). +* While we have the rmap lock, code running on other CPUs +* cannot finish unmapping the host real page that backs +* this guest real page, so we are OK to access the host +* real page. +*/ + lock_rmap(rmap); + if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) { + ret = H_TOO_HARD; + goto unlock_exit; + } } for (i = 0; i < npages; ++i) { @@ -289,7 +314,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, } unlock_exit: - unlock_rmap(rmap); + if (rmap) + unlock_rmap(rmap); return ret; } -- 2.11.0
[PATCH kernel v11 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm* there. This will be used in the following patches where we will be attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than to VCPU). Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/include/asm/kvm_ppc.h | 2 +- arch/powerpc/kvm/book3s_64_vio.c| 7 --- arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++-- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index dd11c4c8c56a..eba8988d8443 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -168,7 +168,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm_vcpu *vcpu, unsigned long liobn); + struct kvm *kvm, unsigned long liobn); extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt, unsigned long ioba, unsigned long npages); extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt, diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 3e26cd4979f9..e96a4590464c 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -214,12 +214,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -247,7 +248,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, u64 __user *tces; u64 tce; - stt = kvmppc_find_table(vcpu, liobn); + stt = 
kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -301,7 +302,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index e4c4ea973e57..918af76ab2b6 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -48,10 +48,9 @@ * WARNING: This will be called in real or virtual mode on HV KVM and virtual * mode on PR KVM */ -struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu, +struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm, unsigned long liobn) { - struct kvm *kvm = vcpu->kvm; struct kvmppc_spapr_tce_table *stt; list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list) @@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua); long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, unsigned long tces, entry, ua = 0; unsigned long *rmap = NULL; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long 
liobn, unsigned long ioba) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; unsigned long idx; struct page *page; u64 *tbl; + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; -- 2.11.0
[PATCH kernel v11 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
This adds a capability number for in-kernel support for VFIO on the sPAPR platform. The capability tells the user space whether the in-kernel handlers of H_PUT_TCE can handle VFIO-targeted requests. If not, the user space must not attempt to allocate a TCE table in the host kernel via the KVM_CREATE_SPAPR_TCE KVM ioctl, because in that case TCE requests would not be passed to the user space, which is the desired behaviour in that situation. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- include/uapi/linux/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index f51d5082a377..f5a52ffb6b58 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PPC_MMU_RADIX 134 #define KVM_CAP_PPC_MMU_HASH_V3 135 #define KVM_CAP_IMMEDIATE_EXIT 136 +#define KVM_CAP_SPAPR_TCE_VFIO 137 #ifdef KVM_CAP_IRQ_ROUTING -- 2.11.0
[PATCH kernel v11 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory
This makes mm_iommu_lookup() able to work in realmode by replacing list_for_each_entry_rcu() (which can do debug checks that fail in real mode) with list_for_each_entry_lockless(). This adds a realmode version of mm_iommu_ua_to_hpa() which adds an explicit vmalloc'd-to-linear address conversion. Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail. This changes mm_iommu_preregistered() to receive @mm, as in real mode @current does not always have a correct pointer. This adds a realmode version of mm_iommu_lookup() which receives @mm (for the same reason as mm_iommu_preregistered()) and uses the lockless variant of list_for_each_entry_rcu(). Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/include/asm/mmu_context.h | 4 arch/powerpc/mm/mmu_context_iommu.c| 39 ++ 2 files changed, 43 insertions(+) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b9e3f0aca261..c70c8272523d 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm); extern void mm_iommu_cleanup(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, unsigned long ua, unsigned long size); +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm( + struct mm_struct *mm, unsigned long ua, unsigned long size); extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries); extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned long *hpa); +extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem); #endif diff --git a/arch/powerpc/mm/mmu_context_iommu.c
b/arch/powerpc/mm/mmu_context_iommu.c index 497130c5c742..fc67bd766eaf 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, } EXPORT_SYMBOL_GPL(mm_iommu_lookup); +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm, + unsigned long ua, unsigned long size) +{ + struct mm_iommu_table_group_mem_t *mem, *ret = NULL; + + list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list, + next) { + if ((mem->ua <= ua) && + (ua + size <= mem->ua + +(mem->entries << PAGE_SHIFT))) { + ret = mem; + break; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm); + struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries) { @@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, } EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa); +long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa) +{ + const long entry = (ua - mem->ua) >> PAGE_SHIFT; + void *va = &mem->hpas[entry]; + unsigned long *pa; + + if (entry >= mem->entries) + return -EFAULT; + + pa = (void *) vmalloc_to_phys(va); + if (!pa) + return -EFAULT; + + *hpa = *pa | (ua & ~PAGE_MASK); + + return 0; +} +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm); + long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem) { if (atomic64_inc_not_zero(&mem->mapped)) -- 2.11.0
[PATCH 3/3] powerpc/powernv: Introduce address translation services for Nvlink2
Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple --- arch/powerpc/include/asm/book3s/64/mmu.h | 6 + arch/powerpc/include/asm/opal-api.h| 5 +- arch/powerpc/include/asm/opal.h| 5 + arch/powerpc/include/asm/tlb.h | 10 +- arch/powerpc/mm/mmu_context_book3s64.c | 1 + arch/powerpc/platforms/powernv/npu-dma.c | 427 + arch/powerpc/platforms/powernv/opal-wrappers.S | 3 + arch/powerpc/platforms/powernv/pci-ioda.c | 2 + arch/powerpc/platforms/powernv/pci.h | 25 +- 9 files changed, 480 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 805d4105..1676ec8 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -73,10 +73,16 @@ extern struct patb_entry *partition_tb; typedef unsigned long mm_context_id_t; struct spinlock; +/* Maximum possible number of NPUs in a system. 
*/ +#define NV_MAX_NPUS 8 + typedef struct { mm_context_id_t id; u16 user_psize; /* page size index */ + /* NPU NMMU context */ + struct npu_context *npu_context; + #ifdef CONFIG_PPC_MM_SLICES u64 low_slices_psize; /* SLB page size encodings */ unsigned char high_slices_psize[SLICE_ARRAY_SIZE]; diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index a0aa285..a599a2c 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -168,7 +168,10 @@ #define OPAL_INT_SET_MFRR 125 #define OPAL_PCI_TCE_KILL 126 #define OPAL_NMMU_SET_PTCR 127 -#define OPAL_LAST 127 +#define OPAL_NPU_INIT_CONTEXT 146 +#define OPAL_NPU_DESTROY_CONTEXT 147 +#define OPAL_NPU_MAP_LPAR 148 +#define OPAL_LAST 148 /* Device tree flags */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 1ff03a6..b3b97c4 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -29,6 +29,11 @@ extern struct device_node *opal_node; /* API functions */ int64_t opal_invalid_call(void); +int64_t opal_npu_destroy_context(uint64_t phb_id, uint64_t pid, uint64_t bdf); +int64_t opal_npu_init_context(uint64_t phb_id, int pasid, uint64_t msr, + uint64_t bdf); +int64_t opal_npu_map_lpar(uint64_t phb_id, uint64_t bdf, uint64_t lparid, + uint64_t lpcr); int64_t opal_console_write(int64_t term_number, __be64 *length, const uint8_t *buffer); int64_t opal_console_read(int64_t term_number, __be64 *length, diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index 6095575..fc61fca 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -63,15 +63,21 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb, } #ifdef CONFIG_SMP +/* If there is an NPU context associated with this thread it may have + * been active on a GPU which has issued translation requests via the + * nest mmu. 
In this case we need to do a broadcast tlb to invalidate + * any caches on the nest mmu. Invalidations on the GPU are handled + * via mmu notfiers. + */ static inline int mm_is_core_local(struct mm_struct *mm) { - return cpumask_subset(mm_cpumask(mm), + return !mm->context.npu_context && cpumask_subset(mm_cpumask(mm), topology_sibling_cpumask(smp_processor_id())); } static inline int mm_is_thread_local(struct mm_struct *mm) { - return cpumask_equal(mm_cpumask(mm), + return !mm->context.npu_context && cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())); } diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index 73bf6e1..eb317f1 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -67,6 +67,7 @@ static int radix__init_new_context(struct mm_struct *mm, int index)
[PATCH 2/3] powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev
The pnv_pci_get_{gpu|npu}_dev functions are used to find associations between nvlink PCIe devices and standard PCIe devices. However, they lacked basic sanity checking, which results in a NULL pointer dereference if they are called incorrectly; that failure can be harder to spot than an explicit WARN_ON. Signed-off-by: Alistair Popple --- arch/powerpc/platforms/powernv/npu-dma.c | 12 1 file changed, 12 insertions(+) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 1c383f3..050bd5d 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -37,6 +37,12 @@ struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev) struct device_node *dn; struct pci_dev *gpdev; + if (WARN_ON(!npdev)) + return NULL; + + if (WARN_ON(!npdev->dev.of_node)) + return NULL; + /* Get assoicated PCI device */ dn = of_parse_phandle(npdev->dev.of_node, "ibm,gpu", 0); if (!dn) @@ -55,6 +61,12 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index) struct device_node *dn; struct pci_dev *npdev; + if (WARN_ON(!gpdev)) + return NULL; + + if (WARN_ON(!gpdev->dev.of_node)) + return NULL; + /* Get assoicated PCI device */ dn = of_parse_phandle(gpdev->dev.of_node, "ibm,npu", index); if (!dn) -- 2.1.4
[PATCH 1/3] drivers/of/base.c: Add of_property_read_u64_index
There is of_property_read_u32_index but no u64 variant. This patch adds one similar to the u32 version for u64. Signed-off-by: Alistair Popple --- drivers/of/base.c | 31 +++ include/linux/of.h | 3 +++ 2 files changed, 34 insertions(+) diff --git a/drivers/of/base.c b/drivers/of/base.c index d7c4629..0ea16bd 100644 --- a/drivers/of/base.c +++ b/drivers/of/base.c @@ -1213,6 +1213,37 @@ int of_property_read_u32_index(const struct device_node *np, EXPORT_SYMBOL_GPL(of_property_read_u32_index); /** + * of_property_read_u64_index - Find and read a u64 from a multi-value property. + * + * @np:device node from which the property value is to be read. + * @propname: name of the property to be searched. + * @index: index of the u64 in the list of values + * @out_value: pointer to return value, modified only if no error. + * + * Search for a property in a device node and read nth 64-bit value from + * it. Returns 0 on success, -EINVAL if the property does not exist, + * -ENODATA if property does not have a value, and -EOVERFLOW if the + * property data isn't large enough. + * + * The out_value is modified only if a valid u64 value can be decoded. + */ +int of_property_read_u64_index(const struct device_node *np, + const char *propname, + u32 index, u64 *out_value) +{ + const u64 *val = of_find_property_value_of_size(np, propname, + ((index + 1) * sizeof(*out_value)), + 0, NULL); + + if (IS_ERR(val)) + return PTR_ERR(val); + + *out_value = be64_to_cpup(((__be64 *)val) + index); + return 0; +} +EXPORT_SYMBOL_GPL(of_property_read_u64_index); + +/** * of_property_read_variable_u8_array - Find and read an array of u8 from a * property, with bounds on the minimum and maximum array size. 
* diff --git a/include/linux/of.h b/include/linux/of.h index 21e6323..d08788d 100644 --- a/include/linux/of.h +++ b/include/linux/of.h @@ -292,6 +292,9 @@ extern int of_property_count_elems_of_size(const struct device_node *np, extern int of_property_read_u32_index(const struct device_node *np, const char *propname, u32 index, u32 *out_value); +extern int of_property_read_u64_index(const struct device_node *np, + const char *propname, + u32 index, u64 *out_value); extern int of_property_read_variable_u8_array(const struct device_node *np, const char *propname, u8 *out_values, size_t sz_min, size_t sz_max); -- 2.1.4
[PATCH V5 16/17] mm: Let arch choose the initial value of task size
As we start supporting a larger address space (>128TB), we want to give the architecture control over the maximum task size of an application, which can differ from TASK_SIZE. For example, ppc64 needs to track the base page size of a segment, and it is copied from mm_context_t to the PACA on each context switch. If we know that the application has not used an address range above 128TB, we only need to copy details about the 128TB range to the PACA. This will help improve context switch performance by avoiding the larger copy operation. Cc: Kirill A. Shutemov Cc: linux...@kvack.org Cc: Andrew Morton Signed-off-by: Aneesh Kumar K.V --- fs/exec.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/exec.c b/fs/exec.c index 65145a3df065..5550a56d03c3 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1308,6 +1308,14 @@ void would_dump(struct linux_binprm *bprm, struct file *file) } EXPORT_SYMBOL(would_dump); +#ifndef arch_init_task_size +static inline void arch_init_task_size(void) +{ + current->mm->task_size = TASK_SIZE; +} +#define arch_init_task_size arch_init_task_size +#endif + void setup_new_exec(struct linux_binprm * bprm) { arch_pick_mmap_layout(current->mm); @@ -1327,7 +1335,7 @@ void setup_new_exec(struct linux_binprm * bprm) * depend on TIF_32BIT which is only updated in flush_thread() on * some architectures like powerpc */ - current->mm->task_size = TASK_SIZE; + arch_init_task_size(); /* install the new credentials */ if (!uid_eq(bprm->cred->uid, current_euid()) || -- 2.7.4
[PATCH V5 14/17] powerpc/mm/hash: Skip using reserved virtual address range
Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 7 arch/powerpc/include/asm/kvm_book3s_64.h | 2 - arch/powerpc/include/asm/mmu_context.h| 1 + arch/powerpc/mm/hash_utils_64.c | 58 +++ arch/powerpc/mm/mmu_context_book3s64.c| 26 5 files changed, 92 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index c99ea6bbd82c..ac987e08ce63 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -578,11 +578,18 @@ extern void slb_set_size(u16 size); #define VSID_MULTIPLIER_256M ASM_CONST(12538073) /* 24-bit prime */ #define VSID_BITS_256M (VA_BITS - SID_SHIFT) #define VSID_BITS_65_256M (65 - SID_SHIFT) +/* + * Modular multiplicative inverse of VSID_MULTIPLIER under modulo VSID_MODULUS + */ +#define VSID_MULINV_256M ASM_CONST(665548017062) #define VSID_MULTIPLIER_1T ASM_CONST(12538073) /* 24-bit prime */ #define VSID_BITS_1T (VA_BITS - SID_SHIFT_1T) #define VSID_BITS_65_1T(65 - SID_SHIFT_1T) +#define VSID_MULINV_1T ASM_CONST(209034062) +/* 1TB VSID reserved for VRMA */ +#define VRMA_VSID 0x1ffUL #define USER_VSID_RANGE(1UL << (ESID_BITS + SID_SHIFT)) /* 4 bits per slice and we have one slice per 1TB */ diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index d9b48f5bb606..d55c7f881ce7 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ 
b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -49,8 +49,6 @@ static inline bool kvm_is_radix(struct kvm *kvm) #define KVM_DEFAULT_HPT_ORDER 24 /* 16MB HPT by default */ #endif -#define VRMA_VSID 0x1ffUL /* 1TB VSID reserved for VRMA */ - /* * We use a lock bit in HPTE dword 0 to synchronize updates and * accesses to each HPTE, and another bit to indicate non-present diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 8fe1ba1808d3..757d4a9e1a1c 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -51,6 +51,7 @@ static inline void switch_mmu_context(struct mm_struct *prev, return switch_slb(tsk, next); } +extern void hash__resv_context(int context_id); extern int hash__get_new_context(void); extern void __destroy_context(int context_id); static inline void mmu_context_init(void) { } diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 04523052ad8e..81db182225fb 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1846,4 +1846,62 @@ static int __init hash64_debugfs(void) } machine_device_initcall(pseries, hash64_debugfs); +/* + * if modinv is the modular multiplicate inverse of (x % vsid_modulus) and + * vsid = (protovsid * x) % vsid_modulus, then we say + * + * provosid = (vsid * modinv) % vsid_modulus + */ +static unsigned long vsid_unscramble(unsigned long vsid, int ssize) +{ + unsigned long protovsid; + unsigned long va_bits = VA_BITS; + unsigned long modinv, vsid_modulus; + unsigned long max_mod_inv, tmp_modinv; + + + if (!mmu_has_feature(MMU_FTR_68_BIT_VA)) + va_bits = 65; + + if (ssize == MMU_SEGSIZE_256M) { + modinv = VSID_MULINV_256M; + vsid_modulus = ((1UL << (va_bits - SID_SHIFT)) - 1); + } else { + modinv = VSID_MULINV_1T; + vsid_modulus = ((1UL << (va_bits - SID_SHIFT_1T)) - 1); + } + /* +* vsid outside our range. 
+*/ + if (vsid >= vsid_modulus) + return 0; + + /* Check if (vsid * modinv) overflow (63 bits) */ + max_mod_inv = 0x7fffull / vsid; + if (modinv < max_mod_inv) + return (vsid * modinv) % vsid_modulus; + + tmp_modinv = modinv/max_mod_inv; + modinv %= max_mod_inv; + + protovsid = (((vsid * max_mod_inv) % vsid_modulus) * tmp_modinv) % vsid_modulus; + protovsid = (protovsid + vsid * modinv) % vsid_modulus; + return protovsid; +} + +static int __init hash_init_reserved_context(void) +{ + unsigned long protovsid; + + /* +* VRMA_VSID to skip list. We don't bother about +* ibm,adjunct-virtual-addresses because we disable +
[PATCH V5 17/17] powerpc/mm: Enable mappings above 128TB
Not all user space applications are ready to handle wide addresses. It is known that at least some JIT compilers use the higher bits in pointers to encode their own information, which collides with valid pointers at 512TB addresses and leads to crashes. To mitigate this, we are not going to allocate virtual address space above 128TB by default. However, userspace can ask for an allocation from the full address space by specifying a hint address (with or without MAP_FIXED) above 128TB. If the hint address is set above 128TB but MAP_FIXED is not specified, we first try to find an unmapped area at the specified address. If it is already occupied, we look for an unmapped area in the *full* address space rather than within the 128TB window. This approach makes it easy for an application's memory allocator to become aware of the large address space without manually tracking allocated virtual address space. This is a per-mmap decision, i.e. we can have some mmaps with larger addresses and others that do not. A sample memory layout looks like below.
1000-1001 r-xp fc:00 9057045 /home/max_addr_512TB 1001-1002 r--p fc:00 9057045 /home/max_addr_512TB 1002-1003 rw-p 0001 fc:00 9057045 /home/max_addr_512TB 1002963-1002966 rw-p 00:00 0[heap] 7fff834a-7fff834b rw-p 00:00 0 7fff834b-7fff8367 r-xp fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff8367-7fff8368 r--p 001b fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff8368-7fff8369 rw-p 001c fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff8369-7fff836a rw-p 00:00 0 7fff836a-7fff836c r-xp 00:00 0 [vdso] 7fff836c-7fff8370 r-xp fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff8370-7fff8371 r--p 0003 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff8371-7fff8372 rw-p 0004 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fffdccf-7fffdcd2 rw-p 00:00 0 [stack] 1-10001 rw-p 00:00 0 18371-18372 rw-p 00:00 0 Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/processor.h | 27 ++ arch/powerpc/mm/hugetlbpage-radix.c | 7 + arch/powerpc/mm/mmap.c | 33 ++ arch/powerpc/mm/slice.c | 55 4 files changed, 94 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 146c3a91d89f..e3e667240d8f 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -114,9 +114,9 @@ void release_thread(struct task_struct *); /* * MAx value currently used: */ -#define TASK_SIZE_USER64 TASK_SIZE_128TB +#define TASK_SIZE_USER64 TASK_SIZE_512TB #else -#define TASK_SIZE_USER64 TASK_SIZE_64TB +#define TASK_SIZE_USER64 TASK_SIZE_64TB #endif /* @@ -128,26 +128,43 @@ void release_thread(struct task_struct *); #define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \ TASK_SIZE_USER32 : TASK_SIZE_USER64) #define TASK_SIZETASK_SIZE_OF(current) +/* + * We want to track current task size in mm->task_size not the max possible + * task size. 
+ */ +#define arch_init_task_size() (current->mm->task_size = DEFAULT_MAP_WINDOW) /* This decides where the kernel will search for a free chunk of vm * space during mmap's. */ #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4)) -#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_USER64 / 4)) +#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4)) #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \ TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 ) #endif +/* + * Initial task size value for user applications. For book3s 64 we start + * with 128TB and conditionally enable upto 512TB + */ +#ifdef CONFIG_PPC_BOOK3S_64 +#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \ +TASK_SIZE_USER32 : TASK_SIZE_128TB) +#else +#define DEFAULT_MAP_WINDOW TASK_SIZE +#endif + #ifdef __powerpc64__ -#define STACK_TOP_USER64 TASK_SIZE_USER64 +/* Limit stack to 128TB */ +#define STACK_TOP_USER64 TASK_SIZE_128TB #define STACK_TOP_USER32 TASK_SIZE_USER32 #define STACK_TOP (is_32bit_task() ? \ STACK_TOP_USER32 : STACK_TOP_USER64) -#define STACK_TOP_MAX STACK_TOP_USER64 +#define STACK_TOP_MAX TASK_SIZE_USER64 #else /* __powerpc64__ */ diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c index 686cfe8e664b..f596215d0c1b 100644
[PATCH V5 15/17] powerpc/mm: Switch TASK_SIZE check to use mm->task_size
Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/hugetlbpage-radix.c | 4 ++-- arch/powerpc/mm/mmap.c | 12 ++-- arch/powerpc/mm/slice.c | 6 +++--- arch/powerpc/mm/subpage-prot.c | 3 ++- 4 files changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c index 35254a678456..686cfe8e664b 100644 --- a/arch/powerpc/mm/hugetlbpage-radix.c +++ b/arch/powerpc/mm/hugetlbpage-radix.c @@ -52,7 +52,7 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr, if (len & ~huge_page_mask(h)) return -EINVAL; - if (len > TASK_SIZE) + if (len > mm->task_size) return -ENOMEM; if (flags & MAP_FIXED) { @@ -64,7 +64,7 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr, if (addr) { addr = ALIGN(addr, huge_page_size(h)); vma = find_vma(mm, addr); - if (TASK_SIZE - len >= addr && + if (mm->task_size - len >= addr && (!vma || addr + len <= vma->vm_start)) return addr; } diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c index a5d9ef59debe..bce788a41bc3 100644 --- a/arch/powerpc/mm/mmap.c +++ b/arch/powerpc/mm/mmap.c @@ -97,7 +97,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info; - if (len > TASK_SIZE - mmap_min_addr) + if (len > mm->task_size - mmap_min_addr) return -ENOMEM; if (flags & MAP_FIXED) @@ -106,7 +106,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr, if (addr) { addr = PAGE_ALIGN(addr); vma = find_vma(mm, addr); - if (TASK_SIZE - len >= addr && addr >= mmap_min_addr && + if (mm->task_size - len >= addr && addr >= mmap_min_addr && (!vma || addr + len <= vma->vm_start)) return addr; } @@ -114,7 +114,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr, info.flags = 0; info.length = len; info.low_limit = mm->mmap_base; - info.high_limit = TASK_SIZE; + info.high_limit = mm->task_size; info.align_mask = 0; return vm_unmapped_area(&info); 
} @@ -132,7 +132,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp, struct vm_unmapped_area_info info; /* requested length too big for entire address space */ - if (len > TASK_SIZE - mmap_min_addr) + if (len > mm->task_size - mmap_min_addr) return -ENOMEM; if (flags & MAP_FIXED) @@ -142,7 +142,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp, if (addr) { addr = PAGE_ALIGN(addr); vma = find_vma(mm, addr); - if (TASK_SIZE - len >= addr && addr >= mmap_min_addr && + if (mm->task_size - len >= addr && addr >= mmap_min_addr && (!vma || addr + len <= vma->vm_start)) return addr; } @@ -164,7 +164,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp, VM_BUG_ON(addr != -ENOMEM); info.flags = 0; info.low_limit = TASK_UNMAPPED_BASE; - info.high_limit = TASK_SIZE; + info.high_limit = mm->task_size; addr = vm_unmapped_area(&info); } diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 217638846053..6d214786a2df 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -277,7 +277,7 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm, info.align_offset = 0; addr = TASK_UNMAPPED_BASE; - while (addr < TASK_SIZE) { + while (addr < mm->task_size) { info.low_limit = addr; if (!slice_scan_available(addr, available, 1, &addr)) continue; @@ -289,8 +289,8 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm, * Check if we need to reduce the range, or if we can * extend it to cover the next available slice. */ - if (addr >= TASK_SIZE) - addr = TASK_SIZE; + if (addr >= mm->task_size) + addr = mm->task_size; else if (slice_scan_available(addr, available, 1, &next_end)) { addr = next_end; goto next_slice; diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c index 94210940112f..e94fbd4c8845 100644 --- a/arch/powerpc/mm/subpage-prot.c +++ b/arch/powerpc/mm/subpage-prot.c @@ -197,7 +197,8 @@ long sys_subpage_prot(unsigned long addr, unsigned long
[PATCH V5 13/17] powerpc/mm/hash: Store task size in PACA
We optimize the slice page size array copy to the paca by copying only the range based on the task size. This requires that we not look at the page size array beyond the task size in the PACA on an SLB fault. To enable that, copy the task size to the paca, where it will be used during SLB fault handling.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/paca.h | 4 +++- arch/powerpc/kernel/asm-offsets.c | 4 arch/powerpc/kernel/paca.c| 1 + arch/powerpc/mm/slb_low.S | 8 +++- 4 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index f48c250339fd..25f4a1c14759 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -144,7 +144,9 @@ struct paca_struct { u16 mm_ctx_sllp; #endif #endif - +#ifdef CONFIG_PPC_STD_MMU_64 + u64 task_size; +#endif /* * then miscellaneous read-write fields */ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 4367e7df51a1..a60ef1d976ab 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -189,6 +189,10 @@ int main(void) #endif /* CONFIG_PPC_MM_SLICES */ #endif +#ifdef CONFIG_PPC_STD_MMU_64 + DEFINE(PACATASKSIZE, offsetof(struct paca_struct, task_size)); +#endif + #ifdef CONFIG_PPC_BOOK3E OFFSET(PACAPGD, paca_struct, pgd); OFFSET(PACA_KERNELPGD, paca_struct, kernel_pgd); diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index bffdbd6d6774..50b60e23d07f 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -254,6 +254,7 @@ void copy_mm_to_paca(struct mm_struct *mm) get_paca()->mm_ctx_id = context->id; #ifdef CONFIG_PPC_MM_SLICES VM_BUG_ON(!mm->task_size); + get_paca()->task_size = mm->task_size; get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; memcpy(&get_paca()->mm_ctx_high_slices_psize, &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm)); diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S index 35e91e89640f..b09e7748856f
100644 --- a/arch/powerpc/mm/slb_low.S +++ b/arch/powerpc/mm/slb_low.S @@ -149,7 +149,13 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT) * For userspace addresses, make sure this is region 0. */ cmpdi r9, 0 - bne 8f + bne-8f +/* + * user space make sure we are within the allowed limit +*/ + ld r11,PACATASKSIZE(r13) + cmpld r3,r11 + bge-8f /* when using slices, we extract the psize off the slice bitmaps * and then we need to get the sllp encoding off the mmu_psize_defs -- 2.7.4
[PATCH V5 12/17] powerpc/mm/slice: Use mm task_size as max value of slice index
In the follow-up patch, we will increase the slice array size to handle the 512TB range, but will limit the task size to 128TB. Avoid doing unnecessary computation and avoid doing slice-mask-related operations above task_size.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 3 ++- arch/powerpc/kernel/paca.c| 3 ++- arch/powerpc/kernel/setup-common.c| 9 + arch/powerpc/mm/mmu_context_book3s64.c| 8 arch/powerpc/mm/slice.c | 22 -- 5 files changed, 33 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 73f34a98ce99..c99ea6bbd82c 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -586,7 +586,8 @@ extern void slb_set_size(u16 size); #define USER_VSID_RANGE(1UL << (ESID_BITS + SID_SHIFT)) /* 4 bits per slice and we have one slice per 1TB */ -#define SLICE_ARRAY_SIZE (H_PGTABLE_RANGE >> 41) +#define SLICE_ARRAY_SIZE (H_PGTABLE_RANGE >> 41) +#define TASK_SLICE_ARRAY_SZ(x) ((x)->task_size >> 41) #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index e2cf745a4b94..bffdbd6d6774 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -253,9 +253,10 @@ void copy_mm_to_paca(struct mm_struct *mm) get_paca()->mm_ctx_id = context->id; #ifdef CONFIG_PPC_MM_SLICES + VM_BUG_ON(!mm->task_size); get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; memcpy(&get_paca()->mm_ctx_high_slices_psize, - &context->high_slices_psize, SLICE_ARRAY_SIZE); + &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm)); #else /* CONFIG_PPC_MM_SLICES */ get_paca()->mm_ctx_user_psize = context->user_psize; get_paca()->mm_ctx_sllp = context->sllp; diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 4697da895133..aaf1e2befbcb 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -920,6
+920,15 @@ void __init setup_arch(char **cmdline_p) init_mm.end_code = (unsigned long) _etext; init_mm.end_data = (unsigned long) _edata; init_mm.brk = klimit; + +#ifdef CONFIG_PPC_MM_SLICES +#ifdef CONFIG_PPC64 + init_mm.task_size = TASK_SIZE_USER64; +#else +#error "task_size not initialized." +#endif +#endif + #ifdef CONFIG_PPC_64K_PAGES init_mm.context.pte_frag = NULL; #endif diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index e6a5bcbf8abe..9ab6cd2923be 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -86,6 +86,14 @@ static int hash__init_new_context(struct mm_struct *mm) * We should not be calling init_new_context() on init_mm. Hence a * check against 0 is ok. */ +#ifdef CONFIG_PPC_MM_SLICES + /* +* We do switch_slb early in fork, even before we setup the mm->task_size. +* Default to max task size so that we copy the default values to paca +* which will help us to handle slb miss early. 
+*/ + mm->task_size = TASK_SIZE_USER64; +#endif if (mm->context.id == 0) slice_set_user_psize(mm, mmu_virtual_psize); subpage_prot_init_new_context(mm); diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index ee3cfc5d9bbc..217638846053 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -136,7 +136,7 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret) if (mm->task_size <= SLICE_LOW_TOP) return; - for (i = 0; i < SLICE_NUM_HIGH; i++) + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) if (!slice_high_has_vma(mm, i)) __set_bit(i, ret->high_slices); } @@ -157,7 +157,7 @@ static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_ma ret->low_slices |= 1u << i; hpsizes = mm->context.high_slices_psize; - for (i = 0; i < SLICE_NUM_HIGH; i++) { + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) { mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize) @@ -165,15 +165,17 @@ static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_ma } } -static int slice_check_fit(struct slice_mask mask, struct slice_mask available) +static int slice_check_fit(struct mm_struct *mm, + struct slice_mask mask, struct slice_mask available) { DECLARE_BITMAP(result, SLICE_NUM_HIGH); + unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->task_size); bitmap_and(
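The shift arithmetic introduced here (TASK_SLICE_ARRAY_SZ and the GET_HIGH_SLICE_INDEX bound) is easy to sanity-check in isolation. A userspace sketch, assuming the 1TB high-slice geometry with 4 bits of page-size encoding per slice:

```c
#define SLICE_HIGH_SHIFT 40	/* high slices are 1TB each */

/* number of 1TB high slices needed to cover a given task size */
unsigned long high_slice_index(unsigned long task_size)
{
	return task_size >> SLICE_HIGH_SHIFT;
}

/* bytes of psize array: 4 bits per 1TB slice -> task_size >> 41 */
unsigned long task_slice_array_sz(unsigned long task_size)
{
	return task_size >> 41;
}
```

So capping iteration at the 128TB task size touches 64 bytes of psize array instead of the 256 bytes needed for the full 512TB range.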
[PATCH V5 11/17] powerpc/mm/hash: Increase VA range to 128TB
We update the hash Linux page table layout such that we can support 512TB, but we limit TASK_SIZE to 128TB. We can switch to 128TB by default unconditionally because that is the maximum virtual address supported by other architectures. We will later add a mechanism to increase the application's effective address range to 512TB on demand. Changing the page table layout to accommodate 512TB makes testing large memory configurations easier, with fewer code changes to the kernel.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 2 +- arch/powerpc/include/asm/book3s/64/hash-64k.h | 2 +- arch/powerpc/include/asm/processor.h | 22 ++ 3 files changed, 20 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index 0c4e470571ca..b4b5e6b671ca 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -8,7 +8,7 @@ #define H_PTE_INDEX_SIZE 9 #define H_PMD_INDEX_SIZE 7 #define H_PUD_INDEX_SIZE 9 -#define H_PGD_INDEX_SIZE 9 +#define H_PGD_INDEX_SIZE 12 #ifndef __ASSEMBLY__ #define H_PTE_TABLE_SIZE (sizeof(pte_t) << H_PTE_INDEX_SIZE) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 7be54f9590a3..214219dff87c 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -4,7 +4,7 @@ #define H_PTE_INDEX_SIZE 8 #define H_PMD_INDEX_SIZE 5 #define H_PUD_INDEX_SIZE 5 -#define H_PGD_INDEX_SIZE 12 +#define H_PGD_INDEX_SIZE 15 /* * 64k aligned address free up few of the lower bits of RPN for us diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index e0fecbcea2a2..146c3a91d89f 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -102,11 +102,25 @@ void release_thread(struct task_struct *); #endif #ifdef CONFIG_PPC64 -/* 64-bit user
address space is 46-bits (64TB user VM) */ -#define TASK_SIZE_USER64 (0x4000UL) +/* + * 64-bit user address space can have multiple limits + * For now supported values are: + */ +#define TASK_SIZE_64TB (0x4000UL) +#define TASK_SIZE_128TB (0x8000UL) +#define TASK_SIZE_512TB (0x0002UL) -/* - * 32-bit user address space is 4GB - 1 page +#ifdef CONFIG_PPC_BOOK3S_64 +/* + * MAx value currently used: + */ +#define TASK_SIZE_USER64 TASK_SIZE_128TB +#else +#define TASK_SIZE_USER64 TASK_SIZE_64TB +#endif + +/* + * 32-bit user address space is 4GB - 1 page * (this 1 page is needed so referencing of 0x generates EFAULT */ #define TASK_SIZE_USER32 (0x0001UL - (1*PAGE_SIZE)) -- 2.7.4
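The hex constants above arrived truncated in this mail; for reference, the full-width values are just the named sizes in bytes (reconstructed here, worth double-checking against the tree):

```c
/* Reconstructed full-width values (the constants in the quoted diff
 * were truncated in transit); each is simply the named size in bytes:
 * 64TB = 2^46, 128TB = 2^47, 512TB = 2^49. */
#define TASK_SIZE_64TB  0x0000400000000000UL
#define TASK_SIZE_128TB 0x0000800000000000UL
#define TASK_SIZE_512TB 0x0002000000000000UL
```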
[PATCH V5 07/17] powerpc/mm/hash: Move kernel context to the starting of context range
With the current kernel, we use the top 4 contexts for the kernel. Kernel VSIDs are built using these top context values and the effective segment ID. In the following patches, we want to increase the max effective address to 512TB. We achieve that by increasing the effective segment IDs, thereby increasing the virtual address range. We will be switching to a 68-bit virtual address in the following patch, but for platforms like P4 and P5, which only support a 65-bit VA, we want to limit the virtual address to a 65-bit value. We do that by limiting the context bits to 16 instead of 19. That means we will have different max context values on different platforms. To make this simpler, we move the kernel context to the start of the range.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 39 ++-- arch/powerpc/include/asm/mmu_context.h| 2 +- arch/powerpc/kvm/book3s_64_mmu_host.c | 2 +- arch/powerpc/mm/hash_utils_64.c | 5 -- arch/powerpc/mm/mmu_context_book3s64.c| 88 ++- arch/powerpc/mm/slb_low.S | 20 ++ 6 files changed, 84 insertions(+), 72 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 52d8d1e4b772..37dbc9becaba 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -494,10 +494,10 @@ extern void slb_set_size(u16 size); * For user processes max context id is limited to ((1ul << 19) - 5) * for kernel space, we use the top 4 context ids to map address as below * NOTE: each context only support 64TB now.
- * 0x7fffc - [ 0xc000 - 0xc0003fff ] - * 0x7fffd - [ 0xd000 - 0xd0003fff ] - * 0x7fffe - [ 0xe000 - 0xe0003fff ] - * 0x7 - [ 0xf000 - 0xf0003fff ] + * 0x0 - [ 0xc000 - 0xc0003fff ] + * 0x1 - [ 0xd000 - 0xd0003fff ] + * 0x2 - [ 0xe000 - 0xe0003fff ] + * 0x3 - [ 0xf000 - 0xf0003fff ] * * The proto-VSIDs are then scrambled into real VSIDs with the * multiplicative hash: @@ -511,15 +511,9 @@ extern void slb_set_size(u16 size); * robust scattering in the hash table (at least based on some initial * results). * - * We also consider VSID 0 special. We use VSID 0 for slb entries mapping - * bad address. This enables us to consolidate bad address handling in - * hash_page. - * * We also need to avoid the last segment of the last context, because that * would give a protovsid of 0x1f. That will result in a VSID 0 - * because of the modulo operation in vsid scramble. But the vmemmap - * (which is what uses region 0xf) will never be close to 64TB in size - * (it's 56 bytes per page of system memory). + * because of the modulo operation in vsid scramble. */ #define CONTEXT_BITS 19 @@ -532,12 +526,15 @@ extern void slb_set_size(u16 size); /* * 256MB segment * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments - * available for user + kernel mapping. The top 4 contexts are used for + * available for user + kernel mapping. The bottom 4 contexts are used for * kernel mapping. Each segment contains 2^28 bytes. Each - * context maps 2^46 bytes (64TB) so we can support 2^19-1 contexts - * (19 == 37 + 28 - 46). + * context maps 2^46 bytes (64TB). + * + * We also need to avoid the last segment of the last context, because that + * would give a protovsid of 0x1f. That will result in a VSID 0 + * because of the modulo operation in vsid scramble. 
*/ -#define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 5) +#define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 2) /* * This should be computed such that protovosid * vsid_mulitplier @@ -673,19 +670,19 @@ static inline unsigned long get_vsid(unsigned long context, unsigned long ea, * This is only valid for addresses >= PAGE_OFFSET * * For kernel space, we use the top 4 context ids to map address as below - * 0x7fffc - [ 0xc000 - 0xc0003fff ] - * 0x7fffd - [ 0xd000 - 0xd0003fff ] - * 0x7fffe - [ 0xe000 - 0xe0003fff ] - * 0x7 - [ 0xf000 - 0xf0003fff ] + * 0x0 - [ 0xc000 - 0xc0003fff ] + * 0x1 - [ 0xd000 - 0xd0003fff ] + * 0x2 - [ 0xe000 - 0xe0003fff ] + * 0x3 - [ 0xf000 - 0xf0003fff ] */ static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize) { unsigned long context; /* -* kernel take the top 4 context from the available range +* kernel take the first 4 context from the available range */ - context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1; + context = (ea >> 60) -
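The new kernel-context derivation quoted above reduces to subtracting 0xc from the top nibble of the effective address; a standalone sketch of the C equivalent:

```c
/* After this patch the four kernel regions map to the first four
 * contexts: 0xc... -> 0, 0xd... -> 1, 0xe... -> 2, 0xf... -> 3,
 * independent of the platform's MAX_USER_CONTEXT. */
unsigned long kernel_context(unsigned long ea)
{
	return (ea >> 60) - 0xc;
}
```

This is what makes per-platform context limits painless: the kernel contexts no longer move when MAX_USER_CONTEXT changes.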
[PATCH V5 10/17] powerpc/mm/hash: Convert mask to unsigned long
This doesn't have any functional change, but it helps avoid mistakes in case the shift count changes.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 078d7bf93a69..73f34a98ce99 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -409,7 +409,7 @@ static inline unsigned long hpt_vpn(unsigned long ea, static inline unsigned long hpt_hash(unsigned long vpn, unsigned int shift, int ssize) { - int mask; + unsigned long mask; unsigned long hash, vsid; /* VPN_SHIFT can be atmost 12 */ -- 2.7.4
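For reference, the trap this one-character-class change avoids is the usual narrow-int mask bug: a sketch (my illustration, not kernel code) of both variants:

```c
/* With `int mask`, any mask of 32 or more bits is truncated to 32
 * bits; once truncated to all-ones it sign-extends back to 64 bits
 * when combined with an unsigned long, so nothing gets masked.
 * (The truncation behaviour is implementation-defined, but this is
 * what the usual two's-complement compilers do.) */
unsigned long masked_buggy(unsigned long hash, unsigned int bits)
{
	int mask = (1UL << bits) - 1;	/* truncated to 32 bits */
	return hash & mask;		/* mask sign-extends to ~0UL */
}

unsigned long masked_fixed(unsigned long hash, unsigned int bits)
{
	unsigned long mask = (1UL << bits) - 1;
	return hash & mask;
}
```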
[PATCH V5 09/17] powerpc/mm/hash: VSID 0 is no more an invalid VSID
VSID 0 is now used by the linearly mapped region of the kernel. User space still should not see a VSID of 0, but having that VSID check confuses the reader. Remove it and convert the error checking to be based on the address value.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 6 -- arch/powerpc/mm/hash_utils_64.c | 19 +++ arch/powerpc/mm/pgtable-hash64.c | 1 - arch/powerpc/mm/tlb_hash64.c | 1 - 4 files changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 3897d30820b0..078d7bf93a69 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -673,12 +673,6 @@ static inline unsigned long get_vsid(unsigned long context, unsigned long ea, unsigned long vsid_bits; unsigned long protovsid; - /* -* Bad address. We return VSID 0 for that -*/ - if ((ea & ~REGION_MASK) >= H_PGTABLE_RANGE) - return 0; - if (!mmu_has_feature(MMU_FTR_68_BIT_VA)) va_bits = 65; diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 6922a8d267cc..04523052ad8e 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1222,6 +1222,13 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, ea, access, trap); trace_hash_fault(ea, access, trap); + /* Bad address. */ + if ((ea & ~REGION_MASK) >= H_PGTABLE_RANGE) { + DBG_LOW("Bad address!\n"); + rc = 1; + goto bail; + } + /* Get region & vsid */ switch (REGION_ID(ea)) { case USER_REGION_ID: @@ -1252,12 +1259,6 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, } DBG_LOW(" mm=%p, mm->pgdir=%p, vsid=%016lx\n", mm, mm->pgd, vsid); - /* Bad address.
*/ - if (!vsid) { - DBG_LOW("Bad address!\n"); - rc = 1; - goto bail; - } /* Get pgdir */ pgdir = mm->pgd; if (pgdir == NULL) { @@ -1500,8 +1501,6 @@ void hash_preload(struct mm_struct *mm, unsigned long ea, /* Get VSID */ ssize = user_segment_size(ea); vsid = get_vsid(mm->context.id, ea, ssize); - if (!vsid) - return; /* * Hash doesn't like irqs. Walking linux page table with irq disabled * saves us from holding multiple locks. @@ -1746,10 +1745,6 @@ static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi) hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize); - /* Don't create HPTE entries for bad address */ - if (!vsid) - return; - ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode, HPTE_V_BOLTED, mmu_linear_psize, mmu_kernel_ssize); diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c index 8b85a14b08ea..ddfeb141af29 100644 --- a/arch/powerpc/mm/pgtable-hash64.c +++ b/arch/powerpc/mm/pgtable-hash64.c @@ -263,7 +263,6 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr, if (!is_kernel_addr(addr)) { ssize = user_segment_size(addr); vsid = get_vsid(mm->context.id, addr, ssize); - WARN_ON(vsid == 0); } else { vsid = get_kernel_vsid(addr, mmu_kernel_ssize); ssize = mmu_kernel_ssize; diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c index 4517aa43a8b1..d8fa336bf05d 100644 --- a/arch/powerpc/mm/tlb_hash64.c +++ b/arch/powerpc/mm/tlb_hash64.c @@ -87,7 +87,6 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr, vsid = get_kernel_vsid(addr, mmu_kernel_ssize); ssize = mmu_kernel_ssize; } - WARN_ON(vsid == 0); vpn = hpt_vpn(addr, vsid, ssize); rpte = __real_pte(__pte(pte), ptep); -- 2.7.4
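The replacement check works on the raw effective address instead of a sentinel VSID. A standalone sketch, with REGION_MASK reconstructed as the top-nibble mask (an assumption on my part):

```c
#define REGION_MASK (0xfUL << 60)	/* assumed: top nibble selects the region */

/* An address is bad when its offset inside the region exceeds the
 * range covered by the page tables, regardless of which region the
 * top nibble names. */
int bad_address(unsigned long ea, unsigned long pgtable_range)
{
	return (ea & ~REGION_MASK) >= pgtable_range;
}
```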
[PATCH V5 06/17] powerpc/mm/slice: Update slice mask printing to use bitmap printing.
We now get output like below which is much better. [0.935306] good_mask low_slice: 0-15 [0.935360] good_mask high_slice: 0-511 Compared to [0.953414] good_mask: - 1. I also fixed an error with slice_dbg printing. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 30 +++--- 1 file changed, 7 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index e1c42b54f5c5..ee3cfc5d9bbc 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -52,29 +52,13 @@ int _slice_debug = 1; static void slice_print_mask(const char *label, struct slice_mask mask) { - char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1]; - int i; - if (!_slice_debug) return; - p = buf; - for (i = 0; i < SLICE_NUM_LOW; i++) - *(p++) = (mask.low_slices & (1 << i)) ? '1' : '0'; - *(p++) = ' '; - *(p++) = '-'; - *(p++) = ' '; - for (i = 0; i < SLICE_NUM_HIGH; i++) { - if (test_bit(i, mask.high_slices)) - *(p++) = '1'; - else - *(p++) = '0'; - } - *(p++) = 0; - - printk(KERN_DEBUG "%s:%s\n", label, buf); + pr_devel("%s low_slice: %*pbl\n", label, (int)SLICE_NUM_LOW, &mask.low_slices); + pr_devel("%s high_slice: %*pbl\n", label, (int)SLICE_NUM_HIGH, mask.high_slices); } -#define slice_dbg(fmt...) do { if (_slice_debug) pr_debug(fmt); } while(0) +#define slice_dbg(fmt...) 
do { if (_slice_debug) pr_devel(fmt); } while (0) #else @@ -243,8 +227,8 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz } slice_dbg(" lsps=%lx, hsps=%lx\n", - mm->context.low_slices_psize, - mm->context.high_slices_psize); + (unsigned long)mm->context.low_slices_psize, + (unsigned long)mm->context.high_slices_psize); spin_unlock_irqrestore(&slice_convert_lock, flags); @@ -686,8 +670,8 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned int psize) slice_dbg(" lsps=%lx, hsps=%lx\n", - mm->context.low_slices_psize, - mm->context.high_slices_psize); + (unsigned long)mm->context.low_slices_psize, + (unsigned long)mm->context.high_slices_psize); bail: spin_unlock_irqrestore(&slice_convert_lock, flags); -- 2.7.4
[PATCH V5 08/17] powerpc/mm/hash: Support 68 bit VA
In order to support a large effective address range (512TB), we want to increase the virtual address bits to 68. But we do have platforms like P4 and P5 that can only do a 65-bit VA; we support those platforms by limiting the context bits on them to 16. The protovsid -> vsid conversion is verified to work with both 65- and 68-bit VA values. I also documented the restrictions in table format as part of the code comments.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 125 -- arch/powerpc/include/asm/mmu.h| 19 ++-- arch/powerpc/kvm/book3s_64_mmu_host.c | 8 +- arch/powerpc/mm/mmu_context_book3s64.c| 8 +- arch/powerpc/mm/slb_low.S | 54 +-- 5 files changed, 150 insertions(+), 64 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 37dbc9becaba..3897d30820b0 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -39,6 +39,7 @@ /* Bits in the SLB VSID word */ #define SLB_VSID_SHIFT 12 +#define SLB_VSID_SHIFT_256MSLB_VSID_SHIFT #define SLB_VSID_SHIFT_1T 24 #define SLB_VSID_SSIZE_SHIFT 62 #define SLB_VSID_B ASM_CONST(0xc000) @@ -516,9 +517,19 @@ extern void slb_set_size(u16 size); * because of the modulo operation in vsid scramble. */ +/* + * Max Va bits we support as of now is 68 bits. We want 19 bit + * context ID. + * Restrictions: + * GPU has restrictions of not able to access beyond 128TB + * (47 bit effective address). We also cannot do more than 20bit PID. + * For p4 and p5 which can only do 65 bit VA, we restrict our CONTEXT_BITS + * to 16 bits (ie, we can only have 2^16 pids at the same time).
+ */ +#define VA_BITS68 #define CONTEXT_BITS 19 -#define ESID_BITS 18 -#define ESID_BITS_1T 6 +#define ESID_BITS (VA_BITS - (SID_SHIFT + CONTEXT_BITS)) +#define ESID_BITS_1T (VA_BITS - (SID_SHIFT_1T + CONTEXT_BITS)) #define ESID_BITS_MASK ((1 << ESID_BITS) - 1) #define ESID_BITS_1T_MASK ((1 << ESID_BITS_1T) - 1) @@ -528,62 +539,52 @@ extern void slb_set_size(u16 size); * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments * available for user + kernel mapping. The bottom 4 contexts are used for * kernel mapping. Each segment contains 2^28 bytes. Each - * context maps 2^46 bytes (64TB). + * context maps 2^49 bytes (512TB). * * We also need to avoid the last segment of the last context, because that * would give a protovsid of 0x1f. That will result in a VSID 0 * because of the modulo operation in vsid scramble. */ #define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 2) +/* + * For platforms that support on 65bit VA we limit the context bits + */ +#define MAX_USER_CONTEXT_65BIT_VA ((ASM_CONST(1) << (65 - (SID_SHIFT + ESID_BITS))) - 2) /* * This should be computed such that protovosid * vsid_mulitplier * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus + * We also need to make sure that number of bits in divisor is less + * than twice the number of protovsid bits for our modulus optmization to work. + * The below table shows the current values used. 
+ * + * |---++++--| + * | | Prime Bits | VSID_BITS_65VA | Total Bits | 2* VSID_BITS | + * |---++++--| + * | 1T| 24 | 25 | 49 | 50 | + * |---++++--| + * | 256MB | 24 | 37 | 61 | 74 | + * |---++++--| + * + * |---++++--| + * | | Prime Bits | VSID_BITS_68VA | Total Bits | 2* VSID_BITS | + * |---++++--| + * | 1T| 24 | 28 | 52 | 56 | + * |---++++--| + * | 256MB | 24 | 40 | 64 | 80 | + * |---++++--| + * */ #define VSID_MULTIPLIER_256M ASM_CONST(12538073) /* 24-bit prime */ -#define VSID_BITS_256M (CONTEXT_BITS + ESID_BITS) -#define VSID_MODULUS_256M ((1UL<= \ -* 2^36-1, then r3+1 has the 2^36 bit set. So, if r3+1 has \ -* the bit clear, r3 already has the answer we want, if it \ -* doesn't, the answer is the low 36 bits of r3+1. So in all \ -* cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\ - addirx,rt,1;\ - srdirx,rx,VSID_BITS_##size; /*
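The "modulo operation in vsid scramble" and the 2*VSID_BITS column above are about reducing modulo 2^n - 1 by folding high bits onto low bits, which only needs one fold when the product stays under 2^(2n). A hedged C sketch of the same reduction:

```c
/* Reduce x modulo (2^n - 1) by folding: since 2^n ≡ 1 (mod 2^n - 1),
 * x = q*2^n + r ≡ q + r. One fold plus a fix-up suffices when
 * x < 2^(2n), which is the property the VSID_BITS table guarantees;
 * a loop is used here so the sketch works for any x. */
unsigned long mod_mersenne(unsigned long x, unsigned int n)
{
	unsigned long m = (1UL << n) - 1;

	while (x > m)
		x = (x >> n) + (x & m);
	return (x == m) ? 0 : x;
}
```

The final fix-up is why protovsid values that would scramble to 0 have to be avoided, as the comments above note.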
[PATCH V5 05/17] powerpc/mm/slice: Move slice_mask struct definition to slice.c
This structure definition need not be in a header, since it is used only by the slice.c file, so move it to slice.c. This also allows us to use SLICE_NUM_HIGH instead of 64. I also switch the low_slices type from u16 to u64. This doesn't have an impact on the size of the struct, due to the padding added with the u16 type, and it helps in using the bitmap printing functions for printing the slice mask.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/page_64.h | 11 --- arch/powerpc/mm/slice.c| 10 +- 2 files changed, 9 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index bd55ff751938..c4d9654bd637 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -99,17 +99,6 @@ extern u64 ppc64_pft_size; #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) #ifndef __ASSEMBLY__ -/* - * One bit per slice. We have lower slices which cover 256MB segments - * upto 4G range. That gets us 16 low slices. For the rest we track slices - * in 1TB size. - * 64 below is actually SLICE_NUM_HIGH to fixup complie errros - */ -struct slice_mask { - u16 low_slices; - DECLARE_BITMAP(high_slices, 64); -}; - struct mm_struct; extern unsigned long slice_get_unmapped_area(unsigned long addr, diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index b40b4111ca13..e1c42b54f5c5 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -37,7 +37,15 @@ #include static DEFINE_SPINLOCK(slice_convert_lock); - +/* + * One bit per slice. We have lower slices which cover 256MB segments + * upto 4G range. That gets us 16 low slices. For the rest we track slices + * in 1TB size. + */ +struct slice_mask { + u64 low_slices; + DECLARE_BITMAP(high_slices, SLICE_NUM_HIGH); +}; #ifdef DEBUG int _slice_debug = 1; -- 2.7.4
[PATCH V5 04/17] powerpc/mm: Remove redundant TASK_SIZE_USER64 checks
The check against the VSID range is implied when we check the task size against the hash and radix pgtable ranges[1], because we make sure the page table range cannot exceed the VSID range. [1] BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE); BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE); The check for a smaller task size is also removed here, because the follow-up patch will support a task size smaller than the pgtable range.

Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/init_64.c| 4 arch/powerpc/mm/pgtable_64.c | 5 - 2 files changed, 9 deletions(-) diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 9be992083d2a..8f6f2a173e47 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -71,10 +71,6 @@ #if H_PGTABLE_RANGE > USER_VSID_RANGE #warning Limited user VSID range means pagetable space is wasted #endif - -#if (TASK_SIZE_USER64 < H_PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) -#warning TASK_SIZE is smaller than it needs to be. -#endif #endif /* CONFIG_PPC_STD_MMU_64 */ phys_addr_t memstart_addr = ~0; diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index db93cf747a03..3e0cfb6420dd 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -56,11 +56,6 @@ #include "mmu_decl.h" -#ifdef CONFIG_PPC_STD_MMU_64 -#if TASK_SIZE_USER64 > (1UL << (ESID_BITS + SID_SHIFT)) -#error TASK_SIZE_USER64 exceeds user VSID range -#endif -#endif #ifdef CONFIG_PPC_BOOK3S_64 /* -- 2.7.4
[PATCH V5 03/17] powerpc/mm: Move copy_mm_to_paca to paca.c
We also update the function arg to struct mm_struct. Move this so that function finds the definition of struct mm_struct. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/paca.h | 18 +- arch/powerpc/kernel/paca.c | 19 +++ arch/powerpc/mm/hash_utils_64.c | 4 ++-- arch/powerpc/mm/slb.c | 2 +- arch/powerpc/mm/slice.c | 2 +- 5 files changed, 24 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 708c3e592eeb..f48c250339fd 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -206,23 +206,7 @@ struct paca_struct { #endif }; -#ifdef CONFIG_PPC_BOOK3S -static inline void copy_mm_to_paca(mm_context_t *context) -{ - get_paca()->mm_ctx_id = context->id; -#ifdef CONFIG_PPC_MM_SLICES - get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; - memcpy(&get_paca()->mm_ctx_high_slices_psize, - &context->high_slices_psize, SLICE_ARRAY_SIZE); -#else - get_paca()->mm_ctx_user_psize = context->user_psize; - get_paca()->mm_ctx_sllp = context->sllp; -#endif -} -#else -static inline void copy_mm_to_paca(mm_context_t *context){} -#endif - +extern void copy_mm_to_paca(struct mm_struct *mm); extern struct paca_struct *paca; extern void initialise_paca(struct paca_struct *new_paca, int cpu); extern void setup_paca(struct paca_struct *new_paca); diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index dfc479df9634..e2cf745a4b94 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -245,3 +245,22 @@ void __init free_unused_pacas(void) free_lppacas(); } + +void copy_mm_to_paca(struct mm_struct *mm) +{ +#ifdef CONFIG_PPC_BOOK3S + mm_context_t *context = &mm->context; + + get_paca()->mm_ctx_id = context->id; +#ifdef CONFIG_PPC_MM_SLICES + get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; + memcpy(&get_paca()->mm_ctx_high_slices_psize, + &context->high_slices_psize, SLICE_ARRAY_SIZE); +#else 
/* CONFIG_PPC_MM_SLICES */ + get_paca()->mm_ctx_user_psize = context->user_psize; + get_paca()->mm_ctx_sllp = context->sllp; +#endif +#else /* CONFIG_PPC_BOOK3S */ + return; +#endif +} diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index c554768b1fa2..1bbf572243f7 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1120,7 +1120,7 @@ void demote_segment_4k(struct mm_struct *mm, unsigned long addr) copro_flush_all_slbs(mm); if ((get_paca_psize(addr) != MMU_PAGE_4K) && (current->mm == mm)) { - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); slb_flush_and_rebolt(); } } @@ -1192,7 +1192,7 @@ static void check_paca_psize(unsigned long ea, struct mm_struct *mm, { if (user_region) { if (psize != get_paca_psize(ea)) { - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); slb_flush_and_rebolt(); } } else if (get_paca()->vmalloc_sllp != diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c index 5e01b2ece1d0..98ae810b8c21 100644 --- a/arch/powerpc/mm/slb.c +++ b/arch/powerpc/mm/slb.c @@ -229,7 +229,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm) asm volatile("slbie %0" : : "r" (slbie_data)); get_paca()->slb_cache_ptr = 0; - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); /* * preload some userspace segments into the SLB. diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 2396db6b3a05..b40b4111ca13 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -192,7 +192,7 @@ static void slice_flush_segments(void *parm) if (mm != current->active_mm) return; - copy_mm_to_paca(¤t->active_mm->context); + copy_mm_to_paca(current->active_mm); local_irq_save(flags); slb_flush_and_rebolt(); -- 2.7.4
[PATCH V5 02/17] powerpc/mm/slice: Update the function prototype
This avoid copying the slice_mask struct as function return value Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 62 ++--- 1 file changed, 28 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index b3919b2e76d3..2396db6b3a05 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -75,19 +75,18 @@ static void slice_print_mask(const char *label, struct slice_mask mask) {} #endif -static struct slice_mask slice_range_to_mask(unsigned long start, -unsigned long len) +static void slice_range_to_mask(unsigned long start, unsigned long len, + struct slice_mask *ret) { unsigned long end = start + len - 1; - struct slice_mask ret; - ret.low_slices = 0; - bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + ret->low_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); if (start < SLICE_LOW_TOP) { unsigned long mend = min(end, (SLICE_LOW_TOP - 1)); - ret.low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1)) + ret->low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1)) - (1u << GET_LOW_SLICE_INDEX(start)); } @@ -96,9 +95,8 @@ static struct slice_mask slice_range_to_mask(unsigned long start, unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT)); unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - start_index; - bitmap_set(ret.high_slices, start_index, count); + bitmap_set(ret->high_slices, start_index, count); } - return ret; } static int slice_area_is_free(struct mm_struct *mm, unsigned long addr, @@ -132,53 +130,47 @@ static int slice_high_has_vma(struct mm_struct *mm, unsigned long slice) return !slice_area_is_free(mm, start, end - start); } -static struct slice_mask slice_mask_for_free(struct mm_struct *mm) +static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret) { - struct slice_mask ret; unsigned long i; - ret.low_slices = 0; - bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + ret->low_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); for (i = 0; i < 
SLICE_NUM_LOW; i++) if (!slice_low_has_vma(mm, i)) - ret.low_slices |= 1u << i; + ret->low_slices |= 1u << i; if (mm->task_size <= SLICE_LOW_TOP) - return ret; + return; for (i = 0; i < SLICE_NUM_HIGH; i++) if (!slice_high_has_vma(mm, i)) - __set_bit(i, ret.high_slices); - - return ret; + __set_bit(i, ret->high_slices); } -static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize) +static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_mask *ret) { unsigned char *hpsizes; int index, mask_index; - struct slice_mask ret; unsigned long i; u64 lpsizes; - ret.low_slices = 0; - bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + ret->low_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); lpsizes = mm->context.low_slices_psize; for (i = 0; i < SLICE_NUM_LOW; i++) if (((lpsizes >> (i * 4)) & 0xf) == psize) - ret.low_slices |= 1u << i; + ret->low_slices |= 1u << i; hpsizes = mm->context.high_slices_psize; for (i = 0; i < SLICE_NUM_HIGH; i++) { mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize) - __set_bit(i, ret.high_slices); + __set_bit(i, ret->high_slices); } - - return ret; } static int slice_check_fit(struct slice_mask mask, struct slice_mask available) @@ -460,7 +452,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* First make up a "good" mask of slices that have the right size * already */ - good_mask = slice_mask_for_size(mm, psize); + slice_mask_for_size(mm, psize, &good_mask); slice_print_mask(" good_mask", good_mask); /* @@ -485,7 +477,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, #ifdef CONFIG_PPC_64K_PAGES /* If we support combo pages, we can allow 64k pages in 4k slices */ if (psize == MMU_PAGE_64K) { - compat_mask = slice_mask_for_size(mm, MMU_PAGE_4K); + slice_mask_for_size(mm, MMU_PAGE_4K, &compat_mask); if (fixed) slice_or_mask(&good_mask, &compat_mask); } @@ -494,7 +486,7 @@ unsigned long 
slice_g
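The conversion in this patch replaces struct-by-value returns with an out-parameter, which avoids copying the (soon-to-be-larger) slice_mask on every call. A simplified userspace sketch of the same pattern, with an invented mask layout for illustration:

```c
#include <assert.h>
#include <stdint.h>

struct slice_mask {
	uint64_t low_slices;
	unsigned long high_slices[2];
};

/* Old style: the whole struct is copied out through the return value. */
static struct slice_mask make_mask_by_value(void)
{
	struct slice_mask m = { .low_slices = 0x3, .high_slices = { 0, 0 } };
	return m;
}

/* New style: the caller provides the storage and the callee fills it in,
 * so no temporary copy of the struct travels through the return value. */
static void make_mask_by_ref(struct slice_mask *ret)
{
	ret->low_slices = 0x3;
	ret->high_slices[0] = 0;
	ret->high_slices[1] = 0;
}
```

The by-reference form also composes naturally with callers that already hold a slice_mask on their stack, which is exactly how slice_get_unmapped_area() uses good_mask and compat_mask above.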
[PATCH V5 01/17] powerpc/mm/slice: Convert slice_mask high slice to a bitmap
In followup patch we want to increase the va range which will result in us requiring high_slices to have more than 64 bits. To enable this convert high_slices to bitmap. We keep the number bits same in this patch and later change that to higher value Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/page_64.h | 15 ++--- arch/powerpc/mm/slice.c| 110 + 2 files changed, 80 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index 3e83d2a20b6f..bd55ff751938 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -98,19 +98,16 @@ extern u64 ppc64_pft_size; #define GET_LOW_SLICE_INDEX(addr) ((addr) >> SLICE_LOW_SHIFT) #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) +#ifndef __ASSEMBLY__ /* - * 1 bit per slice and we have one slice per 1TB - * Right now we support only 64TB. - * IF we change this we will have to change the type - * of high_slices + * One bit per slice. We have lower slices which cover 256MB segments + * upto 4G range. That gets us 16 low slices. For the rest we track slices + * in 1TB size. 
+ * 64 below is actually SLICE_NUM_HIGH to fixup complie errros */ -#define SLICE_MASK_SIZE 8 - -#ifndef __ASSEMBLY__ - struct slice_mask { u16 low_slices; - u64 high_slices; + DECLARE_BITMAP(high_slices, 64); }; struct mm_struct; diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index bf150557dba8..b3919b2e76d3 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -36,11 +36,6 @@ #include #include -/* some sanity checks */ -#if (H_PGTABLE_RANGE >> 43) > SLICE_MASK_SIZE -#error H_PGTABLE_RANGE exceeds slice_mask high_slices size -#endif - static DEFINE_SPINLOCK(slice_convert_lock); @@ -49,7 +44,7 @@ int _slice_debug = 1; static void slice_print_mask(const char *label, struct slice_mask mask) { - char*p, buf[16 + 3 + 64 + 1]; + char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1]; int i; if (!_slice_debug) @@ -60,8 +55,12 @@ static void slice_print_mask(const char *label, struct slice_mask mask) *(p++) = ' '; *(p++) = '-'; *(p++) = ' '; - for (i = 0; i < SLICE_NUM_HIGH; i++) - *(p++) = (mask.high_slices & (1ul << i)) ? 
'1' : '0'; + for (i = 0; i < SLICE_NUM_HIGH; i++) { + if (test_bit(i, mask.high_slices)) + *(p++) = '1'; + else + *(p++) = '0'; + } *(p++) = 0; printk(KERN_DEBUG "%s:%s\n", label, buf); @@ -80,7 +79,10 @@ static struct slice_mask slice_range_to_mask(unsigned long start, unsigned long len) { unsigned long end = start + len - 1; - struct slice_mask ret = { 0, 0 }; + struct slice_mask ret; + + ret.low_slices = 0; + bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); if (start < SLICE_LOW_TOP) { unsigned long mend = min(end, (SLICE_LOW_TOP - 1)); @@ -89,10 +91,13 @@ static struct slice_mask slice_range_to_mask(unsigned long start, - (1u << GET_LOW_SLICE_INDEX(start)); } - if ((start + len) > SLICE_LOW_TOP) - ret.high_slices = (1ul << (GET_HIGH_SLICE_INDEX(end) + 1)) - - (1ul << GET_HIGH_SLICE_INDEX(start)); + if ((start + len) > SLICE_LOW_TOP) { + unsigned long start_index = GET_HIGH_SLICE_INDEX(start); + unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT)); + unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - start_index; + bitmap_set(ret.high_slices, start_index, count); + } return ret; } @@ -129,9 +134,12 @@ static int slice_high_has_vma(struct mm_struct *mm, unsigned long slice) static struct slice_mask slice_mask_for_free(struct mm_struct *mm) { - struct slice_mask ret = { 0, 0 }; + struct slice_mask ret; unsigned long i; + ret.low_slices = 0; + bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + for (i = 0; i < SLICE_NUM_LOW; i++) if (!slice_low_has_vma(mm, i)) ret.low_slices |= 1u << i; @@ -141,7 +149,7 @@ static struct slice_mask slice_mask_for_free(struct mm_struct *mm) for (i = 0; i < SLICE_NUM_HIGH; i++) if (!slice_high_has_vma(mm, i)) - ret.high_slices |= 1ul << i; + __set_bit(i, ret.high_slices); return ret; } @@ -150,10 +158,13 @@ static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize) { unsigned char *hpsizes; int index, mask_index; - struct slice_mask ret = { 0, 0 }; + struct slice_mask ret; unsigned long i; u64 lpsizes; + 
ret.low_s
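The essential change in this patch is that tests like `high_slices & (1ul << i)` become `test_bit(i, high_slices)`, which keeps working once the mask grows past 64 bits. A userspace sketch of the bit helpers (simplified and non-atomic, mirroring the semantics of the kernel's __set_bit/test_bit, with an illustrative 128-slice mask):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG 64
#define DECLARE_BITMAP(name, bits) \
	unsigned long name[((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG]

static inline void __set_bit(unsigned long nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static inline int test_bit(unsigned long nr, const unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* With 128 high slices the mask spans two longs; a single u64 could
 * not represent slice 64 and above, which is the point of the series. */
struct slice_mask {
	uint16_t low_slices;
	DECLARE_BITMAP(high_slices, 128);
};
```

Indexing into the right word and shifting within it is all bitmap_set()/bitmap_zero() do in bulk, which is why the patch can swap them in mechanically.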
[PATCH V5 00/17] powerpc/mm/ppc64: Add 128TB support
This patch series increases the effective virtual address range of applications from 64TB to 128TB. We do that by supporting a 68-bit virtual address. On platforms that can only do a 65-bit virtual address we limit the max contexts to a 16-bit value instead of 19. The patch series also switches the page table layout such that we can do a 512TB effective address. Userspace can ask for allocations from the full 512TB address space by specifying a hint address (with or without MAP_FIXED) above 128TB. Changes from V4: * Rebase to latest upstream * Add support for 512TB mmap. Changes from V3: * Rebase to latest upstream * Fixes based on testing Changes from V2: * Handle hugepage size correctly. Aneesh Kumar K.V (17): powerpc/mm/slice: Convert slice_mask high slice to a bitmap powerpc/mm/slice: Update the function prototype powerpc/mm: Move copy_mm_to_paca to paca.c powerpc/mm: Remove redundant TASK_SIZE_USER64 checks powerpc/mm/slice: Move slice_mask struct definition to slice.c powerpc/mm/slice: Update slice mask printing to use bitmap printing.
powerpc/mm/hash: Move kernel context to the starting of context range powerpc/mm/hash: Support 68 bit VA powerpc/mm/hash: VSID 0 is no more an invalid VSID powerpc/mm/hash: Convert mask to unsigned long powerpc/mm/hash: Increase VA range to 128TB powerpc/mm/slice: Use mm task_size as max value of slice index powerpc/mm/hash: Store task size in PACA powerpc/mm/hash: Skip using reserved virtual address range powerpc/mm: Switch TASK_SIZE check to use mm->task_size mm: Let arch choose the initial value of task size powerpc/mm: Enable mappings above 128TB arch/powerpc/include/asm/book3s/64/hash-4k.h | 2 +- arch/powerpc/include/asm/book3s/64/hash-64k.h | 2 +- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 178 ++ arch/powerpc/include/asm/kvm_book3s_64.h | 2 - arch/powerpc/include/asm/mmu.h| 19 +- arch/powerpc/include/asm/mmu_context.h| 3 +- arch/powerpc/include/asm/paca.h | 22 +-- arch/powerpc/include/asm/page_64.h| 14 -- arch/powerpc/include/asm/processor.h | 45 - arch/powerpc/kernel/asm-offsets.c | 4 + arch/powerpc/kernel/paca.c| 21 +++ arch/powerpc/kernel/setup-common.c| 9 + arch/powerpc/kvm/book3s_64_mmu_host.c | 10 +- arch/powerpc/mm/hash_utils_64.c | 86 +++-- arch/powerpc/mm/hugetlbpage-radix.c | 11 +- arch/powerpc/mm/init_64.c | 4 - arch/powerpc/mm/mmap.c| 41 +++-- arch/powerpc/mm/mmu_context_book3s64.c| 130 ++--- arch/powerpc/mm/pgtable-hash64.c | 1 - arch/powerpc/mm/pgtable_64.c | 5 - arch/powerpc/mm/slb.c | 2 +- arch/powerpc/mm/slb_low.S | 82 ++--- arch/powerpc/mm/slice.c | 253 -- arch/powerpc/mm/subpage-prot.c| 3 +- arch/powerpc/mm/tlb_hash64.c | 1 - fs/exec.c | 10 +- 26 files changed, 631 insertions(+), 329 deletions(-) -- 2.7.4
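The opt-in rule from the cover letter, where only a hint address above 128TB unlocks the full 512TB range, can be sketched as a simple policy check. The constants and the function below are illustrative of the behaviour the series describes, not code taken from it:

```c
#include <assert.h>
#include <stdint.h>

#define TB (1ULL << 40)
/* Default window kept for compatibility, and the new ceiling. */
#define DEFAULT_MAP_WINDOW (128 * TB)
#define MAX_EA             (512 * TB)

/* Returns the address-space limit an mmap() request may draw from:
 * a hint at or above 128TB opts the caller in to the full 512TB
 * range; anything else stays within the compatible 128TB window. */
static inline uint64_t mmap_upper_limit(uint64_t hint_addr)
{
	if (hint_addr >= DEFAULT_MAP_WINDOW)
		return MAX_EA;
	return DEFAULT_MAP_WINDOW;
}
```

Keeping the default window at 128TB means existing applications that stash tags in high pointer bits keep working unless they explicitly ask for more.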
Re: [v2 PATCH 4/4] powernv: Recover correct PACA on wakeup from a stop on P9 DD1
On Tue, 21 Mar 2017 02:59:46 +1000
Nicholas Piggin wrote:
> On Mon, 20 Mar 2017 21:24:18 +0530
> This is quite neat now you've moved it to its own function. Nice.
> It will be only a trivial clash with my patches now, I think.
>
> Reviewed-by: Nicholas Piggin

Hmm... This won't actually work for machine check wakeups either. We're doing the low level machine check before checking for wakeup, which complicates things a little bit. Simplest way to handle that might be to just always immediately call the restore paca function at machine check entry, and then just always mark machine checks as not-recoverable for DD1.

Thanks,
Nick
Re: [PATCH kernel v10 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
On Fri, Mar 17, 2017 at 04:09:53PM +1100, Alexey Kardashevskiy wrote: > So far iommu_table obejcts were only used in virtual mode and had > a single owner. We are going to change this by implementing in-kernel > acceleration of DMA mapping requests. The proposed acceleration > will handle requests in real mode and KVM will keep references to tables. > > This adds a kref to iommu_table and defines new helpers to update it. > This replaces iommu_free_table() with iommu_tce_table_put() and makes > iommu_free_table() static. iommu_tce_table_get() is not used in this patch > but it will be in the following patch. > > Since this touches prototypes, this also removes @node_name parameter as > it has never been really useful on powernv and carrying it for > the pseries platform code to iommu_free_table() seems to be quite > useless as well. > > This should cause no behavioral change. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson > --- > Changes: > v10: > * iommu_tce_table_get() can fail now if a table is being destroyed, will be > used in 10/10 > * iommu_tce_table_put() returns what kref_put() returned > * iommu_tce_table_put() got WARN_ON(!tbl) as the callers already check > for it and do not call _put() when tbl==NULL > > v9: > * s/iommu_table_get/iommu_tce_table_get/ and > s/iommu_table_put/iommu_tce_table_put/ -- so I removed r-b/a-b > --- > arch/powerpc/include/asm/iommu.h | 5 +++-- > arch/powerpc/kernel/iommu.c | 27 ++- > arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++--- > arch/powerpc/platforms/powernv/pci.c | 1 + > arch/powerpc/platforms/pseries/iommu.c| 3 ++- > arch/powerpc/platforms/pseries/vio.c | 2 +- > drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- > 7 files changed, 37 insertions(+), 17 deletions(-) > > diff --git a/arch/powerpc/include/asm/iommu.h > b/arch/powerpc/include/asm/iommu.h > index 4554699aec02..d96142572e6d 100644 > --- a/arch/powerpc/include/asm/iommu.h > +++ b/arch/powerpc/include/asm/iommu.h > @@ -119,6 +119,7 @@ struct 
iommu_table { > struct list_head it_group_list;/* List of iommu_table_group_link */ > unsigned long *it_userspace; /* userspace view of the table */ > struct iommu_table_ops *it_ops; > + struct krefit_kref; > }; > > #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ > @@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device > *dev) > > extern int dma_iommu_dma_supported(struct device *dev, u64 mask); > > -/* Frees table for an individual device node */ > -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); > +extern struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl); > +extern int iommu_tce_table_put(struct iommu_table *tbl); > > /* Initializes an iommu_table based in values set in the passed-in > * structure > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c > index bc142d87130f..af915da5e03a 100644 > --- a/arch/powerpc/kernel/iommu.c > +++ b/arch/powerpc/kernel/iommu.c > @@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table > *tbl, int nid) > return tbl; > } > > -void iommu_free_table(struct iommu_table *tbl, const char *node_name) > +static void iommu_table_free(struct kref *kref) > { > unsigned long bitmap_sz; > unsigned int order; > + struct iommu_table *tbl; > > - if (!tbl) > - return; > + tbl = container_of(kref, struct iommu_table, it_kref); > > if (tbl->it_ops->free) > tbl->it_ops->free(tbl); > @@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char > *node_name) > > /* verify that table contains no entries */ > if (!bitmap_empty(tbl->it_map, tbl->it_size)) > - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); > + pr_warn("%s: Unexpected TCEs\n", __func__); > > /* calculate bitmap size in bytes */ > bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long); > @@ -748,7 +748,24 @@ void iommu_free_table(struct iommu_table *tbl, const > char *node_name) > /* free table */ > kfree(tbl); > } > 
-EXPORT_SYMBOL_GPL(iommu_free_table); > + > +struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl) > +{ > + if (kref_get_unless_zero(&tbl->it_kref)) > + return tbl; > + > + return NULL; > +} > +EXPORT_SYMBOL_GPL(iommu_tce_table_get); > + > +int iommu_tce_table_put(struct iommu_table *tbl) > +{ > + if (WARN_ON(!tbl)) > + return 0; > + > + return kref_put(&tbl->it_kref, iommu_table_free); > +} > +EXPORT_SYMBOL_GPL(iommu_tce_table_put); > > /* Creates TCEs for a user provided buffer. The user buffer must be > * contiguous real kernel storage (not vmalloc). The address passed here > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > b/arch/powerpc/platforms/powernv/pci-ioda.c > index
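The get_unless_zero / put pairing in the patch above is the standard kref pattern: a lookup can race with the final put, so iommu_tce_table_get() must refuse to revive a table whose count has already hit zero. A userspace sketch of those semantics, using a plain counter with no atomics or locking:

```c
#include <assert.h>
#include <stddef.h>

struct table {
	int refcount; /* stands in for struct kref */
	int freed;    /* set when the release callback has run */
};

/* Mirrors kref_get_unless_zero(): a dying object cannot be revived,
 * so the lookup fails instead of handing out a stale pointer. */
static struct table *table_get(struct table *t)
{
	if (t->refcount == 0)
		return NULL;
	t->refcount++;
	return t;
}

/* Mirrors kref_put(): returns 1 when this put released the object. */
static int table_put(struct table *t)
{
	if (--t->refcount == 0) {
		t->freed = 1; /* where iommu_table_free() would run */
		return 1;
	}
	return 0;
}
```

Returning the kref_put() result from iommu_tce_table_put(), as v10 does, lets callers learn whether their put was the one that triggered the free.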
[v6] powerpc/powernv: add hdat attribute to sysfs
The HDAT data area is consumed by skiboot and turned into a device-tree. In some cases we would like to look directly at the HDAT, so this patch adds a sysfs node to allow it to be viewed. This is not possible through /dev/mem as it is reserved memory which is stopped by the /dev/mem filter. Signed-off-by: Matt Brown --- Changelog v6 - attribute names are stored locally, removing potential null pointer errors - added of_node_put for the corresponding of_find_node - folded exports node creation into opal_export_attr() - fixed kzalloc flags to GFP_KERNEL - fixed struct array indexing - fixed error message --- arch/powerpc/platforms/powernv/opal.c | 84 +++ 1 file changed, 84 insertions(+) diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c index 2822935..953537e 100644 --- a/arch/powerpc/platforms/powernv/opal.c +++ b/arch/powerpc/platforms/powernv/opal.c @@ -604,6 +604,87 @@ static void opal_export_symmap(void) pr_warn("Error %d creating OPAL symbols file\n", rc); } +static ssize_t export_attr_read(struct file *fp, struct kobject *kobj, +struct bin_attribute *bin_attr, char *buf, +loff_t off, size_t count) +{ + return memory_read_from_buffer(buf, count, &off, bin_attr->private, + bin_attr->size); +} + +static struct bin_attribute *exported_attrs; +static char **attr_name; +/* + * opal_export_attrs: creates a sysfs node for each property listed in + * the device-tree under /ibm,opal/firmware/exports/ + * All new sysfs nodes are created under /opal/exports/. + * This allows for reserved memory regions (e.g. HDAT) to be read. + * The new sysfs nodes are only readable by root. 
+ */ +static void opal_export_attrs(void) +{ + /* /sys/firmware/opal/exports */ + struct kobject *opal_export_kobj; + + struct bin_attribute *attr_tmp; + const __be64 *syms; + unsigned int size; + struct device_node *fw; + struct property *prop; + int rc; + int attr_count = 0; + int n = 0; + + /* Create new 'exports' directory */ + opal_export_kobj = kobject_create_and_add("exports", opal_kobj); + if (!opal_export_kobj) { + pr_warn("kobject_create_and_add opal_exports failed\n"); + return; + } + + fw = of_find_node_by_path("/ibm,opal/firmware/exports"); + if (!fw) + return; + + for (prop = fw->properties; prop != NULL; prop = prop->next) + attr_count++; + + if (attr_count > 2) { + exported_attrs = kzalloc(sizeof(exported_attrs)*(attr_count-2), + GFP_KERNEL); + attr_name = kzalloc(sizeof(char *)*(attr_count-2), GFP_KERNEL); + } + + for_each_property_of_node(fw, prop) { + + attr_name[n] = kstrdup(prop->name, GFP_KERNEL); + syms = of_get_property(fw, attr_name[n], &size); + + if (!strcmp(attr_name[n], "name") || + !strcmp(attr_name[n], "phandle")) + continue; + + if (!syms || size != 2 * sizeof(__be64)) + continue; + + attr_tmp = &exported_attrs[n]; + attr_tmp->attr.name = attr_name[n]; + attr_tmp->attr.mode = 0400; + attr_tmp->read = export_attr_read; + attr_tmp->private = __va(be64_to_cpu(syms[0])); + attr_tmp->size = be64_to_cpu(syms[1]); + + rc = sysfs_create_bin_file(opal_export_kobj, attr_tmp); + if (rc) + pr_warn("Error %d creating OPAL sysfs exports/%s file\n", + rc, attr_name[n]); + n++; + } + + of_node_put(fw); + +} + static void __init opal_dump_region_init(void) { void *addr; @@ -742,6 +823,9 @@ static int __init opal_init(void) opal_msglog_sysfs_init(); } + /* Export all properties */ + opal_export_attrs(); + /* Initialize platform devices: IPMI backend, PRD & flash interface */ opal_pdev_init("ibm,opal-ipmi"); opal_pdev_init("ibm,opal-flash"); -- 2.9.3
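export_attr_read() in the patch above is just a bounded, offset-based copy out of the exported region; memory_read_from_buffer() clamps the request to the buffer. A simplified userspace sketch of that helper's behaviour (the signature is an approximation of the kernel helper, and the end-of-buffer handling is reduced to returning 0):

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Copies at most count bytes from buf[*ppos .. size) into to,
 * advancing *ppos; returns the number of bytes copied, or 0 once
 * the offset is at or past the end of the buffer. */
static ssize_t memory_read_from_buffer(void *to, size_t count, off_t *ppos,
				       const void *buf, size_t size)
{
	off_t pos = *ppos;

	if (pos < 0 || (size_t)pos >= size)
		return 0;
	if (count > size - (size_t)pos)
		count = size - (size_t)pos;
	memcpy(to, (const char *)buf + pos, count);
	*ppos = pos + count;
	return (ssize_t)count;
}
```

This clamping is what makes the sysfs bin_attribute safe to read in arbitrary chunks: short reads near the end of the HDAT region come back truncated, and reads past it return 0.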
[PATCH v2 3/9] powerpc: mpc52xx_gpt: make use of raw_spinlock variants
The mpc52xx_gpt code currently implements an irq_chip for handling interrupts; due to how irq_chip handling is done, it's necessary for the irq_chip methods to be invoked from hardirq context, even on a a real-time kernel. Because the spinlock_t type becomes a "sleeping" spinlock w/ RT kernels, it is not suitable to be used with irq_chips. A quick audit of the operations under the lock reveal that they do only minimal, bounded work, and are therefore safe to do under a raw spinlock. Signed-off-by: Julia Cartwright --- v1 -> v2: - No change. arch/powerpc/platforms/52xx/mpc52xx_gpt.c | 52 +++ 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/platforms/52xx/mpc52xx_gpt.c b/arch/powerpc/platforms/52xx/mpc52xx_gpt.c index 22645a7c6b8a..18c1383717f2 100644 --- a/arch/powerpc/platforms/52xx/mpc52xx_gpt.c +++ b/arch/powerpc/platforms/52xx/mpc52xx_gpt.c @@ -90,7 +90,7 @@ struct mpc52xx_gpt_priv { struct list_head list; /* List of all GPT devices */ struct device *dev; struct mpc52xx_gpt __iomem *regs; - spinlock_t lock; + raw_spinlock_t lock; struct irq_domain *irqhost; u32 ipb_freq; u8 wdt_mode; @@ -141,9 +141,9 @@ static void mpc52xx_gpt_irq_unmask(struct irq_data *d) struct mpc52xx_gpt_priv *gpt = irq_data_get_irq_chip_data(d); unsigned long flags; - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); setbits32(&gpt->regs->mode, MPC52xx_GPT_MODE_IRQ_EN); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); } static void mpc52xx_gpt_irq_mask(struct irq_data *d) @@ -151,9 +151,9 @@ static void mpc52xx_gpt_irq_mask(struct irq_data *d) struct mpc52xx_gpt_priv *gpt = irq_data_get_irq_chip_data(d); unsigned long flags; - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); clrbits32(&gpt->regs->mode, MPC52xx_GPT_MODE_IRQ_EN); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); } static void 
mpc52xx_gpt_irq_ack(struct irq_data *d) @@ -171,14 +171,14 @@ static int mpc52xx_gpt_irq_set_type(struct irq_data *d, unsigned int flow_type) dev_dbg(gpt->dev, "%s: virq=%i type=%x\n", __func__, d->irq, flow_type); - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); reg = in_be32(&gpt->regs->mode) & ~MPC52xx_GPT_MODE_ICT_MASK; if (flow_type & IRQF_TRIGGER_RISING) reg |= MPC52xx_GPT_MODE_ICT_RISING; if (flow_type & IRQF_TRIGGER_FALLING) reg |= MPC52xx_GPT_MODE_ICT_FALLING; out_be32(&gpt->regs->mode, reg); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); return 0; } @@ -264,11 +264,11 @@ mpc52xx_gpt_irq_setup(struct mpc52xx_gpt_priv *gpt, struct device_node *node) /* If the GPT is currently disabled, then change it to be in Input * Capture mode. If the mode is non-zero, then the pin could be * already in use for something. */ - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); mode = in_be32(&gpt->regs->mode); if ((mode & MPC52xx_GPT_MODE_MS_MASK) == 0) out_be32(&gpt->regs->mode, mode | MPC52xx_GPT_MODE_MS_IC); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); dev_dbg(gpt->dev, "%s() complete. virq=%i\n", __func__, cascade_virq); } @@ -295,9 +295,9 @@ mpc52xx_gpt_gpio_set(struct gpio_chip *gc, unsigned int gpio, int v) dev_dbg(gpt->dev, "%s: gpio:%d v:%d\n", __func__, gpio, v); r = v ? 
MPC52xx_GPT_MODE_GPIO_OUT_HIGH : MPC52xx_GPT_MODE_GPIO_OUT_LOW; - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); clrsetbits_be32(&gpt->regs->mode, MPC52xx_GPT_MODE_GPIO_MASK, r); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); } static int mpc52xx_gpt_gpio_dir_in(struct gpio_chip *gc, unsigned int gpio) @@ -307,9 +307,9 @@ static int mpc52xx_gpt_gpio_dir_in(struct gpio_chip *gc, unsigned int gpio) dev_dbg(gpt->dev, "%s: gpio:%d\n", __func__, gpio); - spin_lock_irqsave(&gpt->lock, flags); + raw_spin_lock_irqsave(&gpt->lock, flags); clrbits32(&gpt->regs->mode, MPC52xx_GPT_MODE_GPIO_MASK); - spin_unlock_irqrestore(&gpt->lock, flags); + raw_spin_unlock_irqrestore(&gpt->lock, flags); return 0; } @@ -436,16 +436,16 @@ static int mpc52xx_gpt_do_start(struct mpc52xx_gpt_priv *gpt, u64 period, } /* Set and enable the timer, reject an attempt to use a wdt as gpt */ - spin_lock_irqsave(&gpt->lock, flags); + raw_spi
Re: Optimised memset64/memset32 for powerpc
On Tue, 2017-03-21 at 06:29 -0700, Matthew Wilcox wrote:
>
> Well, those are the generic versions in the first patch:
>
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b9776ac925199969bd5af4e994da776d461e7
>
> so if those are good enough for you guys, there's no need for you to
> do anything.
>
> Thanks for your time!

I suspect on ppc64 we can do much better, if anything by moving 64 bits at a time.

Matthew, what are the main use cases of these ?

Cheers,
Ben.
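Moving 64 bits at a time, as Ben suggests, is essentially what a generic memset64 does. A minimal portable sketch (the real kernel versions use arch-specific store instructions; this is only the generic fallback shape):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills count 64-bit words with v: one doubleword store per element
 * instead of memset()'s byte-oriented fill, and no per-byte splat of
 * the pattern is needed when the pattern is already 64 bits wide. */
static void *memset64(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *p = s;

	while (count--)
		*p++ = v;
	return s;
}
```

The win over plain memset() is not the store width (memset is vectorized anyway) but that memset can only replicate a single byte, while memset64 fills with an arbitrary 64-bit pattern, e.g. a repeated page-table entry.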
Re: [PATCH v2 2/4] asm-generic/io.h: Remove unused generic __ioremap() definition
On Tue, Mar 21, 2017 at 11:37:11AM +0100, Geert Uytterhoeven wrote:
> Hi Björn,
>
> On Mon, Mar 20, 2017 at 7:42 PM, Bjorn Helgaas wrote:
> > Several arches use __ioremap() to help implement the generic ioremap(),
> > ioremap_nocache(), and ioremap_wc() interfaces, but this usage is all
> > inside the arch/ directory.
> >
> > The only __ioremap() uses outside arch/ are in the ZorroII RAM disk driver
> > and some framebuffer drivers that are only buildable on m68k and powerpc,
> > and they use the versions provided by those arches.
> >
> > There's no need for a generic version of __ioremap(), so remove it.
>
> These all predate the ioremap_*() variants, and can be converted to
> either ioremap_nocache() or ioremap_wt().
>
> However, PPC doesn't implement ioremap_wt() yet, so asm-generic will
> fall back to the less-efficient nocache variant.
> PPC does support __ioremap(..., _PAGE_WRITETHRU), so adding a wrapper
> is trivial.

Thanks, I'll try adding ioremap_wt() (at least for PPC32) and cleaning this up.

> > Signed-off-by: Bjorn Helgaas
> > Reviewed-by: Arnd Bergmann
>
> Regardless,
> Acked-by: Geert Uytterhoeven
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
Re: [PATCH kernel v10 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Fri, 17 Mar 2017 16:09:59 +1100 Alexey Kardashevskiy wrote: > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO > without passing them to user space, which saves time on switching > to user space and back. > > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. > KVM tries to handle a TCE request in real mode; if that fails, > it passes the request to the virtual mode to complete the operation. > If a virtual mode handler fails, the request is passed to > the user space; this is not expected to happen though. > > To avoid dealing with page use counters (which is tricky in real mode), > this only accelerates SPAPR TCE IOMMU v2 clients which are required > to pre-register the userspace memory. The very first TCE request will > be handled in the VFIO SPAPR TCE driver anyway as the userspace view > of the TCE table (iommu_table::it_userspace) is not allocated till > the very first mapping happens and we cannot call vmalloc in real mode. > > If we fail to update a hardware IOMMU table for an unexpected reason, we just > clear it and move on as there is nothing really we can do about it - > for example, if we hot plug a VFIO device to a guest, existing TCE tables > will be mirrored automatically to the hardware and there is no interface > to report to the guest about possible failures. > > This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to > the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd > and associates a physical IOMMU table with the SPAPR TCE table (which > is a guest view of the hardware IOMMU table). The iommu_table object > is cached and referenced so we do not have to look it up in real mode.
> > This does not implement the UNSET counterpart as there is no use for it - > once the acceleration is enabled, the existing userspace won't > disable it unless a VFIO container is destroyed; this adds necessary > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. > > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > space. > > This adds a real mode version of WARN_ON_ONCE() as the generic version > causes problems with rcu_sched. Since we are testing what vmalloc_to_phys() > returns in the code, this also adds a check for already existing > vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). > > This finally makes use of vfio_external_user_iommu_id() which was > introduced quite some time ago and was considered for removal. > > Tests show that this patch increases transmission speed from 220MB/s > to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card). > > Signed-off-by: Alexey Kardashevskiy > --- > Changes: > v10: > * fixed leaking references in virt/kvm/vfio.c > * moved code to helpers - kvm_vfio_group_get_iommu_group, > kvm_spapr_tce_release_vfio_group > * fixed possible race between referencing table and destroying it via > VFIO add/remove window ioctls() > > v9: > * removed referencing a group in KVM, only referencing iommu_table's now > * fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler > * fixed typo in vfio.txt > * removed @argsz and @flags from struct kvm_vfio_spapr_tce > > v8: > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed > to handle them > * changed vmalloc_to_phys() callers to return H_HARDWARE > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD > and added a comment about this in the code > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE > and do WARN_ON > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to > have all vmalloc_to_phys() callsites covered > > v7: > * added realmode-friendly WARN_ON_ONCE_RM > > v6: > * 
changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() > * moved kvmppc_gpa_to_ua() to TCE validation > > v5: > * changed error codes in multiple places > * added bunch of WARN_ON() in places which should not really happen > * added a check that an iommu table is not attached already to LIOBN > * dropped explicit calls to iommu_tce_clear_param_check/ > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate > call them anyway (since the previous patch) > * if we fail to update a hardware IOMMU table for an unexpected reason, > this just clears the entry > > v4: > * added note to the commit log about allowing multiple updates of > the same IOMMU table; > * instead of checking for if any memory was preregistered, this > returns H_TOO_HARD if a specific page was not; > * fixed comments from v3 about error handling in many places; > * simplified TCE handlers and merged IOMMU parts inline - for example, > there used to be kvmppc_h_put_tce_iommu(), now it is merged into > kvmppc_h_put_tce(); this allows checking IOBA boundaries against > the first attached table only (makes the code simpler); > > v3: > * simplified not to use VFIO group notifiers > * reworked cleanup, should be clea
Re: Optimised memset64/memset32 for powerpc
On Tue, Mar 21, 2017 at 06:29:10AM -0700, Matthew Wilcox wrote: > > Unrolling the loop could help a bit on old powerpc32s that don't have branch > > units, but on those processors the main driver is the time spent to do the > > effective write to memory, and the operations necessary to unroll the loop > > are not worth the cycle added by the branch. > > > > On more modern powerpc32s, the branch unit implies that branches have a zero > > cost. > > Fair enough. I'm just surprised it was worth unrolling the loop on > powerpc64 and not on powerpc32 -- see mem_64.S. We can do at most one loop iteration per cycle, but we can do multiple stores per cycle, on modern, bigger CPUs. Many old or small CPUs have only one load/store unit on the other hand. There are other issues, but that is the biggest difference. Segher
[PATCH V3 10/10] powerpc/mm: Move hash specific pte bits to be top bits of RPN
We don't support the full 57 bits of physical address and hence can overload the top bits of RPN as hash specific pte bits. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash.h| 17 + arch/powerpc/include/asm/book3s/64/pgtable.h | 17 ++--- arch/powerpc/mm/hash_native_64.c | 1 + 3 files changed, 20 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index ec2828b1db07..4e957b027fe0 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -6,20 +6,13 @@ * Common bits between 4K and 64K pages in a linux-style PTE. * Additional bits may be defined in pgtable-hash64-*.h * - * Note: We only support user read/write permissions. Supervisor always - * have full read/write to pages above PAGE_OFFSET (pages below that - * always use the user access permissions). - * - * We could create separate kernel read-only if we used the 3 PP bits - * combinations that newer processors provide but we currently don't. 
*/ -#define H_PAGE_BUSY_RPAGE_SW1 /* software: PTE & hash are busy */ #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS -#define H_PAGE_F_GIX_SHIFT 57 -/* (7ul << 57) HPTE index within HPTEG */ -#define H_PAGE_F_GIX (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4) -#define H_PAGE_F_SECOND_RPAGE_RSV1 /* HPTE is in 2ndary HPTEG */ -#define H_PAGE_HASHPTE _RPAGE_SW0 /* PTE has associated HPTE */ +#define H_PAGE_F_GIX_SHIFT 56 +#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are busy */ +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) +#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */ #ifdef CONFIG_PPC_64K_PAGES #include diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 881fa7060b13..20c51b84c656 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -36,8 +36,22 @@ #define _RPAGE_RSV20x0800UL #define _RPAGE_RSV30x0400UL #define _RPAGE_RSV40x0200UL + +#define _PAGE_PTE 0x4000UL/* distinguishes PTEs from pointers */ +#define _PAGE_PRESENT 0x8000UL/* pte contains a translation */ + +/* + * Top and bottom bits of RPN which can be used by hash + * translation mode, because we expect them to be zero + * otherwise. 
+ */ #define _RPAGE_RPN00x01000 #define _RPAGE_RPN10x02000 +#define _RPAGE_RPN44 0x0100UL +#define _RPAGE_RPN43 0x0080UL +#define _RPAGE_RPN42 0x0040UL +#define _RPAGE_RPN41 0x0020UL + /* Max physical address bit as per radix table */ #define _RPAGE_PA_MAX 57 /* @@ -63,9 +77,6 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ - -#define _PAGE_PTE 0x4000UL/* distinguishes PTEs from pointers */ -#define _PAGE_PRESENT 0x8000UL/* pte contains a translation */ /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE * Instead of fixing all of them, add an alternate define which diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c index cc332608e656..917a5a336441 100644 --- a/arch/powerpc/mm/hash_native_64.c +++ b/arch/powerpc/mm/hash_native_64.c @@ -246,6 +246,7 @@ static long native_hpte_insert(unsigned long hpte_group, unsigned long vpn, __asm__ __volatile__ ("ptesync" : : : "memory"); + BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3))); return i | (!!(vflags & HPTE_V_SECONDARY) << 3); } -- 2.7.4
[PATCH V3 09/10] powerpc/mm: Lower the max real address to 53 bits
The max value supported by hardware is a 51-bit address. The radix page table defines a slot of 57 bits for future expansion. We restrict the value supported in the linux kernel to 53 bits, so that we can use the bits between 57 and 53 for storing hash linux page table bits. This is done in the next patch. This will free up the software page table bits to be used for features that are needed for both hash and radix. The current hash linux page table format doesn't have any free software bits. Moving the hash linux page table specific bits to the top of the RPN field frees up the software bits for other purposes. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 96566df547a8..881fa7060b13 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -38,6 +38,28 @@ #define _RPAGE_RSV40x0200UL #define _RPAGE_RPN00x01000 #define _RPAGE_RPN10x02000 +/* Max physical address bit as per radix table */ +#define _RPAGE_PA_MAX 57 +/* + * Max physical address bit we will use for now. + * + * This is mostly a hardware limitation and for now Power9 has + * a 51 bit limit. + * + * This is different from the number of physical bit required to address + * the last byte of memory. That is defined by MAX_PHYSMEM_BITS. + * MAX_PHYSMEM_BITS is a linux limitation imposed by the maximum + * number of sections we can support (SECTIONS_SHIFT). + * + * This is different from Radix page table limitation above and + * should always be less than that. The limit is done such that + * we can overload the bits between _RPAGE_PA_MAX and _PAGE_PA_MAX + * for hash linux page table specific bits. 
+ * + * In order to be compatible with future hardware generations we keep + * some offsets and limit this for now to 53 + */ +#define _PAGE_PA_MAX 53 #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ @@ -51,10 +73,11 @@ */ #define _PAGE_NO_CACHE _PAGE_TOLERANT /* - * We support 57 bit real address in pte. Clear everything above 57, and - * every thing below PAGE_SHIFT; + * We support _RPAGE_PA_MAX bit real address in pte. On the linux side + * we are limited by _PAGE_PA_MAX. Clear everything above _PAGE_PA_MAX + * and every thing below PAGE_SHIFT; */ -#define PTE_RPN_MASK (((1UL << 57) - 1) & (PAGE_MASK)) +#define PTE_RPN_MASK (((1UL << _PAGE_PA_MAX) - 1) & (PAGE_MASK)) /* * set of bits not changed in pmd_modify. Even though we have hash specific bits * in here, on radix we expect them to be zero. -- 2.7.4
[PATCH V3 08/10] powerpc/mm: Define all PTE bits based on radix definitions.
Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-64k.h | 4 ++-- arch/powerpc/include/asm/book3s/64/pgtable.h | 2 ++ 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index b39f0b86405e..7be54f9590a3 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -10,8 +10,8 @@ * 64k aligned address free up few of the lower bits of RPN for us * We steal that here. For more deatils look at pte_pfn/pfn_pte() */ -#define H_PAGE_COMBO 0x1000 /* this is a combo 4k page */ -#define H_PAGE_4K_PFN 0x2000 /* PFN is for a single 4k page */ +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ /* * We need to differentiate between explicit huge page and THP huge * page, since THP huge page also need to track real subpage details diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4d4ff9a324f0..96566df547a8 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -36,6 +36,8 @@ #define _RPAGE_RSV20x0800UL #define _RPAGE_RSV30x0400UL #define _RPAGE_RSV40x0200UL +#define _RPAGE_RPN00x01000 +#define _RPAGE_RPN10x02000 #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ -- 2.7.4
[PATCH V3 07/10] powerpc/mm: Define _PAGE_SOFT_DIRTY unconditionally
Conditional PTE bit definition is confusing and results in coding error. Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 4 1 file changed, 4 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index c39bc4cb9247..4d4ff9a324f0 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -37,11 +37,7 @@ #define _RPAGE_RSV30x0400UL #define _RPAGE_RSV40x0200UL -#ifdef CONFIG_MEM_SOFT_DIRTY #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ -#else -#define _PAGE_SOFT_DIRTY 0x0 -#endif #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ #define _PAGE_PTE 0x4000UL/* distinguishes PTEs from pointers */ -- 2.7.4
[PATCH V3 06/10] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
Without this if firmware reports 1MB page size support we will crash trying to use 1MB as hugetlb page size. echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19! . [c000e2c27b30] c029dae8 .hugetlb_fault+0x638/0xda0 [c000e2c27c30] c026fb64 .handle_mm_fault+0x844/0x1d70 [c000e2c27d70] c004805c .do_page_fault+0x3dc/0x7c0 [c000e2c27e30] c000ac98 handle_page_fault+0x10/0x30 With fix, we don't enable 1MB as hugepage size. bash-4.2# cd /sys/kernel/mm/hugepages/ bash-4.2# ls hugepages-16384kB hugepages-16777216kB Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/hugetlbpage.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 8c3389cbcd12..a4f33de4008e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -753,6 +753,24 @@ static int __init add_huge_page_size(unsigned long long size) if ((mmu_psize = shift_to_mmu_psize(shift)) < 0) return -EINVAL; +#ifdef CONFIG_PPC_BOOK3S_64 + /* +* We need to make sure that for different page sizes reported by +* firmware we only add hugetlb support for page sizes that can be +* supported by linux page table layout. +* For now we have +* Radix: 2M +* Hash: 16M and 16G +*/ + if (radix_enabled()) { + if (mmu_psize != MMU_PAGE_2M) + return -EINVAL; + } else { + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) + return -EINVAL; + } +#endif + BUG_ON(mmu_psize_defs[mmu_psize].shift != shift); /* Return if huge page size has already been setup */ -- 2.7.4
[PATCH V3 05/10] powerpc/mm: Add translation mode information in /proc/cpuinfo
With this we have on powernv and pseries /proc/cpuinfo reporting timebase: 51200 platform: PowerNV model : 8247-22L machine : PowerNV 8247-22L firmware: OPAL MMU : Hash Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/platforms/powernv/setup.c | 4 arch/powerpc/platforms/pseries/setup.c | 4 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index d50c7d99baaf..2d937f6d9260 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -95,6 +95,10 @@ static void pnv_show_cpuinfo(struct seq_file *m) else seq_printf(m, "firmware\t: BML\n"); of_node_put(root); + if (radix_enabled()) + seq_printf(m, "MMU\t\t: Radix\n"); + else + seq_printf(m, "MMU\t\t: Hash\n"); } static void pnv_prepare_going_down(void) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index b4d362ed03a1..b5d86426e97b 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -87,6 +87,10 @@ static void pSeries_show_cpuinfo(struct seq_file *m) model = of_get_property(root, "model", NULL); seq_printf(m, "machine\t\t: CHRP %s\n", model); of_node_put(root); + if (radix_enabled()) + seq_printf(m, "MMU\t\t: Radix\n"); + else + seq_printf(m, "MMU\t\t: Hash\n"); } /* Initialize firmware assisted non-maskable interrupts if -- 2.7.4
[PATCH V3 04/10] powerpc/mm/radix: rename _PAGE_LARGE to R_PAGE_LARGE
This bit is only used by radix and it is nice to follow the naming style of having bit name start with H_/R_ depending on which translation mode they are used. No functional change in this patch. Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hugetlb.h | 2 +- arch/powerpc/include/asm/book3s/64/radix.h | 4 ++-- arch/powerpc/mm/tlb-radix.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index c62f14d0bec1..cd366596 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -46,7 +46,7 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, */ VM_WARN_ON(page_shift == mmu_psize_defs[MMU_PAGE_1G].shift); if (page_shift == mmu_psize_defs[MMU_PAGE_2M].shift) - return __pte(pte_val(entry) | _PAGE_LARGE); + return __pte(pte_val(entry) | R_PAGE_LARGE); else return entry; } diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 2a2ea47a9bd2..ac16d1943022 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -14,7 +14,7 @@ /* * For P9 DD1 only, we need to track whether the pte's huge. 
*/ -#define _PAGE_LARGE_RPAGE_RSV1 +#define R_PAGE_LARGE _RPAGE_RSV1 #ifndef __ASSEMBLY__ @@ -258,7 +258,7 @@ static inline int radix__pmd_trans_huge(pmd_t pmd) static inline pmd_t radix__pmd_mkhuge(pmd_t pmd) { if (cpu_has_feature(CPU_FTR_POWER9_DD1)) - return __pmd(pmd_val(pmd) | _PAGE_PTE | _PAGE_LARGE); + return __pmd(pmd_val(pmd) | _PAGE_PTE | R_PAGE_LARGE); return __pmd(pmd_val(pmd) | _PAGE_PTE); } static inline void radix__pmdp_huge_split_prepare(struct vm_area_struct *vma, diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c index 952713d6cf04..83dc1ccc2fa1 100644 --- a/arch/powerpc/mm/tlb-radix.c +++ b/arch/powerpc/mm/tlb-radix.c @@ -437,7 +437,7 @@ void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct mm_struct *mm, return; } - if (old_pte & _PAGE_LARGE) + if (old_pte & R_PAGE_LARGE) radix__flush_tlb_page_psize(mm, address, MMU_PAGE_2M); else radix__flush_tlb_page_psize(mm, address, mmu_virtual_psize); -- 2.7.4
[PATCH V3 03/10] powerpc/mm: Cleanup bits definition between hash and radix.
Define everything based on bits present in pgtable.h. This will help in easily identifying overlapping bits between hash/radix. No functional change with this patch. Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-64k.h | 4 arch/powerpc/include/asm/book3s/64/hash.h | 9 + arch/powerpc/include/asm/book3s/64/pgtable.h | 10 ++ arch/powerpc/include/asm/book3s/64/radix.h| 6 ++ 4 files changed, 17 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index f3dd21efa2ea..b39f0b86405e 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -6,6 +6,10 @@ #define H_PUD_INDEX_SIZE 5 #define H_PGD_INDEX_SIZE 12 +/* + * 64k aligned address free up few of the lower bits of RPN for us + * We steal that here. For more deatils look at pte_pfn/pfn_pte() + */ #define H_PAGE_COMBO 0x1000 /* this is a combo 4k page */ #define H_PAGE_4K_PFN 0x2000 /* PFN is for a single 4k page */ /* diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index f7b721bbf918..ec2828b1db07 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -13,12 +13,13 @@ * We could create separate kernel read-only if we used the 3 PP bits * combinations that newer processors provide but we currently don't. 
*/ -#define H_PAGE_BUSY0x00800 /* software: PTE & hash are busy */ +#define H_PAGE_BUSY_RPAGE_SW1 /* software: PTE & hash are busy */ #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS #define H_PAGE_F_GIX_SHIFT 57 -#define H_PAGE_F_GIX (7ul << 57) /* HPTE index within HPTEG */ -#define H_PAGE_F_SECOND(1ul << 60) /* HPTE is in 2ndary HPTEG */ -#define H_PAGE_HASHPTE (1ul << 61) /* PTE has associated HPTE */ +/* (7ul << 57) HPTE index within HPTEG */ +#define H_PAGE_F_GIX (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4) +#define H_PAGE_F_SECOND_RPAGE_RSV1 /* HPTE is in 2ndary HPTEG */ +#define H_PAGE_HASHPTE _RPAGE_SW0 /* PTE has associated HPTE */ #ifdef CONFIG_PPC_64K_PAGES #include diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8f4d41936e5a..c39bc4cb9247 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -44,14 +44,8 @@ #endif #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ -/* - * For P9 DD1 only, we need to track whether the pte's huge. - */ -#define _PAGE_LARGE_RPAGE_RSV1 - - -#define _PAGE_PTE (1ul << 62) /* distinguishes PTEs from pointers */ -#define _PAGE_PRESENT (1ul << 63) /* pte contains a translation */ +#define _PAGE_PTE 0x4000UL/* distinguishes PTEs from pointers */ +#define _PAGE_PRESENT 0x8000UL/* pte contains a translation */ /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE * Instead of fixing all of them, add an alternate define which diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 9e0bb7cd6e22..2a2ea47a9bd2 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -11,6 +11,12 @@ #include #endif +/* + * For P9 DD1 only, we need to track whether the pte's huge. + */ +#define _PAGE_LARGE_RPAGE_RSV1 + + #ifndef __ASSEMBLY__ #include #include -- 2.7.4
[PATCH V3 02/10] powerpc/mm/slice: Fix off-by-1 error when computing slice mask
For low slice, max addr should be less than 4G. Without limiting this correctly we will end up with a low slice mask which has 17th bit set. This is not a problem with the current code because our low slice mask is of type u16. But in later patch I am switching low slice mask to u64 type and having the 17bit set result in wrong slice mask which in turn results in mmap failures. Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 2b27458902ee..bf150557dba8 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -83,11 +83,10 @@ static struct slice_mask slice_range_to_mask(unsigned long start, struct slice_mask ret = { 0, 0 }; if (start < SLICE_LOW_TOP) { - unsigned long mend = min(end, SLICE_LOW_TOP); - unsigned long mstart = min(start, SLICE_LOW_TOP); + unsigned long mend = min(end, (SLICE_LOW_TOP - 1)); ret.low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1)) - - (1u << GET_LOW_SLICE_INDEX(mstart)); + - (1u << GET_LOW_SLICE_INDEX(start)); } if ((start + len) > SLICE_LOW_TOP) -- 2.7.4
[PATCH V3 01/10] powerpc/mm/nohash: MM_SLICE is only used by book3s 64
BOOKE code is dead code as per the Kconfig details. So make it simpler by enabling MM_SLICE only for book3s_64. The changes w.r.t. nohash are just removing dead code. W.r.t ppc64, 4k without hugetlb will now enable MM_SLICE. But that is good, because we reduce one extra variant which probably is not getting tested much. Reviewed-by: Paul Mackerras Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/mmu-book3e.h| 5 - arch/powerpc/include/asm/nohash/64/pgtable.h | 5 - arch/powerpc/mm/hugetlbpage-book3e.c | 7 --- arch/powerpc/mm/mmu_context_nohash.c | 5 - arch/powerpc/platforms/Kconfig.cputype | 2 +- 5 files changed, 1 insertion(+), 23 deletions(-) diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h index b62a8d43a06c..7ca8d8e80ffa 100644 --- a/arch/powerpc/include/asm/mmu-book3e.h +++ b/arch/powerpc/include/asm/mmu-book3e.h @@ -229,11 +229,6 @@ typedef struct { unsigned intid; unsigned intactive; unsigned long vdso_base; -#ifdef CONFIG_PPC_MM_SLICES - u64 low_slices_psize; /* SLB page size encodings */ - u64 high_slices_psize; /* 4 bits per slice for now */ - u16 user_psize; /* page size index */ -#endif #ifdef CONFIG_PPC_64K_PAGES /* for 4K PTE fragment support */ void *pte_frag; diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h index c7f927e67d14..f0ff384d4ca5 100644 --- a/arch/powerpc/include/asm/nohash/64/pgtable.h +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h @@ -88,11 +88,6 @@ #include #include -#ifdef CONFIG_PPC_MM_SLICES -#define HAVE_ARCH_UNMAPPED_AREA -#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN -#endif /* CONFIG_PPC_MM_SLICES */ - #ifndef __ASSEMBLY__ /* pte_clear moved to later in this file */ diff --git a/arch/powerpc/mm/hugetlbpage-book3e.c b/arch/powerpc/mm/hugetlbpage-book3e.c index 83a8be791e06..bfe4e8526b2d 100644 --- a/arch/powerpc/mm/hugetlbpage-book3e.c +++ b/arch/powerpc/mm/hugetlbpage-book3e.c @@ -148,16 +148,9 @@ void 
book3e_hugetlb_preload(struct vm_area_struct *vma, unsigned long ea, mm = vma->vm_mm; -#ifdef CONFIG_PPC_MM_SLICES - psize = get_slice_psize(mm, ea); - tsize = mmu_get_tsize(psize); - shift = mmu_psize_defs[psize].shift; -#else psize = vma_mmu_pagesize(vma); shift = __ilog2(psize); tsize = shift - 10; -#endif - /* * We can't be interrupted while we're setting up the MAS * regusters or after we've confirmed that no tlb exists. diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index c491f2c8f2b9..4554d6527682 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -333,11 +333,6 @@ int init_new_context(struct task_struct *t, struct mm_struct *mm) mm->context.id = MMU_NO_CONTEXT; mm->context.active = 0; - -#ifdef CONFIG_PPC_MM_SLICES - slice_set_user_psize(mm, mmu_virtual_psize); -#endif - return 0; } diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 99b0ae8acb78..a7c0c1fafe68 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -359,7 +359,7 @@ config PPC_BOOK3E_MMU config PPC_MM_SLICES bool - default y if (!PPC_FSL_BOOK3E && PPC64 && HUGETLB_PAGE) || (PPC_STD_MMU_64 && PPC_64K_PAGES) + default y if PPC_STD_MMU_64 default n config PPC_HAVE_PMU_SUPPORT -- 2.7.4
Re: [PATCH] powerpc: sysdev: cpm1: Optimise gpio bit calculation
Le 10/03/2017 à 16:41, Segher Boessenkool a écrit : On Fri, Mar 10, 2017 at 03:41:23PM +0100, Christophe LEROY wrote: gpio_get() and gpio_set() are used extensively by some GPIO based drivers like SPI, NAND, so it may be worth it as it doesn't impair readability (if anyone prefers, we could write (1 << 31) >> i instead of 0x80000000 >> i ) 1 << 31 is undefined behaviour, of course. Shall it be 1U << 31 ? Sure, that works. "1 << (31 - i)" is most readable (but it doesn't yet generate the code you want). Euh I'm a bit lost. Do you mean the form we have today in the driver is wrong ? Heh, yes. But it's okay with GCC, so don't worry about it. The point is that "0x80000000 >> i" is less readable. FYI, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80131 Christophe
Re: Optimised memset64/memset32 for powerpc
On Tue, Mar 21, 2017 at 01:23:36PM +0100, Christophe LEROY wrote: > > It doesn't look free for you as you only store one register each time > > around the loop in the 32-bit memset implementation: > > > > 1: stwur4,4(r6) > > bdnz1b > > > > (wouldn't you get better performance on 32-bit powerpc by unrolling that > > loop like you do on 64-bit?) > > In arch/powerpc/lib/copy_32.S, the implementation of memset() is optimised > when the value to be set is zero. It makes use of the 'dcbz' instruction > which zeroizes a complete cache line. > > Not much effort has been put on optimising non-zero memset() because there > are almost none. Yes, bzero() is much more common than setting an 8-bit pattern. And setting an 8-bit pattern is almost certainly more common than setting a 32 or 64 bit pattern. > Unrolling the loop could help a bit on old powerpc32s that don't have branch > units, but on those processors the main driver is the time spent to do the > effective write to memory, and the operations necessary to unroll the loop > are not worth the cycle added by the branch. > > On more modern powerpc32s, the branch unit implies that branches have a zero > cost. Fair enough. I'm just surprised it was worth unrolling the loop on powerpc64 and not on powerpc32 -- see mem_64.S. > A simple static inline C function would probably do the job, based on what I > get below: > > void memset32(int *p, int v, unsigned int c) > { > int i; > > for (i = 0; i < c; i++) > *p++ = v; > } > > void memset64(long long *p, long long v, unsigned int c) > { > int i; > > for (i = 0; i < c; i++) > *p++ = v; > } Well, those are the generic versions in the first patch: http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b9776ac925199969bd5af4e994da776d461e7 so if those are good enough for you guys, there's no need for you to do anything. Thanks for your time!
Re: Optimised memset64/memset32 for powerpc
Hi Matthew Le 20/03/2017 à 22:14, Matthew Wilcox a écrit : I recently introduced memset32() / memset64(). I've done implementations for x86 & ARM; akpm has agreed to take the patchset through his tree. Do you fancy doing a powerpc version? Minchan Kim got a 7% performance increase with zram from switching to the optimised version on x86. Here's the development git tree: http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill (most recent 7 commits) ARM probably offers the best model for you to work from; it's basically just a case of jumping into the middle of your existing memset loop. It was only three instructions to add support to ARM, but I don't know PowerPC well enough to understand how your existing memset works. I'd start with something like this ... note that you don't have to implement memset64 on 32-bit; I only did it on ARM because it was free. It doesn't look free for you as you only store one register each time around the loop in the 32-bit memset implementation: 1: stwur4,4(r6) bdnz1b (wouldn't you get better performance on 32-bit powerpc by unrolling that loop like you do on 64-bit?) In arch/powerpc/lib/copy_32.S, the implementation of memset() is optimised when the value to be set is zero. It makes use of the 'dcbz' instruction which zeroizes a complete cache line. Not much effort has been put on optimising non-zero memset() because there are almost none. Unrolling the loop could help a bit on old powerpc32s that don't have branch units, but on those processors the main driver is the time spent to do the effective write to memory, and the operations necessary to unroll the loop are not worth the cycle added by the branch. On more modern powerpc32s, the branch unit implies that branches have a zero cost. 
A simple static inline C function would probably do the job, based on what I get below: void memset32(int *p, int v, unsigned int c) { int i; for (i = 0; i < c; i++) *p++ = v; } void memset64(long long *p, long long v, unsigned int c) { int i; for (i = 0; i < c; i++) *p++ = v; } test.o: file format elf32-powerpc Disassembly of section .text: : 0: 2c 05 00 00 cmpwi r5,0 4: 38 63 ff fc addir3,r3,-4 8: 4d 82 00 20 beqlr c: 7c a9 03 a6 mtctr r5 10: 94 83 00 04 stwur4,4(r3) 14: 42 00 ff fc bdnz10 18: 4e 80 00 20 blr 001c : 1c: 2c 07 00 00 cmpwi r7,0 20: 7c cb 33 78 mr r11,r6 24: 7c aa 2b 78 mr r10,r5 28: 38 63 ff f8 addir3,r3,-8 2c: 4d 82 00 20 beqlr 30: 7c e9 03 a6 mtctr r7 34: 95 43 00 08 stwur10,8(r3) 38: 91 63 00 04 stw r11,4(r3) 3c: 42 00 ff f8 bdnz34 40: 4e 80 00 20 blr Christophe diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h index da3cdffca440..c02392fced98 100644 --- a/arch/powerpc/include/asm/string.h +++ b/arch/powerpc/include/asm/string.h @@ -6,6 +6,7 @@ #define __HAVE_ARCH_STRNCPY #define __HAVE_ARCH_STRNCMP #define __HAVE_ARCH_MEMSET +#define __HAVE_ARCH_MEMSET_PLUS #define __HAVE_ARCH_MEMCPY #define __HAVE_ARCH_MEMMOVE #define __HAVE_ARCH_MEMCMP @@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t); extern int memcmp(const void *,const void *,__kernel_size_t); extern void * memchr(const void *,int,__kernel_size_t); +extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t); +static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n) +{ + return __memset32(p, v, n * 4); +} + +extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t); +static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n) +{ + return __memset64(p, v, n * 8); +} + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_STRING_H */
Re: [1/6] powerpc/64s: machine check print NIP
On Tue, 2017-03-14 at 12:36:43 UTC, Nicholas Piggin wrote: > Print the faulting address of the machine check, which may help with > debugging. The effective address reported can be a target memory address > rather than the faulting instruction address. > > Fix up a dangling bracket while here. > > Signed-off-by: Nicholas Piggin Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/fc84427b7e1471b0a1220c56072f57 cheers
Re: [v2, 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings
On Sun, 2017-03-12 at 13:17:00 UTC, Geert Uytterhoeven wrote: > Submitters of device tree binding documentation may forget to CC > the subsystem maintainer if this is missing. > > Signed-off-by: Geert Uytterhoeven > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > Cc: linuxppc-dev@lists.ozlabs.org Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/58f169139e9692e290a7d5d9b034b7 cheers
Re: [RESEND] powerpc/pseries: move struct hcall_stats to c file
On Tue, 2017-03-07 at 09:32:42 UTC, tcharding wrote: > struct hcall_stats is only used in hvCall_inst.c. > > Move struct hcall_stats to hvCall_inst.c > > Signed-off-by: Tobin C. Harding Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/017614a5d6c09ec9e0dc3fd46a5018 cheers
Re: powerpc/ftrace: add ftrace.h, fix sparse warning
On Mon, 2017-03-06 at 08:49:46 UTC, "Tobin C. Harding" wrote: > Sparse emits warning: symbol 'prepare_ftrace_return' was not > declared. Should it be static? prepare_ftrace_return() is called > from assembler and should not be static. Adding a header file > declaring the function will fix the sparse warning while adding > documentation to the call. > > Add header file ftrace.h with single function declaration. Protect > declaration with preprocessor guard so it may be included in > assembly. Include new header in all files that call > prepare_ftrace_return() and in ftrace.c where function is defined. > > Signed-off-by: Tobin C. Harding Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b3a7864c6feb0fb30bc2cd37265707 cheers
Re: powerpc: fix sparse warning, include kernel header
On Mon, 2017-03-06 at 08:25:31 UTC, "Tobin C. Harding" wrote: > Spares emits two symbol not declared warnings. The two functions in > question are declared already in a kernel header. > > Add include directive to include kernel header. > > Signed-off-by: Tobin C. Harding Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/1fc439c81312cd27aed553964c0d9d cheers
Re: [v2,1/2] powerpc: Move THREAD_SHIFT config to KConfig
On Fri, 2017-02-24 at 00:52:09 UTC, Hamish Martin wrote: > Shift the logic for defining THREAD_SHIFT logic to Kconfig in order to > allow override by users. > > Signed-off-by: Hamish Martin > Reviewed-by: Chris Packham Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/476134070c037820bd909ff6e43e0d cheers
Re: [kernel] powerpc/powernv/npu: Remove dead iommu code
On Tue, 2017-02-21 at 02:40:20 UTC, Alexey Kardashevskiy wrote: > PNV_IODA_PE_DEV is only used for NPU devices (emulated PCI bridges > representing NVLink). These are added to IOMMU groups with corresponding > NVIDIA devices after all non-NPU PEs are setup; a special helper - > pnv_pci_ioda_setup_iommu_api() - handles this in pnv_pci_ioda_fixup(). > > The pnv_pci_ioda2_setup_dma_pe() helper sets up DMA for a PE. It is called > for VFs (so it does not handle NPU case) and PCI bridges but only > IODA1 and IODA2 types. An NPU bridge has its own type id (PNV_PHB_NPU) > so pnv_pci_ioda2_setup_dma_pe() cannot be called on NPU and therefore > (pe->flags & PNV_IODA_PE_DEV) is always "false". > > This removes not used iommu_add_device(). This should not cause any > behavioral change. > > Signed-off-by: Alexey Kardashevskiy > Acked-by: Gavin Shan > Reviewed-by: David Gibson Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/20f13b95eef1b3ca75535e313357b0 cheers
Re: [kernel] powerpc/powernv: Fix it_ops::get() callback to return in cpu endian
On Tue, 2017-02-21 at 02:38:54 UTC, Alexey Kardashevskiy wrote: > The iommu_table_ops callbacks are declared CPU endian as they take and > return "unsigned long"; underlying hardware tables are big-endian. > > However get() was missing be64_to_cpu(), this adds the missing conversion. > > The only caller of this is crash dump at arch/powerpc/kernel/iommu.c, > iommu_table_clear() which only compares TCE to zero so this change > should not cause behavioral change. > > Signed-off-by: Alexey Kardashevskiy > Reviewed-by: David Gibson > Acked-by: Gavin Shan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/81d5fe1a3b1acfaadc7921d08609e0 cheers
Re: [1/3] powerpc/mm: move mmap_sem unlock up from do_sigbus
On Tue, 2017-02-14 at 16:45:10 UTC, Laurent Dufour wrote: > Move mmap_sem releasing in the do_sigbus()'s unique caller : mm_fault_error() > > No functional changes. > > Signed-off-by: Laurent Dufour Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/c2294e0ffe741c8b34c630a71c7dc4 cheers
Re: [1/2] selftests/powerpc: Refactor the AUXV routines
On Mon, 2017-02-06 at 10:13:27 UTC, Michael Ellerman wrote: > Refactor the AUXV routines so they are more composable. In a future test > we want to look for many AUXV entries and we don't want to have to read > /proc/self/auxv each time. > > Signed-off-by: Michael Ellerman Series applied to powerpc next. https://git.kernel.org/powerpc/c/e3028437cb45c04a9caae4d6372bfe cheers
Re: [2/2] powerpc: Fix missing CRCs, add yet more asm-prototypes.h declarations
On Fri, 2016-12-02 at 02:38:38 UTC, Ben Hutchings wrote: > Add declarations for: > - __mfdcr, __mtdcr (if CONFIG_PPC_DCR_NATIVE=y; through ) > - switch_mmu_context (if CONFIG_PPC_BOOK3S_64=n; through ) > > Signed-off-by: Ben Hutchings Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/43af0a70db62488fc3ad1505f6 cheers
Re: [1/2] powerpc: Remove Mac-on-Linux hooks
On Fri, 2016-12-02 at 02:35:52 UTC, Ben Hutchings wrote: > The symbols exported for use by MOL aren't getting CRCs and I was > about to fix that. But MOL is dead upstream, and the latest work on > it was to make it use KVM instead of its own kernel module. So remove > them instead. > > Signed-off-by: Ben Hutchings Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/307260137592bddd0ce810dd64 cheers
Re: drivers/pcmcia: NO_IRQ removal for electra_cf.c
On Sat, 2016-09-10 at 10:01:30 UTC, Michael Ellerman wrote: > We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it > from electra_cf.c which is a powerpc-only driver. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/6c8343e82bec46ece4fe773304a84c cheers
Re: powerpc/64s: fix idle wakeup potential to clobber registers
On Fri, 2017-03-17 at 05:13:20 UTC, Nicholas Piggin wrote: > We concluded there may be a window where the idle wakeup code could > get to pnv_wakeup_tb_loss (which clobbers non-volatile GPRs), but the > hardware may set SRR1[46:47] to 01b (no state loss) which would > result in the wakeup code failing to restore non-volatile GPRs. > > I was not able to trigger this condition with trivial tests on > real hardware or simulator, but the ISA (at least 2.07) seems to > allow for it, and Gautham says that it can happen if there is an > exception pending when the sleep/winkle instruction is executed. > > Signed-off-by: Nicholas Piggin > Acked-by: Gautham R. Shenoy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/6d98ce0be541d4a3cfbb52cd75072c cheers
Re: [RESEND] cxl: Route eeh events to all slices for pci_channel_io_perm_failure state
On Thu, 2017-02-23 at 03:27:26 UTC, Vaibhav Jain wrote: > Fix a boundary condition where in some cases an eeh event with > state == pci_channel_io_perm_failure won't be passed on to a driver > attached to the virtual pci device associated with a slice. This will > happen in case the slice just before (n-1) doesn't have any vPHB bus > associated with it, which results in an early return from > cxl_pci_error_detected callback. > > With state==pci_channel_io_perm_failure, the adapter will be removed > irrespective of the return value of cxl_vphb_error_detected. So we now > always return PCI_ERS_RESULT_DISCONNECTED for this case, i.e. even if > the AFU isn't using a vPHB (currently returns PCI_ERS_RESULT_NONE). > > Fixes: e4f5fc001a6 ("cxl: Do not create vPHB if there are no AFU configuration > records") > Signed-off-by: Vaibhav Jain > Reviewed-by: Matthew R. Ochs > Reviewed-by: Andrew Donnellan > Acked-by: Frederic Barrat Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/07f5ab6002a4f0b633f3495157166f cheers
Re: [PATCH V2 2/6] cxl: Keep track of mm struct associated with a context
Another thought about that patch. Now that we keep track of the mm associated to a context, I think we can simplify slightly the function _cxl_slbia() in main.c, where we look for the mm based on the pid. We now have the information readily available. Fred Le 14/03/2017 à 12:08, Christophe Lombard a écrit : The mm_struct corresponding to the current task is acquired each time an interrupt is raised. So to simplify the code, we only get the mm_struct when attaching an AFU context to the process. The mm_count reference is increased to ensure that the mm_struct can't be freed. The mm_struct will be released when the context is detached. The reference (use count) on the struct mm is not kept to avoid a circular dependency if the process mmaps its cxl mmio and forget to unmap before exiting. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/api.c | 17 -- drivers/misc/cxl/context.c | 26 ++-- drivers/misc/cxl/cxl.h | 13 ++-- drivers/misc/cxl/fault.c | 77 +- drivers/misc/cxl/file.c| 15 +++-- 5 files changed, 68 insertions(+), 80 deletions(-) diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c index bcc030e..1a138c8 100644 --- a/drivers/misc/cxl/api.c +++ b/drivers/misc/cxl/api.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "cxl.h" @@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed, if (task) { ctx->pid = get_task_pid(task, PIDTYPE_PID); - ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID); kernel = false; ctx->real_mode = false; + + /* acquire a reference to the task's mm */ + ctx->mm = get_task_mm(current); + + /* ensure this mm_struct can't be freed */ + cxl_context_mm_count_get(ctx); + + /* decrement the use count */ + if (ctx->mm) + mmput(ctx->mm); } cxl_ctx_get(); if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) { - put_pid(ctx->glpid); put_pid(ctx->pid); - ctx->glpid = ctx->pid = NULL; + ctx->pid = NULL; cxl_adapter_context_put(ctx->afu->adapter); cxl_ctx_put(); + if (task) + 
cxl_context_mm_count_put(ctx); goto out; } diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 062bf6c..ed0a447 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -41,7 +42,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master) spin_lock_init(&ctx->sste_lock); ctx->afu = afu; ctx->master = master; - ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */ + ctx->pid = NULL; /* Set in start work ioctl */ mutex_init(&ctx->mapping_lock); ctx->mapping = NULL; @@ -242,12 +243,15 @@ int __detach_context(struct cxl_context *ctx) /* release the reference to the group leader and mm handling pid */ put_pid(ctx->pid); - put_pid(ctx->glpid); cxl_ctx_put(); /* Decrease the attached context count on the adapter */ cxl_adapter_context_put(ctx->afu->adapter); + + /* Decrease the mm count on the context */ + cxl_context_mm_count_put(ctx); + return 0; } @@ -325,3 +329,21 @@ void cxl_context_free(struct cxl_context *ctx) mutex_unlock(&ctx->afu->contexts_lock); call_rcu(&ctx->rcu, reclaim_ctx); } + +void cxl_context_mm_count_get(struct cxl_context *ctx) +{ + if (ctx->mm) + atomic_inc(&ctx->mm->mm_count); +} + +void cxl_context_mm_count_put(struct cxl_context *ctx) +{ + if (ctx->mm) + mmdrop(ctx->mm); +} + +void cxl_context_mm_users_get(struct cxl_context *ctx) +{ + if (ctx->mm) + atomic_inc(&ctx->mm->mm_users); +} diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 79e60ec..4d1b704 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -482,8 +482,6 @@ struct cxl_context { unsigned int sst_size, sst_lru; wait_queue_head_t wq; - /* pid of the group leader associated with the pid */ - struct pid *glpid; /* use mm context associated with this pid for ds faults */ struct pid *pid; spinlock_t lock; /* Protects pending_irq_mask, pending_fault and fault_addr */ @@ -551,6 +549,8 @@ struct cxl_context { * 
CX4 only: */ struct list_head extra_irq_contexts; + + struct mm_struct *mm; }; struct cxl_service_layer_ops { @@ -1024,4 +1024,13 @@ int cxl_adapter_context_lock(struct cxl *adapter); /* Unlock the contexts-lock if taken. Warn and force unlock
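The two-level counting scheme this patch relies on can be illustrated with a toy standalone model (hypothetical names; the real code uses the kernel's mm_struct, atomic_inc on mm_count, mmput and mmdrop): mm_users keeps the address space alive, while mm_count only keeps the structure allocated. Because the context takes a count reference rather than a users reference, it cannot pin the address space and create the circular mmap dependency the commit message warns about.

```c
#include <assert.h>

/* Toy model of the mm_struct lifetime rules described above. */
struct toy_mm {
	int users;        /* address-space users (models mm_users) */
	int count;        /* struct lifetime refs (models mm_count) */
	int space_alive;  /* 1 while the address space exists */
	int struct_alive; /* 1 while the struct itself exists */
};

/* models mmgrab(): pin the struct, not the address space */
static void toy_mmgrab(struct toy_mm *mm) { mm->count++; }

/* models mmdrop(): free the struct on the last count ref */
static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->count == 0)
		mm->struct_alive = 0;
}

/* models mmput(): tear down the address space on the last user;
 * the last user also drops one struct reference, as in the kernel */
static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->users == 0) {
		mm->space_alive = 0;
		toy_mmdrop(mm);
	}
}
```

This is why the patch can safely call mmput() right after taking the count reference: the struct stays valid for fault handling even after the process exits and its address space is torn down.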
Re: [PATCH V2 6/6] cxl: Add psl9 specific code
Le 14/03/2017 à 12:08, Christophe Lombard a écrit : The new Coherent Accelerator Interface Architecture, level 2, for the IBM POWER9 brings new content and features: - POWER9 Service Layer - Registers - Radix mode - Process element entry - Dedicated-Shared Process Programming Model - Translation Fault Handling - CAPP - Memory Context ID If a valid mm_struct is found the memory context id is used for each transaction associated with the process handle. The PSL uses the context ID to find the corresponding process element. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/context.c | 13 +++ drivers/misc/cxl/cxl.h | 124 drivers/misc/cxl/debugfs.c | 19 drivers/misc/cxl/fault.c | 48 ++ drivers/misc/cxl/guest.c | 8 +- drivers/misc/cxl/irq.c | 53 +++ drivers/misc/cxl/native.c | 218 +-- drivers/misc/cxl/pci.c | 228 +++-- drivers/misc/cxl/trace.h | 43 + 9 files changed, 700 insertions(+), 54 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 0fd3cf8..8d41fc5 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -205,6 +205,19 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma) return -EBUSY; } + if ((ctx->afu->current_mode == CXL_MODE_DEDICATED) && + (cxl_is_psl9(ctx->afu))) { + /* make sure there is a valid problem state area space for this AFU */ + if (ctx->master && !ctx->afu->psa) { + pr_devel("AFU doesn't support mmio space\n"); + return -EINVAL; + } + + /* Can't mmap until the AFU is enabled */ + if (!ctx->afu->enabled) + return -EBUSY; + } + It looks we could refactor the if clause with the above... pr_devel("%s: mmio physical: %llx pe: %i master:%i\n", __func__, ctx->psn_phys, ctx->pe , ctx->master); diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index f6a3a34..fbdc511 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -63,7 +63,7 @@ typedef struct { /* Memory maps. 
Ref CXL Appendix A */ /* PSL Privilege 1 Memory Map */ -/* Configuration and Control area */ +/* Configuration and Control area - CAIA 1&2 */ static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x}; static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008}; static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010}; @@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase = {0x0100}; static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108}; static const cxl_p1_reg_t CXL_XSL_FEC = {0x0158}; static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168}; +/* PSL registers - CAIA 2 */ +static const cxl_p1_reg_t CXL_PSL9_CONTROL = {0x0020}; +static const cxl_p1_reg_t CXL_XSL9_DSNCTL = {0x0168}; +static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300}; +static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308}; /* TBD NML CAIA 2 */ TBD NML??? +static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310}; +static const cxl_p1_reg_t CXL_PSL9_DEBUG= {0x0320}; +static const cxl_p1_reg_t CXL_PSL9_FIR_CNTL = {0x0348}; +static const cxl_p1_reg_t CXL_PSL9_DSNDCTL = {0x0350}; +static const cxl_p1_reg_t CXL_PSL9_TB_CTLSTAT = {0x0340}; +static const cxl_p1_reg_t CXL_PSL9_TRACECFG = {0x0368}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDALLOC = {0x0378}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDTYPE = {0x0380}; +static const cxl_p1_reg_t CXL_PSL9_TNR_ADDR = {0x0388}; +static const cxl_p1_reg_t CXL_PSL9_GP_CT = {0x0398}; +static const cxl_p1_reg_t CXL_XSL9_IERAT = {0x0588}; +static const cxl_p1_reg_t CXL_XSL9_ILPP = {0x0590}; + /* 0x7F00:7FFF Reserved PCIe MSI-X Pending Bit Array area */ /* 0x8000: Reserved PCIe MSI-X Table Area */ /* PSL Slice Privilege 1 Memory Map */ -/* Configuration Area */ +/* Configuration Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_SR_An = {0x00}; static const cxl_p1n_reg_t CXL_PSL_LPID_An= {0x08}; static const cxl_p1n_reg_t CXL_PSL_AMBAR_An = {0x10}; @@ -111,17 +129,18 @@ static const cxl_p1n_reg_t CXL_PSL_ID_An = {0x20}; static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28}; /* Memory Management 
and Lookaside Buffer Management - CAIA 1*/ static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30}; +/* Memory Management and Lookaside Buffer Management - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38}; -/* Pointer Area */ +/* Pointer Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_HAURP_An = {0x80}; static const cxl_p1n_reg_t CXL_PSL_SPAP_An= {0x88}; static const cxl_p1n_reg_t CXL_PSL_LLCMD_An = {0x90}; -/* Control Area */ +/* Control Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_SCNTL_An
Re: [PATCH V2 1/6] cxl: Remove unused values in bare-metal environment.
On 14/03/2017 at 12:08, Christophe Lombard wrote:
> The two fields pid and tid of the structure cxl_irq_info are only used
> in the guest environment. To avoid confusion, it's not necessary to
> fill the fields in the bare-metal environment. The PSL Process and
> Thread Identification Register is only used when attaching a dedicated
> process, for PSL8 only.

I forgot to mention: we should probably say in the commit message that the code change is justified because the CXL_PSL_PID_TID_An register goes away in CAIA2 and cxl was not really using it.

Fred

> Signed-off-by: Christophe Lombard
> ---
>  drivers/misc/cxl/native.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
> index 7ae7105..7257e8b 100644
> --- a/drivers/misc/cxl/native.c
> +++ b/drivers/misc/cxl/native.c
> @@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx)
>  static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
>  {
> -	u64 pidtid;
> -
>  	/* If the adapter has gone away, we can't get any meaningful
>  	 * information. */
> @@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
>  	info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
>  	info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An);
>  	info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An);
> -	pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An);
> -	info->pid = pidtid >> 32;
> -	info->tid = pidtid & 0xffffffff;
>  	info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An);
>  	info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An);
>  	info->proc_handle = 0;
Re: [PATCH v2 2/4] asm-generic/io.h: Remove unused generic __ioremap() definition
Hi Björn, On Mon, Mar 20, 2017 at 7:42 PM, Bjorn Helgaas wrote: > Several arches use __ioremap() to help implement the generic ioremap(), > ioremap_nocache(), and ioremap_wc() interfaces, but this usage is all > inside the arch/ directory. > > The only __ioremap() uses outside arch/ are in the ZorroII RAM disk driver > and some framebuffer drivers that are only buildable on m68k and powerpc, > and they use the versions provided by those arches. > > There's no need for a generic version of __ioremap(), so remove it. These all predate the ioremap_*() variants, and can be converted to either ioremap_nocache() or ioremap_wt(). However, PPC doesn't implement ioremap_wt() yet, so asm-generic will fall back to the less-efficient nocache variant. PPC does support __ioremap(..., _PAGE_WRITETHRU), so adding a wrapper is trivial. > Signed-off-by: Bjorn Helgaas > Reviewed-by: Arnd Bergmann Regardless, Acked-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
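The "trivial wrapper" mentioned here would be on the order of the sketch below. It is shown against mocked stand-ins so it is self-contained; in the kernel the real `__ioremap()` and `_PAGE_WRITETHRU` come from the powerpc headers, the return type is `void __iomem *`, and the flag value is arch-defined, so everything mocked here is an assumption for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Mocked stand-ins (hypothetical values, NOT the kernel's). */
#define _PAGE_WRITETHRU 0x40UL
typedef uint64_t phys_addr_t;

static unsigned long last_flags; /* records which flags were requested */

static void *__ioremap(phys_addr_t addr, size_t size, unsigned long flags)
{
	(void)addr;
	(void)size;
	last_flags = flags;
	return (void *)0x1000; /* dummy mapping cookie */
}

/* The proposed wrapper: route ioremap_wt() to the write-through
 * variant of __ioremap(), instead of falling back to nocache. */
#define ioremap_wt ioremap_wt
static void *ioremap_wt(phys_addr_t addr, size_t size)
{
	return __ioremap(addr, size, _PAGE_WRITETHRU);
}
```

The `#define ioremap_wt ioremap_wt` line mirrors the asm-generic convention: defining the macro tells the generic header not to install its fallback.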
Re: [PATCH 2/3] powerpc/mm: handle VM_FAULT_RETRY earlier
On 21/03/2017 10:12, Aneesh Kumar K.V wrote: > Laurent Dufour writes: > >> In do_page_fault() if handle_mm_fault() returns VM_FAULT_RETRY, retry >> the page fault handling before anything else. >> >> This would simplify the handling of the mmap_sem lock in this part of >> the code. >> >> Signed-off-by: Laurent Dufour >> --- >> arch/powerpc/mm/fault.c | 67 >> - >> 1 file changed, 38 insertions(+), 29 deletions(-) >> >> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c >> index ee09604bbe12..2a6bc7e6e69a 100644 >> --- a/arch/powerpc/mm/fault.c >> +++ b/arch/powerpc/mm/fault.c >> @@ -434,6 +434,26 @@ int do_page_fault(struct pt_regs *regs, unsigned long >> address, >> * the fault. >> */ >> fault = handle_mm_fault(vma, address, flags); >> + >> +/* >> + * Handle the retry right now, the mmap_sem has been released in that >> + * case. >> + */ >> +if (unlikely(fault & VM_FAULT_RETRY)) { >> +/* We retry only once */ >> +if (flags & FAULT_FLAG_ALLOW_RETRY) { >> +/* >> + * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk >> + * of starvation. >> + */ >> +flags &= ~FAULT_FLAG_ALLOW_RETRY; >> +flags |= FAULT_FLAG_TRIED; >> +if (!fatal_signal_pending(current)) >> +goto retry; >> +} >> +/* We will enter mm_fault_error() below */ >> +} >> + >> if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { >> if (fault & VM_FAULT_SIGSEGV) >> goto bad_area; >> @@ -445,38 +465,27 @@ int do_page_fault(struct pt_regs *regs, unsigned long >> address, >> } > > We could make it further simpler, by handling the FAULT_RETRY without > FLAG_ALLOW_RETRY set earlier. But i guess that can be done later ? Thanks for the review, I agree that double checking against VM_FAULT_RETRY is confusing here. But handling all the retry path in the first if() statement means that we'll have to handle part of the mm_fault_error() code and segv here... Unless we can't identify what is really relevant in that retry path. 
It would take time to review all that tricky part, but I agree it should be simplified later. > > Reviewed-by: Aneesh Kumar K.V > > >> >> /* >> - * Major/minor page fault accounting is only done on the >> - * initial attempt. If we go through a retry, it is extremely >> - * likely that the page will be found in page cache at that point. >> + * Major/minor page fault accounting. >> */ >> -if (flags & FAULT_FLAG_ALLOW_RETRY) { >> -if (fault & VM_FAULT_MAJOR) { >> -current->maj_flt++; >> -perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, >> - regs, address); >> +if (fault & VM_FAULT_MAJOR) { >> +current->maj_flt++; >> +perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, >> + regs, address); >> #ifdef CONFIG_PPC_SMLPAR >> -if (firmware_has_feature(FW_FEATURE_CMO)) { >> -u32 page_ins; >> - >> -preempt_disable(); >> -page_ins = be32_to_cpu(get_lppaca()->page_ins); >> -page_ins += 1 << PAGE_FACTOR; >> -get_lppaca()->page_ins = cpu_to_be32(page_ins); >> -preempt_enable(); >> -} >> -#endif /* CONFIG_PPC_SMLPAR */ >> -} else { >> -current->min_flt++; >> -perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, >> - regs, address); >> -} >> -if (fault & VM_FAULT_RETRY) { >> -/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk >> - * of starvation. */ >> -flags &= ~FAULT_FLAG_ALLOW_RETRY; >> -flags |= FAULT_FLAG_TRIED; >> -goto retry; >> +if (firmware_has_feature(FW_FEATURE_CMO)) { >> +u32 page_ins; >> + >> +preempt_disable(); >> +page_ins = be32_to_cpu(get_lppaca()->page_ins); >> +page_ins += 1 << PAGE_FACTOR; >> +get_lppaca()->page_ins = cpu_to_be32(page_ins); >> +preempt_enable(); >> } >> +#endif /* CONFIG_PPC_SMLPAR */ >> +} else { >> +current->min_flt++; >> +perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, >> + regs, address); >> } >> >> up_read(&mm->mmap_sem); >> -- >> 2.7.4
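The retry discipline being reviewed here can be condensed into a small standalone sketch (made-up flag values, and the fatal_signal_pending() check is omitted; this only models the flag handling, not the real fault path): a VM_FAULT_RETRY result triggers exactly one retry, because the first retry clears ALLOW_RETRY and sets TRIED, so a second VM_FAULT_RETRY falls through to the error path instead of looping.

```c
#include <assert.h>

/* Hypothetical flag values for illustration only. */
#define FAULT_FLAG_ALLOW_RETRY 0x1u
#define FAULT_FLAG_TRIED       0x2u
#define VM_FAULT_RETRY         0x4u

static int fault_calls; /* counts invocations of the fault handler */

/* A handler that always asks to retry, to exercise the limit. */
static unsigned int always_retry(unsigned int flags)
{
	(void)flags;
	fault_calls++;
	return VM_FAULT_RETRY;
}

/* Mirrors the patch's structure: on VM_FAULT_RETRY, retry at most
 * once by clearing ALLOW_RETRY and setting TRIED, avoiding the
 * starvation the code comment warns about. */
static unsigned int fault_with_one_retry(unsigned int (*fault_fn)(unsigned int),
					 unsigned int flags)
{
	unsigned int fault;

retry:
	fault = fault_fn(flags);
	if (fault & VM_FAULT_RETRY) {
		if (flags & FAULT_FLAG_ALLOW_RETRY) {
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			flags |= FAULT_FLAG_TRIED;
			goto retry;
		}
		/* would fall through to mm_fault_error() here */
	}
	return fault;
}
```

Handling this immediately after handle_mm_fault() is what lets the later patch drop mmap_sem exactly once on the non-retry path.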
Re: [PATCH V2 0/6] cxl: Add support for Coherent Accelerator Interface Architecture 2.0
On 21/03/2017 at 03:47, Andrew Donnellan wrote:
> On 14/03/17 22:08, Christophe Lombard wrote:
>> The first 3 patches are mostly cleanup and fixes, separating the
>> psl8-specific code from the code which will also be used for psl9.
>> Patch 4 restructures existing code to easily add the psl
>> implementation. Patches 5 and 6 rename and isolate
>> implementation-specific code. Patch 7 introduces the core of the
>> PSL9-specific code.
>
> Is patch 1 (cxl: Read vsec perst load image) from the last version of
> the series missing in this one?

Hmm, good point! It will be present again in V3, sorry about that.
Re: [PATCH V2 4/6] cxl: Rename some psl8 specific functions
Le 20/03/2017 à 17:26, Frederic Barrat a écrit : Le 14/03/2017 à 12:08, Christophe Lombard a écrit : Rename a few functions, changing the '_psl' suffix to '_psl8', to make clear that the implementation is psl8 specific. Those functions will have an equivalent implementation for the psl9 in a later patch. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/cxl.h | 22 ++--- drivers/misc/cxl/debugfs.c | 6 +++--- drivers/misc/cxl/guest.c | 2 +- drivers/misc/cxl/irq.c | 2 +- drivers/misc/cxl/native.c | 14 +++--- drivers/misc/cxl/pci.c | 48 +++--- 6 files changed, 47 insertions(+), 47 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 3e03a66..1f04238 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -815,10 +815,10 @@ void afu_irq_name_free(struct cxl_context *ctx); #ifdef CONFIG_DEBUG_FS -int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr); -int cxl_activate_dedicated_process_psl(struct cxl_afu *afu); -int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr); -void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx); +int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu); +int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx); See previous comment regarding "#ifdef CONFIG_DEBUG_FS" The rest of the patch looks good to me. Fred you are right. 
Thanks int cxl_debugfs_init(void); void cxl_debugfs_exit(void); @@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter); void cxl_debugfs_adapter_remove(struct cxl *adapter); int cxl_debugfs_afu_add(struct cxl_afu *afu); void cxl_debugfs_afu_remove(struct cxl_afu *afu); -void cxl_stop_trace_psl(struct cxl *cxl); -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir); +void cxl_stop_trace_psl8(struct cxl *cxl); +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir); void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir); +void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir); #else /* CONFIG_DEBUG_FS */ @@ -931,9 +931,9 @@ struct cxl_irq_info { }; void cxl_assign_psn_space(struct cxl_context *ctx); -int cxl_invalidate_all_psl(struct cxl *adapter); -irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); -irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info); +int cxl_invalidate_all_psl8(struct cxl *adapter); +irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); +irqreturn_t cxl_fail_irq_psl8(struct cxl_afu *afu, struct cxl_irq_info *irq_info); int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler, void *cookie, irq_hw_number_t *dest_hwirq, unsigned int *dest_virq, const char *name); @@ -944,7 +944,7 @@ int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); -void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx); +void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx); void cxl_native_err_irq_dump_regs(struct cxl *adapter); int cxl_pci_vphb_add(struct cxl_afu *afu); void cxl_pci_vphb_remove(struct cxl_afu *afu); diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c index 
4848ebf..2ff10a9 100644 --- a/drivers/misc/cxl/debugfs.c +++ b/drivers/misc/cxl/debugfs.c @@ -15,7 +15,7 @@ static struct dentry *cxl_debugfs; -void cxl_stop_trace_psl(struct cxl *adapter) +void cxl_stop_trace_psl8(struct cxl *adapter) { int slice; @@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, umode_t mode, (void __force *)value, &fops_io_x64); } -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir) +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir) { debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, CXL_PSL_FIR1)); debugfs_create_io_x64("fir2", S_IRUSR, dir, _cxl_p1_addr(adapter, CXL_PSL_FIR2)); @@ -92,7 +92,7 @@ void cxl_debugfs_adapter_remove(struct cxl *adapter) debugfs_remove_recursive(adapter->debugfs); } -void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir) +void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir) { debugfs_create_io_x64("fir", S_IRUSR, dir, _cxl_p1n_addr(afu, CXL_PSL_FIR_SLICE_An)); debugfs_create_io_x64("serr", S_IRUSR, dir, _cxl_p1n_addr(afu, CXL_PSL_SERR_An)); diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/g
Re: [PATCH 3/3] powerpc/mm: move mmap_sem unlocking in do_page_fault()
Laurent Dufour writes: > Since the fault retry is now handled earlier, we can release the > mmap_sem lock earlier too and remove later unlocking previously done in > mm_fault_error(). > Reviewed-by: Aneesh Kumar K.V > Signed-off-by: Laurent Dufour > --- > arch/powerpc/mm/fault.c | 19 --- > 1 file changed, 4 insertions(+), 15 deletions(-) > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 2a6bc7e6e69a..21e06cce8984 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -151,13 +151,6 @@ static int mm_fault_error(struct pt_regs *regs, unsigned > long addr, int fault) >* continue the pagefault. >*/ > if (fatal_signal_pending(current)) { > - /* > - * If we have retry set, the mmap semaphore will have > - * alrady been released in __lock_page_or_retry(). Else > - * we release it now. > - */ > - if (!(fault & VM_FAULT_RETRY)) > - up_read(¤t->mm->mmap_sem); > /* Coming from kernel, we need to deal with uaccess fixups */ > if (user_mode(regs)) > return MM_FAULT_RETURN; > @@ -170,8 +163,6 @@ static int mm_fault_error(struct pt_regs *regs, unsigned > long addr, int fault) > > /* Out of memory */ > if (fault & VM_FAULT_OOM) { > - up_read(¤t->mm->mmap_sem); > - > /* >* We ran out of memory, or some other thing happened to us that >* made us unable to handle the page fault gracefully. 
> @@ -182,10 +173,8 @@ static int mm_fault_error(struct pt_regs *regs, unsigned > long addr, int fault) > return MM_FAULT_RETURN; > } > > - if (fault & > (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) { > - up_read(¤t->mm->mmap_sem); > + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) > return do_sigbus(regs, addr, fault); > - } > > /* We don't understand the fault code, this is fatal */ > BUG(); > @@ -452,11 +441,12 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, > goto retry; > } > /* We will enter mm_fault_error() below */ > - } > + } else > + up_read(¤t->mm->mmap_sem); > > if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > if (fault & VM_FAULT_SIGSEGV) > - goto bad_area; > + goto bad_area_nosemaphore; > rc = mm_fault_error(regs, address, fault); > if (rc >= MM_FAULT_RETURN) > goto bail; > @@ -488,7 +478,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, > regs, address); > } > > - up_read(&mm->mmap_sem); > goto bail; > > bad_area: > -- > 2.7.4
Re: [PATCH 2/3] powerpc/mm: handle VM_FAULT_RETRY earlier
Laurent Dufour writes:

> In do_page_fault() if handle_mm_fault() returns VM_FAULT_RETRY, retry
> the page fault handling before anything else.
>
> This would simplify the handling of the mmap_sem lock in this part of
> the code.
>
> Signed-off-by: Laurent Dufour
> ---
>  arch/powerpc/mm/fault.c | 67 ++++++++++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 38 insertions(+), 29 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index ee09604bbe12..2a6bc7e6e69a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -434,6 +434,26 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> +
> +	/*
> +	 * Handle the retry right now, the mmap_sem has been released in that
> +	 * case.
> +	 */
> +	if (unlikely(fault & VM_FAULT_RETRY)) {
> +		/* We retry only once */
> +		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> +			/*
> +			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
> +			 * of starvation.
> +			 */
> +			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> +			flags |= FAULT_FLAG_TRIED;
> +			if (!fatal_signal_pending(current))
> +				goto retry;
> +		}
> +		/* We will enter mm_fault_error() below */
> +	}
> +
>  	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
>  		if (fault & VM_FAULT_SIGSEGV)
>  			goto bad_area;
> @@ -445,38 +465,27 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>  	}

We could make it simpler still by handling VM_FAULT_RETRY without
FAULT_FLAG_ALLOW_RETRY set earlier. But I guess that can be done later?

Reviewed-by: Aneesh Kumar K.V

>
>  	/*
> -	 * Major/minor page fault accounting is only done on the
> -	 * initial attempt. If we go through a retry, it is extremely
> -	 * likely that the page will be found in page cache at that point.
> +	 * Major/minor page fault accounting.
>  	 */
> -	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			current->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> -				      regs, address);
> +	if (fault & VM_FAULT_MAJOR) {
> +		current->maj_flt++;
> +		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> +			      regs, address);
>  #ifdef CONFIG_PPC_SMLPAR
> -		if (firmware_has_feature(FW_FEATURE_CMO)) {
> -			u32 page_ins;
> -
> -			preempt_disable();
> -			page_ins = be32_to_cpu(get_lppaca()->page_ins);
> -			page_ins += 1 << PAGE_FACTOR;
> -			get_lppaca()->page_ins = cpu_to_be32(page_ins);
> -			preempt_enable();
> -		}
> -#endif /* CONFIG_PPC_SMLPAR */
> -		} else {
> -			current->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> -				      regs, address);
> -		}
> -		if (fault & VM_FAULT_RETRY) {
> -			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
> -			 * of starvation. */
> -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> -			flags |= FAULT_FLAG_TRIED;
> -			goto retry;
> +		if (firmware_has_feature(FW_FEATURE_CMO)) {
> +			u32 page_ins;
> +
> +			preempt_disable();
> +			page_ins = be32_to_cpu(get_lppaca()->page_ins);
> +			page_ins += 1 << PAGE_FACTOR;
> +			get_lppaca()->page_ins = cpu_to_be32(page_ins);
> +			preempt_enable();
>  		}
> +#endif /* CONFIG_PPC_SMLPAR */
> +	} else {
> +		current->min_flt++;
> +		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> +			      regs, address);
>  	}
>
>  	up_read(&mm->mmap_sem);
> --
> 2.7.4
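The retry-once discipline above — clear FAULT_FLAG_ALLOW_RETRY before looping so a second VM_FAULT_RETRY cannot spin forever — can be sketched as a plain C state machine. Flag and function names here are illustrative stand-ins, not the kernel's:

```c
#include <stdbool.h>

/* Illustrative bits standing in for FAULT_FLAG_ALLOW_RETRY,
 * FAULT_FLAG_TRIED and the VM_FAULT_RETRY result bit. */
#define F_ALLOW_RETRY	0x1
#define F_TRIED		0x2
#define R_RETRY		0x4

/* Fake fault handler: asks for a retry on the first `retries` calls. */
static int fake_handle_fault(int *calls, int retries)
{
	return (*calls)++ < retries ? R_RETRY : 0;
}

/* Retry loop in the shape of the reworked do_page_fault().
 * Returns the number of attempts made; it is bounded by 2 because
 * F_ALLOW_RETRY is cleared before the single retry. */
static int fake_do_page_fault(int retries)
{
	int flags = F_ALLOW_RETRY;
	int calls = 0;
	int fault;

retry:
	fault = fake_handle_fault(&calls, retries);
	if (fault & R_RETRY) {
		if (flags & F_ALLOW_RETRY) {
			flags &= ~F_ALLOW_RETRY;	/* we retry only once */
			flags |= F_TRIED;
			goto retry;
		}
		/* fall through to the mm_fault_error() analogue */
	}
	return calls;
}
```

Even a handler that reports R_RETRY on every call produces at most two attempts, which is the starvation guarantee the patch comment refers to.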
Re: [PATCH 1/3] powerpc/mm: move mmap_sem unlock up from do_sigbus
Laurent Dufour writes:

> Move mmap_sem releasing into do_sigbus()'s unique caller: mm_fault_error()
>
> No functional changes.
>
> Reviewed-by: Aneesh Kumar K.V
> Signed-off-by: Laurent Dufour
> ---
>  arch/powerpc/mm/fault.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 62a50d6d1053..ee09604bbe12 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -119,8 +119,6 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address,
>  	siginfo_t info;
>  	unsigned int lsb = 0;
>
> -	up_read(&current->mm->mmap_sem);
> -
>  	if (!user_mode(regs))
>  		return MM_FAULT_ERR(SIGBUS);
>
> @@ -184,8 +182,10 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
>  		return MM_FAULT_RETURN;
>  	}
>
> -	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
> +	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
> +		up_read(&current->mm->mmap_sem);
>  		return do_sigbus(regs, addr, fault);
> +	}
>
>  	/* We don't understand the fault code, this is fatal */
>  	BUG();
> --
> 2.7.4
Re: [kernel-hardening] Re: [PATCH] gcc-plugins: update architecture list in documentation
Kees Cook writes:

> On Mon, Mar 20, 2017 at 1:39 AM, Michael Ellerman wrote:
>> Andrew Donnellan writes:
>>
>>> Commit 65c059bcaa73 ("powerpc: Enable support for GCC plugins") enabled GCC
>>> plugins on powerpc, but neglected to update the architecture list in the
>>> docs. Rectify this.
>>>
>>> Fixes: 65c059bcaa73 ("powerpc: Enable support for GCC plugins")
>>> Signed-off-by: Andrew Donnellan
>>> ---
>>>  Documentation/gcc-plugins.txt | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> It would be nice to merge this for v4.11, so the docs are up to date
>> with the code for the release.
>>
>> I'm not sure who owns it though, should it go via Jon as docs
>> maintainer (added to CC), or via Kees' tree or mine?
>
> If you have other changes queued for v4.11, please take it via your
> tree. Otherwise, perhaps the docs tree or mine? (I don't currently
> have any fixes queued; I'm just trying to minimize pull requests going
> to Linus...)

I have some fixes queued already, so unless Jon yells I'll take it in
the next day or so.

cheers