Re: [PATCH kernel 3/4] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions

2018-10-16 Thread David Gibson
On Wed, Oct 17, 2018 at 02:34:32PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 17/10/2018 12:00, David Gibson wrote:
> > On Mon, Oct 15, 2018 at 08:24:15PM +1100, Alexey Kardashevskiy wrote:
> >> Since we are going to have 2 different preregistering helpers, let's
> >> make it clear that mm_iommu_new() is only for the normal (i.e. not device)
> >> memory and for existing areas mm_iommu_get() should be used instead.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> > 
> > I think the idea is sensible.  However (and, yes, this is really an
> > existing bug) - shouldn't we check for a request to add anything
> > overlapping with an existing region, not just one that exactly
> > matches?
> 
> The overlap check is below the changed hunk:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/mm/mmu_context_iommu.c#n150

Ah, right.

In that case can't you just drop this whole if?  I don't see that
there's any use in giving different error codes for "tried to register
exactly a region you registered before" and "tried to register a
region overlapping one you registered before".
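(For illustration only - a minimal sketch, not part of the patch, of what the
loop in mm_iommu_new() could look like if the exact-match branch were dropped
and the existing overlap test alone rejected both exact and partial matches
with a single error code:)

	mutex_lock(&mem_list_mutex);

	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
		/* Any overlap, including an exact match, is rejected */
		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
		    (ua < (mem->ua + (mem->entries << PAGE_SHIFT)))) {
			ret = -EINVAL;
			goto unlock_exit;
		}
	}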


> 
> 
> > 
> >> ---
> >>  arch/powerpc/mm/mmu_context_iommu.c | 3 +--
> >>  1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> >> b/arch/powerpc/mm/mmu_context_iommu.c
> >> index a8c4a3c..839dbce 100644
> >> --- a/arch/powerpc/mm/mmu_context_iommu.c
> >> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> >> @@ -141,8 +141,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long 
> >> ua, unsigned long entries,
> >>	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
> >>next) {
> >>if ((mem->ua == ua) && (mem->entries == entries)) {
> >> -  ++mem->used;
> >> -  *pmem = mem;
> >> +  ret = -EBUSY;
> >>goto unlock_exit;
> >>}
> >>  
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 3/4] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions

2018-10-16 Thread Alexey Kardashevskiy



On 17/10/2018 12:00, David Gibson wrote:
> On Mon, Oct 15, 2018 at 08:24:15PM +1100, Alexey Kardashevskiy wrote:
>> Since we are going to have 2 different preregistering helpers, let's
>> make it clear that mm_iommu_new() is only for the normal (i.e. not device)
>> memory and for existing areas mm_iommu_get() should be used instead.
>>
>> Signed-off-by: Alexey Kardashevskiy 
> 
> I think the idea is sensible.  However (and, yes, this is really an
> existing bug) - shouldn't we check for a request to add anything
> overlapping with an existing region, not just one that exactly
> matches?

The overlap check is below the changed hunk:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/mm/mmu_context_iommu.c#n150


> 
>> ---
>>  arch/powerpc/mm/mmu_context_iommu.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
>> b/arch/powerpc/mm/mmu_context_iommu.c
>> index a8c4a3c..839dbce 100644
>> --- a/arch/powerpc/mm/mmu_context_iommu.c
>> +++ b/arch/powerpc/mm/mmu_context_iommu.c
>> @@ -141,8 +141,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long 
>> ua, unsigned long entries,
>>	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>>  next) {
>>  if ((mem->ua == ua) && (mem->entries == entries)) {
>> -++mem->used;
>> -*pmem = mem;
>> +ret = -EBUSY;
>>  goto unlock_exit;
>>  }
>>  
> 

-- 
Alexey


Re: [PATCH v4 00/18] of: overlay: validation checks, subsequent fixes

2018-10-16 Thread Frank Rowand
On 10/16/18 02:47, Michael Ellerman wrote:
> frowand.l...@gmail.com writes:
> 
>> From: Frank Rowand 
>>
>> Add checks to (1) overlay apply process and (2) memory freeing
>> triggered by overlay release.  The checks are intended to detect
>> possible memory leaks and invalid overlays.
>>
>> The checks revealed bugs in existing code.  Fixed the bugs.
>>
>> While fixing bugs, noted other issues, which are fixed in
>> separate patches.
>>
>> *  Powerpc folks: I was not able to test the patches that
>> *  directly impact Powerpc systems that use dynamic
>> *  devicetree.  Please review that code carefully and
>> *  test.  The specific patches are: 03/16, 04/16, 07/16
> 
> Hi Frank,
> 
> Do you have this series in a git tree somewhere?
> 
> I tried applying it on top of linux-next but hit some conflicts which I
> couldn't easily resolve.
> 
> cheers
> 



git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git

$ git checkout v4.19-rc1--kfree_validate--v4

$ git log --oneline v4.19-rc1..
2ba1b7d353dd of: unittest: initialize args before calling of_*parse_*()
4f9108209f79 of: unittest: find overlays[] entry by name instead of index
353403c76ff8 of: unittest: allow base devicetree to have symbol metadata
8fc37e04a01b of: overlay: set node fields from properties when add new overlay n
05d5df0e5151 of: unittest: remove unused of_unittest_apply_overlay() argument
8c021cba757a of: overlay: check prevents multiple fragments touching same proper
797a6f66e039 of: overlay: check prevents multiple fragments add or delete same n
c385e25a040d of: overlay: test case of two fragments adding same node
c88fd240f0e0 of: overlay: make all pr_debug() and pr_err() messages unique
1028a215d32a of: overlay: validate overlay properties #address-cells and #size-c
f1a97ef74ce4 of: overlay: reorder fields in struct fragment
ffe78cf7a1fb of: dynamic: change type of of_{at,de}tach_node() to void
5f5ff8ec0c0c of: overlay: do not duplicate properties from overlay for new nodes
06e72dcb2bb0 of: overlay: use prop add changeset entry for property in new nodes
a02f8d326a08 powerpc/pseries: add of_node_put() in dlpar_detach_node()
e203be664330 of: overlay: add missing of_node_get() in __of_attach_node_sysfs
8eb46208e7c8 of: overlay: add missing of_node_put() after add new node to change
b22067db7cf9 of: overlay: add tests to validate kfrees from overlay removal


Re: [PATCH kernel 3/4] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions

2018-10-16 Thread David Gibson
On Mon, Oct 15, 2018 at 08:24:15PM +1100, Alexey Kardashevskiy wrote:
> Since we are going to have 2 different preregistering helpers, let's
> make it clear that mm_iommu_new() is only for the normal (i.e. not device)
> memory and for existing areas mm_iommu_get() should be used instead.
> 
> Signed-off-by: Alexey Kardashevskiy 

I think the idea is sensible.  However (and, yes, this is really an
existing bug) - shouldn't we check for a request to add anything
overlapping with an existing region, not just one that exactly
matches?

> ---
>  arch/powerpc/mm/mmu_context_iommu.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index a8c4a3c..839dbce 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -141,8 +141,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, 
> unsigned long entries,
> 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>   next) {
>   if ((mem->ua == ua) && (mem->entries == entries)) {
> - ++mem->used;
> - *pmem = mem;
> + ret = -EBUSY;
>   goto unlock_exit;
>   }
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 1/4] powerpc/mm/iommu: Rename mm_iommu_get

2018-10-16 Thread David Gibson
On Mon, Oct 15, 2018 at 08:24:13PM +1100, Alexey Kardashevskiy wrote:
> Normally mm_iommu_get() is supposed to add a reference and
> mm_iommu_put() to remove it. However historically mm_iommu_find() does
> the referencing and mm_iommu_get() is doing allocation and referencing.
> 
> This is step 1 towards simpler mm_iommu_get().
> 
> This renames:
> - mm_iommu_get to mm_iommu_new;
> - mm_iommu_find to mm_iommu_get.
> 
> This should cause no behavioural change.
> 
> Signed-off-by: Alexey Kardashevskiy 

Erm.. as far as I can tell, mm_iommu_find() doesn't adjust any
reference counts, so renaming it to mm_iommu_get() doesn't really make
sense.
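(For context, a generic kref-based sketch of the get/put convention being
invoked here - "foo" is hypothetical and not from this series: a get() is
expected to take a reference that a later put() drops, whereas a lookup that
takes no reference is usually called find() or lookup().)

	#include <linux/kref.h>
	#include <linux/slab.h>

	struct foo {
		struct kref ref;
	};

	/* foo_get(): caller gains a reference and must call foo_put() later */
	static struct foo *foo_get(struct foo *f)
	{
		kref_get(&f->ref);
		return f;
	}

	static void foo_release(struct kref *ref)
	{
		kfree(container_of(ref, struct foo, ref));
	}

	/* foo_put(): drops the reference taken by foo_get() */
	static void foo_put(struct foo *f)
	{
		kref_put(&f->ref, foo_release);
	}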

> ---
>  arch/powerpc/include/asm/mmu_context.h | 4 ++--
>  arch/powerpc/mm/mmu_context_iommu.c| 8 
>  drivers/vfio/vfio_iommu_spapr_tce.c| 6 +++---
>  3 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index b694d6a..59d4941 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
>  
>  extern int isolate_lru_page(struct page *page);  /* from internal.h */
>  extern bool mm_iommu_preregistered(struct mm_struct *mm);
> -extern long mm_iommu_get(struct mm_struct *mm,
> +extern long mm_iommu_new(struct mm_struct *mm,
>   unsigned long ua, unsigned long entries,
>   struct mm_iommu_table_group_mem_t **pmem);
>  extern long mm_iommu_put(struct mm_struct *mm,
> @@ -32,7 +32,7 @@ extern struct mm_iommu_table_group_mem_t 
> *mm_iommu_lookup(struct mm_struct *mm,
>   unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>   struct mm_struct *mm, unsigned long ua, unsigned long size);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
>   unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>   unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index 56c2234..8eeb99d 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -126,7 +126,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
>   return 0;
>  }
>  
> -long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
> entries,
> +long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long 
> entries,
>   struct mm_iommu_table_group_mem_t **pmem)
>  {
>   struct mm_iommu_table_group_mem_t *mem;
> @@ -252,7 +252,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
> unsigned long entries,
>  
>   return ret;
>  }
> -EXPORT_SYMBOL_GPL(mm_iommu_get);
> +EXPORT_SYMBOL_GPL(mm_iommu_new);
>  
>  static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  {
> @@ -368,7 +368,7 @@ struct mm_iommu_table_group_mem_t 
> *mm_iommu_lookup_rm(struct mm_struct *mm,
>   return ret;
>  }
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
>   unsigned long ua, unsigned long entries)
>  {
>   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> @@ -382,7 +382,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct 
> mm_struct *mm,
>  
>   return ret;
>  }
> -EXPORT_SYMBOL_GPL(mm_iommu_find);
> +EXPORT_SYMBOL_GPL(mm_iommu_get);
>  
>  long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>   unsigned long ua, unsigned int pageshift, unsigned long *hpa)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index ad63725..1701798 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -156,7 +156,7 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>   if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>   return -EINVAL;
>  
> - mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
> + mem = mm_iommu_get(container->mm, vaddr, size >> PAGE_SHIFT);
>   if (!mem)
>   return -ENOENT;
>  
> @@ -185,7 +185,7 @@ static long tce_iommu_register_pages(struct tce_container 
> *container,
>   ((vaddr + size) < vaddr))
>   return -EINVAL;
>  
> - mem = mm_iommu_find(container->mm, vaddr, entries);
> + mem = mm_iommu_get(container->mm, vaddr, entries);
>   if (mem) {
> 	list_for_each_entry(tcemem, &container->prereg_list, next) {
>   if (tcemem->mem == mem)
> @@ -193,7 +193,7 @@ static long tce_iommu_register_pages(struct tce_container 
> *container,
>   

Re: [PATCH kernel 2/4] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region

2018-10-16 Thread David Gibson
On Mon, Oct 15, 2018 at 08:24:14PM +1100, Alexey Kardashevskiy wrote:
> We are going to add another helper to preregister device memory so
> instead of having mm_iommu_new() which pre-registers the normal memory
> and references the region, we need separate helpers for pre-registering
> and referencing.
> 
> To make the mm_iommu_get name reflect what it is supposed to do, this
> changes mm_iommu_get() to reference the region so from now on for every
> mm_iommu_get() we need a matching mm_iommu_put().
> 
> Signed-off-by: Alexey Kardashevskiy 

.. ah, I see.

I think this should be folded with the first patch, so we don't have
an interim step where mm_iommu_get() has a misleading name.

> ---
>  arch/powerpc/mm/mmu_context_iommu.c |  5 +
>  drivers/vfio/vfio_iommu_spapr_tce.c | 33 ++---
>  2 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index 8eeb99d..a8c4a3c 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -373,13 +373,18 @@ struct mm_iommu_table_group_mem_t *mm_iommu_get(struct 
> mm_struct *mm,
>  {
>   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> +	mutex_lock(&mem_list_mutex);
> +
> 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>   if ((mem->ua == ua) && (mem->entries == entries)) {
>   ret = mem;
> + ++mem->used;
>   break;
>   }
>   }
>  
> +	mutex_unlock(&mem_list_mutex);
> +
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_get);
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 1701798..56db071 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -151,7 +151,8 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>  {
>   struct mm_iommu_table_group_mem_t *mem;
>   struct tce_iommu_prereg *tcemem;
> - bool found = false;
> + bool found;
> + long ret;
>  
>   if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>   return -EINVAL;
> @@ -168,9 +169,13 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>   }
>  
>   if (!found)
> - return -ENOENT;
> + ret = -ENOENT;
> + else
> + ret = tce_iommu_prereg_free(container, tcemem);
>  
> - return tce_iommu_prereg_free(container, tcemem);
> + mm_iommu_put(container->mm, mem);
> +
> + return ret;
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -188,19 +193,21 @@ static long tce_iommu_register_pages(struct 
> tce_container *container,
>   mem = mm_iommu_get(container->mm, vaddr, entries);
>   if (mem) {
> 	list_for_each_entry(tcemem, &container->prereg_list, next) {
> - if (tcemem->mem == mem)
> - return -EBUSY;
> + if (tcemem->mem == mem) {
> + ret = -EBUSY;
> + goto put_exit;
> + }
>   }
> + } else {
> +		ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
> + if (ret)
> + return ret;
>   }
>  
> -	ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
> - if (ret)
> - return ret;
> -
>   tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
>   if (!tcemem) {
> - mm_iommu_put(container->mm, mem);
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto put_exit;
>   }
>  
>   tcemem->mem = mem;
> @@ -209,6 +216,10 @@ static long tce_iommu_register_pages(struct 
> tce_container *container,
>   container->enabled = true;
>  
>   return 0;
> +
> +put_exit:
> + mm_iommu_put(container->mm, mem);
> + return ret;
>  }
>  
>  static bool tce_page_is_contained(struct page *page, unsigned page_shift)

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-16 Thread Alexey Kardashevskiy



On 17/10/2018 06:08, Alex Williamson wrote:
> On Mon, 15 Oct 2018 20:42:33 +1100
> Alexey Kardashevskiy  wrote:
> 
>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
>> pluggable PCIe devices but implement PCIe links for config space and MMIO.
>> In addition to that the GPUs are interconnected to each other and also
>> have direct links to the P9 CPU. The links are NVLink2 and provide direct
>> access to the system RAM for GPUs via NPU (an NVLink2 "proxy" on P9 chip).
>> These systems also support ATS (address translation services) which is
>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
>> (16GB in tested config) to the system via the same NVLink2 so a CPU has
>> cache-coherent access to a GPU RAM.
>>
>> This exports GPU RAM to the userspace as a new PCI region. This
>> preregisters the new memory as device memory as it might be used for DMA.
>> This inserts pfns from the fault handler as the GPU memory is not onlined
>> until the NVIDIA driver is loaded and has trained the links, so doing this
>> earlier produces low-level errors which we fence in the firmware, so
>> it does not hurt the host system, but it is still better to avoid.
>>
>> This exports ATSD (Address Translation Shootdown) register of NPU which
>> allows the guest to invalidate TLB. The register conveniently occupies
>> a single 64k page. Since NPU maps the GPU memory, it has a "tgt" property
>> (which is an abbreviated host system bus address). This exports the "tgt"
>> as a capability so the guest can program it into the GPU so the GPU can
>> know how to route DMA traffic.
> 
> I'm not really following what "tgt" is and why it's needed.  Is the GPU
> memory here different than the GPU RAM region above?  Why does the user
> need the host system bus address of this "tgt" thing?  Are we not able
> to relocate it in guest physical address space, does this shootdown
> only work in the host physical address space and therefore we need this
> offset?  Please explain, I'm confused.


This "tgt" is made of:
- "memory select" (bits 45, 46)
- "group select" (bits 43, 44)
- "chip select" (bit 42)
- chip internal address (bits 0..41)

These are internal to the GPU and this is where the GPU RAM is mapped into
the GPU's real address space; it fits in 46 bits.

On POWER9 CPU the bits are different and higher so the same memory is
mapped higher on P9 CPU. Just because we can map it higher, I guess.

So it is not exactly the address but this provides the exact physical
location of the memory.
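(To make the layout concrete - a stand-alone sketch using the bit positions
listed above; the helper name and the packing itself are only an illustration,
not how the kernel or the GPU firmware actually builds the value:)

	#include <stdint.h>

	/* bits 45-46: memory select, 43-44: group select, 42: chip select,
	 * 0-41: chip-internal address (numbering as given above) */
	static inline uint64_t tgt_pack(unsigned memsel, unsigned grpsel,
					unsigned chipsel, uint64_t addr)
	{
		return ((uint64_t)(memsel & 0x3) << 45) |
		       ((uint64_t)(grpsel & 0x3) << 43) |
		       ((uint64_t)(chipsel & 0x1) << 42) |
		       (addr & ((1ULL << 42) - 1));
	}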

We have a group of 3 interconnected GPUs, each with its own
memory/group/chip numbers. The GPUs use the ATS service to translate
userspace addresses to physical (host or guest) addresses. Now a GPU needs
to know which specific link to use for a specific physical address, in
other words what this physical address belongs to - a CPU or one of the
GPUs. This is when "tgt" is used by the GPU hardware.

A GPU could indeed route all the DMA traffic via the system bus, just not
as fast.

I am also struggling here and adding an Nvidia person in cc: (I should
have done that when I posted the patches, my bad) to correct when/if I
am wrong.



>  
>> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
>> know LPID (a logical partition ID or a KVM guest hardware ID in other
>> words) and PID (a memory context ID of an userspace process, not to be
>> confused with a linux pid). This assigns a GPU to LPID in the NPU and
>> this is why this adds a listener for KVM on an IOMMU group. A PID comes
>> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
>>
>> This requires coherent memory and ATSD to be available on the host as
>> the GPU vendor only supports configurations with both features enabled
>> and other configurations are known not to work. Because of this and
>> because of the ways the features are advertised to the host system
>> (which is a device tree with very platform specific properties),
>> this requires enabled POWERNV platform.
>>
>> This hardcodes the NVLink2 support for specific vendor and device IDs
>> as there is no reliable way of knowing about coherent memory and ATS
>> support. The GPU has a unique vendor PCIe capability 0x23 but it was
>> confirmed that it does not provide required information (and it is still
>> undisclosed what it actually does).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  drivers/vfio/pci/Makefile   |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>>  include/uapi/linux/vfio.h   |  18 ++
>>  drivers/vfio/pci/vfio_pci.c |  37 +++-
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 409 
>> 
>>  drivers/vfio/pci/Kconfig|   4 +
>>  6 files changed, 469 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>> 

Crash on FSL Book3E due to pte_pgprot()? (was Re: [PATCH v3 12/24] powerpc/mm: use pte helpers in generic code)

2018-10-16 Thread Michael Ellerman
Christophe Leroy  writes:

> Get rid of platform specific _PAGE_ in powerpc common code and
> use helpers instead.
>
> mm/dump_linuxpagetables.c will be handled separately
>
> Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/book3s/32/pgtable.h |  9 +++--
>  arch/powerpc/include/asm/nohash/32/pgtable.h | 12 
>  arch/powerpc/include/asm/nohash/pgtable.h|  3 +--
>  arch/powerpc/mm/pgtable.c| 21 +++--
>  arch/powerpc/mm/pgtable_32.c | 15 ---
>  arch/powerpc/mm/pgtable_64.c | 14 +++---
>  arch/powerpc/xmon/xmon.c | 12 +++-
>  7 files changed, 41 insertions(+), 45 deletions(-)

So turns out this patch *also* breaks my p5020ds :)

Even with patch 4 merged, see next.

It's the same crash:

  pcieport 2000:00:00.0: AER enabled with IRQ 480
  Unable to handle kernel paging request for data at address 0x88008008
  Faulting instruction address: 0xc00192cc
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE SMP NR_CPUS=24 CoreNet Generic
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc3-gcc7x-g98c847323b3a #1
  NIP:  c00192cc LR: c05d0f9c CTR: 0010
  REGS: c000f31bb400 TRAP: 0300   Not tainted  (4.19.0-rc3-gcc7x-g98c847323b3a)
  MSR:  80029000   CR: 24000224  XER: 
  DEAR: 88008008 ESR: 0080 IRQMASK: 0 
  GPR00: c05d0f84 c000f31bb688 c117dc00 88008008 
  GPR04:  0040 0ffbff241010 c000f31b8000 
  GPR08:  0010  c12d4710 
  GPR12: 84000422 c12ff000 c0002774  
  GPR16:     
  GPR20:     
  GPR24:   88008008 c00089a8 
  GPR28: c000f3576400 c000f3576410 0040 c12ecc98 
  NIP [c00192cc] ._memset_io+0x6c/0x9c
  LR [c05d0f9c] .fsl_qman_probe+0x198/0x928
  Call Trace:
  [c000f31bb688] [c05d0f84] .fsl_qman_probe+0x180/0x928 (unreliable)
  [c000f31bb728] [c06432ec] .platform_drv_probe+0x60/0xb4
  [c000f31bb7a8] [c064083c] .really_probe+0x294/0x35c
  [c000f31bb848] [c0640d2c] .__driver_attach+0x148/0x14c
  [c000f31bb8d8] [c063d7dc] .bus_for_each_dev+0xb0/0x118
  [c000f31bb988] [c063ff28] .driver_attach+0x34/0x4c
  [c000f31bba08] [c063f648] .bus_add_driver+0x174/0x2bc
  [c000f31bbaa8] [c06418bc] .driver_register+0x90/0x180
  [c000f31bbb28] [c0643270] .__platform_driver_register+0x60/0x7c
  [c000f31bbba8] [c0ee2a70] .fsl_qman_driver_init+0x24/0x38
  [c000f31bbc18] [c00023fc] .do_one_initcall+0x64/0x2b8
  [c000f31bbcf8] [c0e9f480] .kernel_init_freeable+0x3a8/0x494
  [c000f31bbda8] [c0002798] .kernel_init+0x24/0x148
  [c000f31bbe28] [c9e8] .ret_from_kernel_thread+0x58/0x70
  Instruction dump:
  4e800020 2ba50003 40dd003c 3925fffc 5488402e 7929f082 7d082378 39290001 
  550a801e 7d2903a6 7d4a4378 794a0020 <9143> 38630004 4200fff8 70a50003 


Comparing a working vs broken kernel, it seems to boil down to the fact
that we're filtering out more PTE bits now that we use pte_pgprot() in
ioremap_prot().

With the old code we get:
  ioremap_prot: addr 0xff80 flags 0x241215
  ioremap_prot: addr 0xff80 flags 0x241215
  map_kernel_page: ea 0x88008008 pa 0xff80 pte 0xff800241215


And now we get:
  ioremap_prot: addr 0xff80 flags 0x241215 pte 0x241215
  ioremap_prot: addr 0xff80 pte 0x241215
  ioremap_prot: addr 0xff80 prot 0x241014
  map_kernel_page: ea 0x88008008 pa 0xff80 pte 0xff800241014

So we're losing 0x201, which for nohash book3e is:

  #define _PAGE_PRESENT	0x000001 /* software: pte contains a translation */
  #define _PAGE_PSIZE_4K	0x000200


I haven't worked out if it's one or both of those that matter.
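
Spelling out where that 0x201 comes from, using the two PTE values above:

  0x241215 ^ 0x241014 = 0x000201
  0x000201 = _PAGE_PRESENT (0x000001) | _PAGE_PSIZE_4K (0x000200)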

The question is what's the right way to fix it? Should pte_pgprot() not
be filtering those bits out on book3e?

cheers


Re: [PATCH] powerpc/io: remove old GCC version implementation

2018-10-16 Thread Michael Ellerman
Joakim Tjernlund  writes:

> On Tue, 2018-10-16 at 12:33 +, Christophe Leroy wrote:
>> 
>> 
>> GCC 4.6 is the minimum supported now.
>
> Ouch, from kernel 4.19 or earlier even ?

Yes.

Though to be honest we haven't been testing much with compilers older
than that for several years.

If you care about something older please let us know :)

cheers


Re: [PATCH v06 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-16 Thread Michael Ellerman
Michael Bringmann  writes:
> On 10/16/2018 02:57 PM, Tyrel Datwyler wrote:
>> On 10/15/2018 05:39 PM, Michael Ellerman wrote:
>>> Michael Bringmann  writes:
 diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
 b/arch/powerpc/platforms/pseries/hotplug-memory.c
 index 2b796da..9c76345 100644
 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
 +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
 @@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
return rc;
  }

 +static int dlpar_memory_readd_multiple(void)
 +{
 +  struct drmem_lmb *lmb;
 +  int rc;
 +
 +  pr_info("Attempting to update multiple LMBs\n");
 +
 +  for_each_drmem_lmb(lmb) {
 +  if (drmem_lmb_update(lmb)) {
 +  rc = dlpar_memory_readd_helper(lmb);
 +  drmem_remove_lmb_update(lmb);
 +  }
 +  }
 +
 +  return rc;
 +}
>>>
>>> This leaves rc potentially uninitialised.
>>>
>>> What should the result be in that case, -EINVAL ?
>> 
>> On another note if there are multiple LMBs to update the value of rc only 
>> reflects the final dlpar_memory_readd_helper() call.
>
> Correct.  But that is what happens when we compress common code
> between two disparate uses i.e. updating memory association after
> a migration event with no reporting mechanism other than the console
> log, vs re-adding a single LMB by index for the purposes of DLPAR / drmgr.
>
> I could discard the return value from dlpar_memory_readd_helper entirely
> in this function and just return 0, but in my experience, once errors start
> to occur in memory dlpar ops, they tend to keep on occurring, so I was
> returning the last one.  We could also make the code smart enough to
> capture and return the first/last non-zero return code.  I didn't believe
> that the frequency of errors for this operation warranted the overhead.

The actual error value is probably not very relevant.

But dropping errors entirely is almost always a bad idea.

So I think you should at least return an error if any error occurred,
that way at least an error will be returned up to the caller(s).

Something like:

int rc;

rc = 0;
for_each_drmem_lmb(lmb) {
if (drmem_lmb_update(lmb)) {
rc |= dlpar_memory_readd_helper(lmb);
drmem_remove_lmb_update(lmb);
}
}

if (rc)
return -EIO;


cheers


Re: [PATCH v06 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-16 Thread Michael Ellerman
Tyrel Datwyler  writes:
> On 10/15/2018 05:39 PM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> index 2b796da..9c76345 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> @@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>> return rc;
>>>  }
>>>
>>> +static int dlpar_memory_readd_multiple(void)
>>> +{
>>> +   struct drmem_lmb *lmb;
>>> +   int rc;
>>> +
>>> +   pr_info("Attempting to update multiple LMBs\n");
>>> +
>>> +   for_each_drmem_lmb(lmb) {
>>> +   if (drmem_lmb_update(lmb)) {
>>> +   rc = dlpar_memory_readd_helper(lmb);
>>> +   drmem_remove_lmb_update(lmb);
>>> +   }
>>> +   }
>>> +
>>> +   return rc;
>>> +}
>> 
>> This leaves rc potentially uninitialised.
>> 
>> What should the result be in that case, -EINVAL ?
>
> On another note if there are multiple LMBs to update the value of rc
> only reflects the final dlpar_memory_readd_helper() call.

Good point.

cheers


Re: What is an address translation in PowerISA jargon?

2018-10-16 Thread Benjamin Herrenschmidt
On Tue, 2018-10-16 at 20:58 +0300, Raz wrote:
> Section 5.7.3
> "Storage accesses in real, hypervisor real, and virtual real
> addressing modes are performed in a manner that depends on the
> contents of MSR HV, VPM, VRMASD, HRMOR, RMLS, RMOR (see Chapter 2),
> bit 0 of the
> effective address (EA0),"
> 
> Hello
> 1. If MSR_IR = 0 and MSR_DR = 0, does it mean that addresses are not
> translated by the MMU ?

It depends.

If HV=1 (hypervisor mode), then they are only translated to the extent
that HRMOR is applied if the MSB is 0, and untranslated if the MSB is
1.

If HV=0 (guest mode), then they *are* translated but using a different
mechanism than what's normally used when IR/DR=1. This mechanism
depends on whether you are using the Radix or the Hash MMU, and the top
2 bits are ignored.

With hash MMU, it's using things like VRMASD etc... (RMOR is deprecated
afaik) to lookup a "Virtual real mode" area in the hash table. It's
essentially a mapping of the guest "physical" space to real physical
space. It's usually initialized (and maintained) by the HV but the
guest can extend it using things like H_ENTER afaik.

With the radix MMU, it's the guest physical space as mapped by the 2nd
level page tables maintained by the hypervisor.

> 2. If EA0 is the 63rd bit of the effective address, does this mean that
> the translation model is derived from the address? A non-privileged
> context may access privileged memory.

Nope.

Cheers,
Ben.




Re: [PATCH v5 3/6] clk: qoriq: increase array size of cmux_to_group

2018-10-16 Thread Stephen Boyd
Quoting Vabhav Sharma (2018-10-14 10:08:00)
> From: Yogesh Gaur 
> 
> Increase the size of the cmux_to_group array to accommodate the
> -1 termination entry.
> 
> Added a -1 termination entry for 4080_cmux_grpX.
> 
> Signed-off-by: Yogesh Gaur 
> Signed-off-by: Vabhav Sharma 
> ---

Acked-by: Stephen Boyd 



Re: [PATCH 4.18 086/135] KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size

2018-10-16 Thread Paul Mackerras
On Tue, Oct 16, 2018 at 07:05:16PM +0200, Greg Kroah-Hartman wrote:
> 4.18-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Nicholas Piggin 
> 
> [ Upstream commit 71d29f43b6332badc5598c656616a62575e83342 ]

If you take 71d29f43b633 then you also need 6579804c4317 ("KVM: PPC:
Book3S HV: Avoid crash from THP collapse during radix page fault",
2018-10-04).

Thanks,
Paul.


Re: [PATCH v06 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-16 Thread Michael Bringmann
On 10/16/2018 02:57 PM, Tyrel Datwyler wrote:
> On 10/15/2018 05:39 PM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> index 2b796da..9c76345 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> @@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>> return rc;
>>>  }
>>>
>>> +static int dlpar_memory_readd_multiple(void)
>>> +{
>>> +   struct drmem_lmb *lmb;
>>> +   int rc;
>>> +
>>> +   pr_info("Attempting to update multiple LMBs\n");
>>> +
>>> +   for_each_drmem_lmb(lmb) {
>>> +   if (drmem_lmb_update(lmb)) {
>>> +   rc = dlpar_memory_readd_helper(lmb);
>>> +   drmem_remove_lmb_update(lmb);
>>> +   }
>>> +   }
>>> +
>>> +   return rc;
>>> +}
>>
>> This leaves rc potentially uninitialised.
>>
>> What should the result be in that case, -EINVAL ?
> 
> On another note if there are multiple LMBs to update the value of rc only 
> reflects the final dlpar_memory_readd_helper() call.

Correct.  But that is what happens when we compress common code
between two disparate uses i.e. updating memory association after
a migration event with no reporting mechanism other than the console
log, vs re-adding a single LMB by index for the purposes of DLPAR / drmgr.

I could discard the return value from dlpar_memory_readd_helper entirely
in this function and just return 0, but in my experience, once errors start
to occur in memory dlpar ops, they tend to keep on occurring, so I was
returning the last one.  We could also make the code smart enough to
capture and return the first/last non-zero return code.  I didn't believe
that the frequency of errors for this operation warranted the overhead.

> 
> -Tyrel

Michael


-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: linux-next: Tree for Oct 15

2018-10-16 Thread Stephen Rothwell
Hi Mike,

On Tue, 16 Oct 2018 16:36:56 +0300 Mike Rapoport  wrote:
>
> After some more grepping and spatching I've found these:
> 
> From 8b014bae53a78ab747dbb76b9aff7df4cefcb604 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport 
> Date: Tue, 16 Oct 2018 16:03:00 +0300
> Subject: [PATCH] memblock: fix missed uses of implicit alignment
> 
> A couple of memblock*alloc uses were missed during conversion from implicit
> alignment setting with 'align = 0' to explicitly using SMP_CACHE_BYTES.
> Fix them now.
> 
> Signed-off-by: Mike Rapoport 

Thanks.

I have added that to linux-next today as it should get rid of the stack
trace we are seeing in PowerPC boots.

-- 
Cheers,
Stephen Rothwell




[PATCH] powerpc/ftrace: Handle large kernel configs

2018-10-16 Thread Naveen N. Rao
Currently, we expect to be able to reach ftrace_caller() from all
ftrace-enabled functions through a single relative branch. With large
kernel configs, we see functions outside of 32MB of ftrace_caller()
causing ftrace_init() to bail.

In such configurations, gcc/ld emits two types of trampolines for mcount():
1. A long_branch, which has a single branch to mcount() for functions that
   are one hop away from mcount():
c19e8544 <00031b56.long_branch._mcount>:
c19e8544:   4a 69 3f ac b   c007c4f0 <._mcount>

2. A plt_branch, for functions that are farther away from mcount():
c51f33f8 <0008ba04.plt_branch._mcount>:
c51f33f8:   3d 82 ff a4 addis   r12,r2,-92
c51f33fc:   e9 8c 04 20 ld  r12,1056(r12)
c51f3400:   7d 89 03 a6 mtctr   r12
c51f3404:   4e 80 04 20 bctr

We can reuse those trampolines for ftrace if we can have those
trampolines go to ftrace_caller() instead. However, with ABIv2, we
cannot depend on r2 being valid. As such, we use only the long_branch
trampolines by patching those to instead branch to ftrace_caller or
ftrace_regs_caller.
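
(For reference, a stand-alone sketch of the instruction rewrite involved -
forming a PPC I-form relative branch to a target such as ftrace_regs_caller,
assuming the target is within the +/-32MB reach mentioned above. The real
patch goes through create_branch()/patch_branch(); this is only to show the
encoding, not code from the patch:)

	/*
	 * Encode "b target" (opcode 18, AA=0, LK=0) for the instruction at ip.
	 * Returns 0 if the target is outside the signed 26-bit byte offset
	 * (+/-32MB) or not word aligned.
	 */
	static unsigned int make_rel_branch(unsigned long ip, unsigned long target)
	{
		long offset = (long)target - (long)ip;

		if (offset < -0x2000000 || offset > 0x1fffffc || (offset & 0x3))
			return 0;

		return 0x48000000 | (offset & 0x03fffffc);
	}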

In addition, we add additional trampolines around .text and .init.text
to catch locations that are covered by the plt branches. This allows
ftrace to work with most large kernel configurations.

For now, we always patch the trampolines to go to ftrace_regs_caller,
which is slightly inefficient. This can be optimized further at a later
point.

Signed-off-by: Naveen N. Rao 
---
Since RFC:
- Change to patch long_branch to go to ftrace_caller, rather than 
  patching mcount()
- Stop using plt_branch since it can't be relied on for ABIv2
- Add trampolines around .text and .init.text to catch remaining 
  locations

- Naveen

 arch/powerpc/kernel/trace/ftrace.c| 261 +-
 arch/powerpc/kernel/trace/ftrace_64.S |  12 ++
 arch/powerpc/kernel/vmlinux.lds.S |  13 +-
 3 files changed, 281 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 4bfbb54dee51..4bf051d3e21e 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -30,6 +30,16 @@
 
 
 #ifdef CONFIG_DYNAMIC_FTRACE
+
+/*
+ * We generally only have a single long_branch tramp and at most 2 or 3 plt
+ * tramps generated. But, we don't use the plt tramps currently. We also allot
+ * 2 tramps after .text and .init.text. So, we only end up with around 3 usable
+ * tramps in total. Set aside 8 just to be sure.
+ */
+#define NUM_FTRACE_TRAMPS	8
+static unsigned long ftrace_tramps[NUM_FTRACE_TRAMPS];
+
 static unsigned int
 ftrace_call_replace(unsigned long ip, unsigned long addr, int link)
 {
@@ -85,13 +95,16 @@ static int test_24bit_addr(unsigned long ip, unsigned long 
addr)
return create_branch((unsigned int *)ip, addr, 0);
 }
 
-#ifdef CONFIG_MODULES
-
 static int is_bl_op(unsigned int op)
 {
 	return (op & 0xfc000003) == 0x48000001;
 }
 
+static int is_b_op(unsigned int op)
+{
+	return (op & 0xfc000003) == 0x48000000;
+}
+
 static unsigned long find_bl_target(unsigned long ip, unsigned int op)
 {
static int offset;
@@ -104,6 +117,7 @@ static unsigned long find_bl_target(unsigned long ip, 
unsigned int op)
return ip + (long)offset;
 }
 
+#ifdef CONFIG_MODULES
 #ifdef CONFIG_PPC64
 static int
 __ftrace_make_nop(struct module *mod,
@@ -270,6 +284,146 @@ __ftrace_make_nop(struct module *mod,
 #endif /* PPC64 */
 #endif /* CONFIG_MODULES */
 
+static unsigned long find_ftrace_tramp(unsigned long ip)
+{
+   int i;
+
+   /*
+* We have the compiler generated long_branch tramps at the end
+* and we prefer those
+*/
+   for (i = NUM_FTRACE_TRAMPS - 1; i >= 0; i--)
+   if (!ftrace_tramps[i])
+   continue;
+   else if (create_branch((void *)ip, ftrace_tramps[i], 0))
+   return ftrace_tramps[i];
+
+   return 0;
+}
+
+static int add_ftrace_tramp(unsigned long tramp)
+{
+   int i;
+
+   for (i = 0; i < NUM_FTRACE_TRAMPS; i++)
+   if (!ftrace_tramps[i]) {
+   ftrace_tramps[i] = tramp;
+   return 0;
+   }
+
+   return -1;
+}
+
+/*
+ * If this is a compiler generated long_branch trampoline (essentially, a
+ * trampoline that has a branch to _mcount()), we re-write the branch to
+ * instead go to ftrace_[regs_]caller() and note down the location of this
+ * trampoline.
+ */
+static int setup_mcount_compiler_tramp(unsigned long tramp)
+{
+   int i, op;
+   unsigned long ptr;
+   static unsigned long ftrace_plt_tramps[NUM_FTRACE_TRAMPS];
+
+   /* Is this a known long jump tramp? */
+   for (i = 0; i < NUM_FTRACE_TRAMPS; i++)
+   if (!ftrace_tramps[i])
+   break;
+

Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-16 Thread Joel Fernandes
On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
> On 10/16/18 12:33 AM, Joel Fernandes wrote:
> > On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
> >> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> >>> Android needs to mremap large regions of memory during memory management
> >>> related operations.
> >>
> >> Just curious: why?
> > 
> > In Android we have a requirement of moving a large (up to a GB now, but may
> > grow bigger in future) memory range from one location to another.
> 
> I think Christoph's "why?" was about the requirement, not why it hurts
> applications. I admit I'm now also curious :)

This issue was discovered when we wanted to be able to move the physical
pages of a memory range to another location quickly so that, after the
application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
faults can be received on the original memory range. The actual operations
performed on the memory range are beyond the scope of this discussion. The
user threads continue to refer to the old addresses, which will now fault. The
reason we want to retain the old memory range and receive faults there is to
avoid having to fix up addresses all over the address space of the threads
after we finish performing operations on them in the fault handlers, so
we mremap it and receive faults at the old addresses.
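
(A rough, self-contained userspace sketch of that flow - names and error
handling are simplified and purely illustrative, not taken from the Android
code: move the pages aside with mremap(), re-map the old range, then register
it with userfaultfd in MISSING mode so accesses to the old addresses fault:)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int move_and_register(void *old_addr, void *new_addr, size_t len)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)old_addr, .len = len },
			.mode  = UFFDIO_REGISTER_MODE_MISSING,
		};

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			return -1;

		/* Move the physical pages; the old range becomes unmapped */
		if (mremap(old_addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
			   new_addr) == MAP_FAILED)
			return -1;

		/* Re-create an empty anonymous mapping at the old address... */
		if (mmap(old_addr, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
			return -1;

		/* ...and have accesses to it raise missing faults via uffd */
		return ioctl(uffd, UFFDIO_REGISTER, &reg);
	}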

Does that answer your question?

thanks,

- Joel



Re: linux-next: Tree for Oct 15

2018-10-16 Thread Mike Rapoport
On Mon, Oct 15, 2018 at 03:13:19PM -0700, Andrew Morton wrote:
> On Tue, 16 Oct 2018 07:24:39 +1100 Stephen Rothwell  
> wrote:
> 
> > On Tue, 16 Oct 2018 07:12:40 +1100 Stephen Rothwell  
> > wrote:
> > >
> > > On Mon, 15 Oct 2018 11:26:37 -0700 Guenter Roeck  
> > > wrote:
> > > >
> > > > All ppc qemu tests (including big endian pseries) also generate a
> > > > warning.
> > > > 
> > > > WARNING: CPU: 0 PID: 0 at mm/memblock.c:1301 
> > > > .memblock_alloc_range_nid+0x20/0x68
> > 
> > That is:
> > 
> > static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> > phys_addr_t align, phys_addr_t 
> > start,
> > phys_addr_t end, int nid,
> > enum memblock_flags flags)
> > {
> >if (WARN_ON_ONCE(!align))
> > align = SMP_CACHE_BYTES;
> > 
> > Looks like patch
> > 
> >   "memblock: stop using implicit alignment to SMP_CACHE_BYTES"
> > 
> > missed some places ...
> 
> To be expected, I guess.  I'm pretty relaxed about this ;) Let's do
> another sweep in a week or so, after which we'll have a couple of
> months to mop up any leftovers.

After some more grepping and spatching I've found these:

From 8b014bae53a78ab747dbb76b9aff7df4cefcb604 Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Tue, 16 Oct 2018 16:03:00 +0300
Subject: [PATCH] memblock: fix missed uses of implicit alignment

A couple of memblock*alloc uses were missed during conversion from implicit
alignment setting with 'align = 0' to explicitly using SMP_CACHE_BYTES.
Fix them now.

Signed-off-by: Mike Rapoport 
---
 arch/powerpc/kernel/paca.c| 2 +-
 drivers/firmware/efi/memmap.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index f331a00..913bfca 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -198,7 +198,7 @@ void __init allocate_paca_ptrs(void)
paca_nr_cpu_ids = nr_cpu_ids;
 
paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
-   paca_ptrs = __va(memblock_phys_alloc(paca_ptrs_size, 0));
+   paca_ptrs = __va(memblock_phys_alloc(paca_ptrs_size, SMP_CACHE_BYTES));
memset(paca_ptrs, 0x88, paca_ptrs_size);
 }
 
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index ef618bc..fa2904f 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -15,7 +15,7 @@
 
 static phys_addr_t __init __efi_memmap_alloc_early(unsigned long size)
 {
-   return memblock_phys_alloc(size, 0);
+   return memblock_phys_alloc(size, SMP_CACHE_BYTES);
 }
 
 static phys_addr_t __init __efi_memmap_alloc_late(unsigned long size)
-- 
2.7.4

 

-- 
Sincerely yours,
Mike.



Re: [PATCH v06 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-16 Thread Tyrel Datwyler
On 10/15/2018 05:39 PM, Michael Ellerman wrote:
> Michael Bringmann  writes:
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 2b796da..9c76345 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>  return rc;
>>  }
>>
>> +static int dlpar_memory_readd_multiple(void)
>> +{
>> +struct drmem_lmb *lmb;
>> +int rc;
>> +
>> +pr_info("Attempting to update multiple LMBs\n");
>> +
>> +for_each_drmem_lmb(lmb) {
>> +if (drmem_lmb_update(lmb)) {
>> +rc = dlpar_memory_readd_helper(lmb);
>> +drmem_remove_lmb_update(lmb);
>> +}
>> +}
>> +
>> +return rc;
>> +}
> 
> This leaves rc potentially uninitialised.
> 
> What should the result be in that case, -EINVAL ?

On another note if there are multiple LMBs to update the value of rc only 
reflects the final dlpar_memory_readd_helper() call.

-Tyrel

> 
> cheers
> 



Re: [PATCH kernel 3/3] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

2018-10-16 Thread Alex Williamson
On Mon, 15 Oct 2018 20:42:33 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but implement PCIe links for config space and MMIO.
> In addition to that the GPUs are interconnected to each other and also
> have direct links to the P9 CPU. The links are NVLink2 and provide direct
> access to the system RAM for GPUs via NPU (an NVLink2 "proxy" on P9 chip).
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB in tested config) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new PCI region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the NVIDIA driver is loaded and has trained the links, so doing this
> earlier produces low-level errors which we fence in the firmware, so
> it does not hurt the host system, but it is still better to avoid.
> 
> This exports ATSD (Address Translation Shootdown) register of NPU which
> allows the guest to invalidate TLB. The register conveniently occupies
> a single 64k page. Since NPU maps the GPU memory, it has a "tgt" property
> (which is an abbreviated host system bus address). This exports the "tgt"
> as a capability so the guest can program it into the GPU so the GPU can
> know how to route DMA traffic.

I'm not really following what "tgt" is and why it's needed.  Is the GPU
memory here different than the GPU RAM region above?  Why does the user
need the host system bus address of this "tgt" thing?  Are we not able
to relocate it in guest physical address space, does this shootdown
only work in the host physical address space and therefore we need this
offset?  Please explain, I'm confused.
 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of an userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> This hardcodes the NVLink2 support for specific vendor and device IDs
> as there is no reliable way of knowing about coherent memory and ATS
> support. The GPU has a unique vendor PCIe capability 0x23 but it was
> confirmed that it does not provide required information (and it is still
> undisclosed what it actually does).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/uapi/linux/vfio.h   |  18 ++
>  drivers/vfio/pci/vfio_pci.c |  37 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 409 
> 
>  drivers/vfio/pci/Kconfig|   4 +
>  6 files changed, 469 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..7639241 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,6 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index f378b98..9e9a8d3 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -303,6 +303,12 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG   (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
> +/* NVIDIA GPU NVlink2 RAM */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
> +
> +/* IBM NPU NVlink2 ATSD */

What is an address translation in PowerISA jargon?

2018-10-16 Thread Raz
Section 5.7.3
"Storage accesses in real, hypervisor real, and virtual real
addressing modes are performed in a manner that depends on the
contents of MSR HV, VPM, VRMASD, HRMOR, RMLS, RMOR (see Chapter 2),
bit 0 of the
effective address (EA0),"

Hello
1. If MSR_IR = 0 and MSR_DR = 0, does it mean that addresses are not
translated by the MMU ?
2. If EA0 is the 63rd bit of the effective address, does this mean that
the translation model is derived from the address? A non-privileged
context may access privileged memory.

thank you


[PATCH 4.18 086/135] KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size

2018-10-16 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicholas Piggin 

[ Upstream commit 71d29f43b6332badc5598c656616a62575e83342 ]

THP paths can defer splitting compound pages until after the actual
remap and TLB flushes to split a huge PMD/PUD. This causes radix
partition scope page table mappings to get out of synch with the host
qemu page table mappings.

This results in random memory corruption in the guest when running
with THP. The easiest way to reproduce is use KVM balloon to free up
a lot of memory in the guest and then shrink the balloon to give the
memory back, while some work is being done in the guest.

Cc: David Gibson 
Cc: "Aneesh Kumar K.V" 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   91 +
 1 file changed, 37 insertions(+), 54 deletions(-)

--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -538,8 +538,8 @@ int kvmppc_book3s_radix_page_fault(struc
   unsigned long ea, unsigned long dsisr)
 {
struct kvm *kvm = vcpu->kvm;
-   unsigned long mmu_seq, pte_size;
-   unsigned long gpa, gfn, hva, pfn;
+   unsigned long mmu_seq;
+   unsigned long gpa, gfn, hva;
struct kvm_memory_slot *memslot;
struct page *page = NULL;
long ret;
@@ -636,9 +636,10 @@ int kvmppc_book3s_radix_page_fault(struc
 */
hva = gfn_to_hva_memslot(memslot, gfn);
 	if (upgrade_p && __get_user_pages_fast(hva, 1, 1, &page) == 1) {
-   pfn = page_to_pfn(page);
upgrade_write = true;
} else {
+   unsigned long pfn;
+
/* Call KVM generic code to do the slow-path check */
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
   writing, upgrade_p);
@@ -652,63 +653,45 @@ int kvmppc_book3s_radix_page_fault(struc
}
}
 
-   /* See if we can insert a 1GB or 2MB large PTE here */
-   level = 0;
-   if (page && PageCompound(page)) {
-   pte_size = PAGE_SIZE << compound_order(compound_head(page));
-   if (pte_size >= PUD_SIZE &&
-   (gpa & (PUD_SIZE - PAGE_SIZE)) ==
-   (hva & (PUD_SIZE - PAGE_SIZE))) {
-   level = 2;
-   pfn &= ~((PUD_SIZE >> PAGE_SHIFT) - 1);
-   } else if (pte_size >= PMD_SIZE &&
-  (gpa & (PMD_SIZE - PAGE_SIZE)) ==
-  (hva & (PMD_SIZE - PAGE_SIZE))) {
-   level = 1;
-   pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1);
-   }
-   }
-
/*
-* Compute the PTE value that we need to insert.
+* Read the PTE from the process' radix tree and use that
+* so we get the shift and attribute bits.
 */
-   if (page) {
-   pgflags = _PAGE_READ | _PAGE_EXEC | _PAGE_PRESENT | _PAGE_PTE |
-   _PAGE_ACCESSED;
-   if (writing || upgrade_write)
-   pgflags |= _PAGE_WRITE | _PAGE_DIRTY;
-   pte = pfn_pte(pfn, __pgprot(pgflags));
+   local_irq_disable();
+	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
+   pte = *ptep;
+   local_irq_enable();
+
+   /* Get pte level from shift/size */
+   if (shift == PUD_SHIFT &&
+   (gpa & (PUD_SIZE - PAGE_SIZE)) ==
+   (hva & (PUD_SIZE - PAGE_SIZE))) {
+   level = 2;
+   } else if (shift == PMD_SHIFT &&
+  (gpa & (PMD_SIZE - PAGE_SIZE)) ==
+  (hva & (PMD_SIZE - PAGE_SIZE))) {
+   level = 1;
} else {
-   /*
-* Read the PTE from the process' radix tree and use that
-* so we get the attribute bits.
-*/
-   local_irq_disable();
-	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-   pte = *ptep;
-   local_irq_enable();
-   if (shift == PUD_SHIFT &&
-   (gpa & (PUD_SIZE - PAGE_SIZE)) ==
-   (hva & (PUD_SIZE - PAGE_SIZE))) {
-   level = 2;
-   } else if (shift == PMD_SHIFT &&
-  (gpa & (PMD_SIZE - PAGE_SIZE)) ==
-  (hva & (PMD_SIZE - PAGE_SIZE))) {
-   level = 1;
-   } else if (shift && shift != PAGE_SHIFT) {
-   /* Adjust PFN */
-   unsigned long mask = (1ul << shift) - PAGE_SIZE;
-   pte = __pte(pte_val(pte) | (hva & mask));
-   }
-   pte = __pte(pte_val(pte) | _PAGE_EXEC | 

[PATCH v2] powerpc/topology: Update numa mask when cpu node mapping changes

2018-10-16 Thread Srikar Dronamraju
Commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot") will update the cpu node topology for shared lpars
on PowerVM.

However shared lpars on PowerVM also support VPHN and PRRN events.
On receiving a VPHN, PRRN events, cpu to node mapping might change.

Scheduler maintains sched_domains_numa_masks[], which is currently not
updated on cpu to node mapping changes. This can lead to machine
hangs or performance hit.

Fix numa_update_cpu_topology() to update sched_domains_numa_masks[].

Signed-off-by: Srikar Dronamraju 
---
Changelog: v1->v2:
- Updated changelog.
- Updated comments to refer to topology_inited.

 arch/powerpc/mm/numa.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3ea5b05ee6be..767c363ebafa 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1424,6 +1425,20 @@ int numa_update_cpu_topology(bool cpus_locked)
changed = 1;
}
 
+   /*
+* Scheduler maintains a sched_domain_numa_masks table that needs to
+* be updated whenever cpu-node mapping changes. Cpu-node mapping
+* changes only with VPHN and PRRN. 'topology_inited' indicates if
+* cpu-hotplug infrastructure is initialized and good to use.
+*/
+   if (!(topology_inited && (vphn_enabled || prrn_enabled)))
+   goto out;
+
+	for_each_cpu(cpu, &updated_cpus) {
+   sched_cpu_deactivate(cpu);
+   sched_cpu_activate(cpu);
+   }
+
 out:
kfree(updates);
topology_update_needed = 0;
-- 
2.12.3



Re: Stack protector crash in pnv_smp_cpu_kill_self()

2018-10-16 Thread Segher Boessenkool
On Wed, Oct 17, 2018 at 12:21:50AM +1100, Michael Ellerman wrote:
> Christophe LEROY  writes:
> 
> > Looks like a lack of initialisation of the canary for the non-boot CPUs 
> > on SMP, you applied this morning the patch I sent you for that.
> >
> > Is the patch in ?
> 
> Yeah it is.
> 
>   $ git log --oneline 4ffe713b7587 arch/powerpc/kernel/smp.c
>   8e8a31d7fd54 powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores
>   425752c63b6f powerpc: Detect the presence of big-cores via "ibm,thread-groups"
>   7241d26e8175 powerpc/64: properly initialise the stackprotector canary on SMP.
> 
> 
> It only happens on a specific Power9 machine, not in sim, but it's 100%
> reproducible on that hardware.
> 
> The canary value has changed (?!).
> 
> The value in paca->canary and current->canary agree, but they don't
> match what's in the stack.
> 
> Clearly the idle code is doing something I don't understand :)

Did something actually corrupt the stack?


Segher


Re: [PATCH v06 3/5] migration/memory: Add hotplug READD_MULTIPLE

2018-10-16 Thread Michael Bringmann
On 10/15/2018 07:39 PM, Michael Ellerman wrote:
> Michael Bringmann  writes:
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 2b796da..9c76345 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
>>  return rc;
>>  }
>>
>> +static int dlpar_memory_readd_multiple(void)
>> +{
>> +struct drmem_lmb *lmb;
>> +int rc;
>> +
>> +pr_info("Attempting to update multiple LMBs\n");
>> +
>> +for_each_drmem_lmb(lmb) {
>> +if (drmem_lmb_update(lmb)) {
>> +rc = dlpar_memory_readd_helper(lmb);
>> +drmem_remove_lmb_update(lmb);
>> +}
>> +}
>> +
>> +return rc;
>> +}
> 
> This leaves rc potentially uninitialised.
> 
> What should the result be in that case, -EINVAL ?

I will force it to be zero (0).  Failure to find anything
to update is not an error.

> 
> cheers

Thanks.
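
For reference, a sketch of the helper with rc initialised as agreed above
(the helper names are the ones introduced earlier in this series; this is an
illustration of the fix, not the resubmitted patch):

static int dlpar_memory_readd_multiple(void)
{
	struct drmem_lmb *lmb;
	int rc = 0;	/* finding nothing to update is not an error */

	pr_info("Attempting to update multiple LMBs\n");

	for_each_drmem_lmb(lmb) {
		if (drmem_lmb_update(lmb)) {
			rc = dlpar_memory_readd_helper(lmb);
			drmem_remove_lmb_update(lmb);
		}
	}

	return rc;
}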

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: Stack protector crash in pnv_smp_cpu_kill_self()

2018-10-16 Thread Michael Ellerman
Christophe LEROY  writes:

> Looks like a lack of initialisation of the canary for the non-boot CPUs 
> on SMP, you applied this morning the patch I sent you for that.
>
> Is the patch in ?

Yeah it is.

  $ git log --oneline 4ffe713b7587 arch/powerpc/kernel/smp.c
  8e8a31d7fd54 powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores
  425752c63b6f powerpc: Detect the presence of big-cores via "ibm,thread-groups"
  7241d26e8175 powerpc/64: properly initialise the stackprotector canary on SMP.


It only happens on a specific Power9 machine, not in sim, but it's 100%
reproducible on that hardware.

The canary value has changed (?!).

The value in paca->canary and current->canary agree, but they don't
match what's in the stack.

Clearly the idle code is doing something I don't understand :)

cheers


Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: 
pnv_smp_cpu_kill_self+0x2a0/0x2b0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
4.19.0-rc3-gcc-7.3.1-00190-g98c847323b3a-dirty #103
Call Trace:
[c7967b00] [c0ae864c] dump_stack+0xb0/0xf4 (unreliable)
[c7967b40] [c00e64ac] panic+0x144/0x328
[c7967be0] [c00e5f2c] __stack_chk_fail+0x2c/0x30
[c7967c40] [c009f720] pnv_smp_cpu_kill_self+0x2a0/0x2b0
[c7967e10] [c00475d8] cpu_die+0x48/0x70
[c7967e30] [c0020620] arch_cpu_idle_dead+0x20/0x40
[c7967e50] [c012e574] do_idle+0x274/0x390
[c7967ec0] [c012e8e8] cpu_startup_entry+0x38/0x50
[c7967ef0] [c0047314] start_secondary+0x5e4/0x600
[c7967f90] [c000ac70] start_secondary_prolog+0x10/0x14


c009f480 :
c009f480:   f9 00 4c 3c addis   r2,r12,249
c009f484:   80 77 42 38 addir2,r2,30592
c009f488:   a6 02 08 7c mflrr0
c009f48c:   2d 48 fc 4b bl  c0063cb8 <_mcount>
c009f490:   a6 02 08 7c mflrr0
c009f494:   c0 ff 01 fb std r24,-64(r1)
c009f498:   c8 ff 21 fb std r25,-56(r1)
c009f49c:   d0 ff 41 fb std r26,-48(r1)
c009f4a0:   d8 ff 61 fb std r27,-40(r1)
c009f4a4:   e0 ff 81 fb std r28,-32(r1)
c009f4a8:   e8 ff a1 fb std r29,-24(r1)
c009f4ac:   f0 ff c1 fb std r30,-16(r1)
c009f4b0:   f8 ff e1 fb std r31,-8(r1)
c009f4b4:   10 00 01 f8 std r0,16(r1)
c009f4b8:   31 fe 21 f8 stdur1,-464(r1) -> 
c7967e10 - 464 = c7967c40 
c009f4bc:   e8 0c 2d e9 ld  r9,3304(r13)
c009f4c0:   88 01 21 f9 std r9,392(r1)  
c7967c40 + 392 = c7967dc8

paca->canary= 31fc80016f07fb00  (0xce8)

current->canary =
1:mon> d8 c77ef150
c77ef150 31fc80016f07fb00 c00026600080

1:mon> d8 %r1
c7967a90 c7967af0 0001
c7967aa0 c00c2c00 c1036c00
c7967ab0 fffe c10ff9b0
c7967ac0 c10ff9b0 
1:mon> 
c7967ad0  
c7967ae0 c0f05330 c00c2bd0
c7967af0 c7967b40 31fc80016f07fb00
  

pnv_smp_cpu_kill_self frame:
1:mon> d8 c7967c40
c7967c40 c7967e10 00063fec
c7967c50 c009f720 f6251c2ce0f21b00
c7967c60 c01fec9f0280 c7bc3680
c7967c70 c7967ca0 c77eeb80
c7967c80 c012de20 c001ee7c
c7967c90 c7967cb0 c0ef2b80
c7967ca0 c7967d50 0001
c7967cb0 c7967d90 24028222
c7967cc0 c017b0a0 0001
c7967cd0 c7967d50 0001
c7967ce0 c7967d20 c1036c00
c7967cf0 c7967d30 c01ffe84d980
c7967d00 c001ee7c 001ffd97
c7967d10 c7967d30 c01ffe862b80
c7967d20  c0044c50
c7967d30 c01ffe8451e0 c7bc3680
c7967d40 c7967d60 
c7967d50 c7967d90 c1036c00
c7967d60 c7967d90 c0f3bc80
c7967d70 c01ae0f8 c0f12880
c7967d80 001ffd97 c7967dc0
c7967d90 c7967e10 0004
c7967da0 c01ae2b8 c01ffe84e050
c7967db0 c7967e50 c1070174
c7967dc0  f6251c2ce0f21b00
  
  canary

c7967dd0  0004
c7967de0 0008 0002
c7967df0 c10700b8 0001
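
For context, the compiler-inserted check that fires above amounts to roughly the
following C. This is purely illustrative; READ_PACA_CANARY() is a stand-in for
the "ld r9,3304(r13)" in the prologue, which reads paca->canary through r13:

void some_ppc64_function(void)
{
	unsigned long frame_canary = READ_PACA_CANARY();	/* std r9,392(r1) */

	/* ... function body using the large stack frame ... */

	if (frame_canary != READ_PACA_CANARY())
		__stack_chk_fail();	/* -> "Kernel stack is corrupted" panic */
}

So the panic means the copy saved in the frame (c7967dc8 above) no longer
matches paca->canary, which is exactly what the dumps are comparing.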

Re: [PATCH] powerpc/io: remove old GCC version implementation

2018-10-16 Thread Christophe LEROY




Le 16/10/2018 à 14:49, Joakim Tjernlund a écrit :

On Tue, 2018-10-16 at 12:33 +, Christophe Leroy wrote:



GCC 4.6 is the minimum supported now.


Ouch, from kernel 4.19 or earlier even ?



Don't know, see 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cafa0010


Christophe


[PATCH v2 5/5] mm: optimise pte dirty/accessed bit setting by demand based pte insertion

2018-10-16 Thread Nicholas Piggin
Similarly to the previous patch, this tries to optimise dirty/accessed
bits in ptes to avoid access costs of hardware setting them.

Signed-off-by: Nicholas Piggin 
---
 mm/huge_memory.c | 12 
 mm/memory.c  |  9 ++---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f43265204d4..38c2cd3b4879 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1197,6 +1197,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct 
vm_fault *vmf,
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
+   entry = pte_mkyoung(entry);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -2067,7 +2068,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
-   bool young, write, soft_dirty, pmd_migration = false;
+   bool young, write, dirty, soft_dirty, pmd_migration = false;
unsigned long addr;
int i;
 
@@ -2145,7 +2146,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
page = pmd_page(old_pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
-   if (pmd_dirty(old_pmd))
+   dirty = pmd_dirty(old_pmd);
+   if (dirty)
SetPageDirty(page);
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
@@ -2176,8 +2178,10 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
entry = maybe_mkwrite(entry, vma);
if (!write)
entry = pte_wrprotect(entry);
-   if (!young)
-   entry = pte_mkold(entry);
+   if (young)
+   entry = pte_mkyoung(entry);
+   if (dirty)
+   entry = pte_mkdirty(entry);
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
}
diff --git a/mm/memory.c b/mm/memory.c
index 9e314339a0bd..f907ea7a6303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1804,10 +1804,9 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
entry = pte_mkspecial(pfn_t_pte(pfn, prot));
 
 out_mkwrite:
-   if (mkwrite) {
-   entry = pte_mkyoung(entry);
+   entry = pte_mkyoung(entry);
+   if (mkwrite)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-   }
 
set_pte_at(mm, addr, pte, entry);
update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
@@ -2534,6 +2533,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
+   entry = pte_mkyoung(entry);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/*
 * Clear the pte entry and flush it first, before updating the
@@ -3043,6 +3043,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3185,6 +3186,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
__SetPageUptodate(page);
 
entry = mk_pte(page, vma->vm_page_prot);
+   entry = pte_mkyoung(entry);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
 
@@ -3453,6 +3455,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
mem_cgroup *memcg,
 
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
+   entry = pte_mkyoung(entry);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/* copy-on-write page */
-- 
2.18.0



[PATCH v2 4/5] mm/cow: optimise pte dirty bit handling in fork

2018-10-16 Thread Nicholas Piggin
fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes when
it may have been quite important. Today with physical based pte
scanning, there is less reason to clear these bits, so this patch
avoids clearing the dirty bit in the child.

Dirty bits are all tested and cleared together, and any dirty bit is
the same as many dirty bits, so from a correctness and writeback
bandwidth point-of-view it does not matter if the child gets a dirty
bit.

Dirty ptes are more costly to unmap because they require flushing
under the page table lock, but it is pretty rare to have a shared
dirty mapping that is copied on fork, so just simplify the code and
avoid this dirty clearing logic.

Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0387ee1e3582..9e314339a0bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1028,11 +1028,12 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
}
 
/*
-* If it's a shared mapping, mark it clean in
-* the child
+* Child inherits dirty and young bits from parent. There is no
+* point clearing them because any cleaning or aging has to walk
+* all ptes anyway, and it will notice the bits set in the parent.
+* Leaving them set avoids stalls and even page faults on CPUs that
+* handle these bits in software.
 */
-   if (vm_flags & VM_SHARED)
-   pte = pte_mkclean(pte);
 
page = vm_normal_page(vma, addr, pte);
if (page) {
-- 
2.18.0



[PATCH v2 3/5] mm/cow: optimise pte accessed bit handling in fork

2018-10-16 Thread Nicholas Piggin
fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes when
it may have been quite important. Today with physical based pte
scanning, there is less reason to clear these bits, so this patch
avoids clearing the accessed bit in the child.

Any accessed bit is treated similarly to many; the difference today is
that more than one referenced bit causes the page to be activated, while
a single bit causes it to be kept. This patch causes pages shared by fork(2)
to be more readily activated, but this heuristic is very fuzzy anyway
-- a page can be accessed by multiple threads via a single pte and be
just as important as one that is accessed via multiple ptes, for
example. In the end I don't believe fork(2) is a significant enough driver of
page reclaim behaviour for this to matter too much.

This and the following change eliminate a major source of faults that
powerpc/radix requires to set dirty/accessed bits in ptes, speeding
up a fork/exit microbenchmark by about 5% on POWER9 (16600 -> 17500
fork/execs per second).

Skylake appears to have a micro-fault overhead too -- a test which
allocates 4GB anonymous memory, reads each page, then forks, and times
the child reading a byte from each page. The first pass over the pages
takes about 1000 cycles per page, the second pass takes about 27
cycles (TLB miss). With no additional minor faults measured due to
either child pass, and the page array well exceeding TLB capacity, the
large cost must be caused by micro faults caused by setting accessed
bit.

Signed-off-by: Nicholas Piggin 
---
 mm/huge_memory.c | 2 --
 mm/memory.c  | 1 -
 mm/vmscan.c  | 8 
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fb0e3025f98..1f43265204d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -977,7 +977,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_wrprotect(pmd);
}
-   pmd = pmd_mkold(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
ret = 0;
@@ -1071,7 +1070,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_wrprotect(pud);
}
-   pud = pud_mkold(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
 
ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..0387ee1e3582 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1033,7 +1033,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
 */
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
-   pte = pte_mkold(pte);
 
page = vm_normal_page(vma, addr, pte);
if (page) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5ef7240cbcb..e72d5b3336a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1031,6 +1031,14 @@ static enum page_references page_check_references(struct 
page *page,
 * to look twice if a mapped file page is used more
 * than once.
 *
+* fork() will set referenced bits in child ptes despite
+* not having been accessed, to avoid micro-faults of
+* setting accessed bits. This heuristic is not perfectly
+* accurate in other ways -- multiple map/unmap in the
+* same time window would be treated as multiple references
+* despite same number of actual memory accesses made by
+* the program.
+*
 * Mark it and spare it for another trip around the
 * inactive list.  Another page table reference will
 * lead to its activation.
-- 
2.18.0



[PATCH v2 2/5] mm/cow: don't bother write protecting already write-protected huge pages

2018-10-16 Thread Nicholas Piggin
This is the HugePage / THP equivalent for 1b2de5d039c8 ("mm/cow: don't
bother write protecting already write-protected pages").

Signed-off-by: Nicholas Piggin 
---
 mm/huge_memory.c | 14 ++
 mm/hugetlb.c |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 58269f8ba7c4..0fb0e3025f98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -973,8 +973,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
-   pmdp_set_wrprotect(src_mm, addr, src_pmd);
-   pmd = pmd_mkold(pmd_wrprotect(pmd));
+   if (pmd_write(pmd)) {
+   pmdp_set_wrprotect(src_mm, addr, src_pmd);
+   pmd = pmd_wrprotect(pmd);
+   }
+   pmd = pmd_mkold(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
ret = 0;
@@ -1064,8 +1067,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
/* No huge zero pud yet */
}
 
-   pudp_set_wrprotect(src_mm, addr, src_pud);
-   pud = pud_mkold(pud_wrprotect(pud));
+   if (pud_write(pud)) {
+   pudp_set_wrprotect(src_mm, addr, src_pud);
+   pud = pud_wrprotect(pud);
+   }
+   pud = pud_mkold(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
 
ret = 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5c390f5a5207..54a4dcb6ee21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3287,7 +3287,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct 
mm_struct *src,
}
set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
} else {
-   if (cow) {
+   if (cow && huge_pte_write(entry)) {
/*
 * No need to notify as we are downgrading page
 * table protection not changing it to point
-- 
2.18.0



[PATCH v2 1/5] nios2: update_mmu_cache clear the old entry from the TLB

2018-10-16 Thread Nicholas Piggin
Fault paths like do_read_fault will install a Linux pte with the young
bit clear. The CPU will fault again because the TLB has not been
updated, this time a valid pte exists so handle_pte_fault will just
set the young bit with ptep_set_access_flags, which flushes the TLB.

The TLB is flushed so the next attempt will go to the fast TLB handler
which loads the TLB with the new Linux pte. The access then proceeds.

This design is fragile in that it depends on the young bit being clear after
the initial Linux fault. A proposed core mm change to immediately set
the young bit upon such a fault, results in ptep_set_access_flags not
flushing the TLB because it finds no change to the pte. The spurious
fault fix path only flushes the TLB if the access was a store. If it
was a load, then this results in an infinite loop of page faults.

This change adds a TLB flush in update_mmu_cache, which removes that
TLB entry upon the first fault. This will cause the fast TLB handler
to load the new pte and avoid the Linux page fault entirely.

Reviewed-by: Ley Foon Tan 
Signed-off-by: Nicholas Piggin 
---
 arch/nios2/mm/cacheflush.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/mm/cacheflush.c b/arch/nios2/mm/cacheflush.c
index 506f6e1c86d5..d58e7e80dc0d 100644
--- a/arch/nios2/mm/cacheflush.c
+++ b/arch/nios2/mm/cacheflush.c
@@ -204,6 +204,8 @@ void update_mmu_cache(struct vm_area_struct *vma,
struct page *page;
struct address_space *mapping;
 
+   flush_tlb_page(vma, address);
+
if (!pfn_valid(pfn))
return;
 
-- 
2.18.0



[PATCH v2 0/5] mm: dirty/accessed pte optimisations

2018-10-16 Thread Nicholas Piggin
Since v1 I fixed the hang in nios2, split the fork patch into two
as Linus asked, and added hugetlb code for the "don't bother write
protecting already writeprotected" patch.

Please consider this for more cooking in -mm.

Thanks,
Nick

Nicholas Piggin (5):
  nios2: update_mmu_cache clear the old entry from the TLB
  mm/cow: don't bother write protecting already write-protected huge
pages
  mm/cow: optimise pte accessed bit handling in fork
  mm/cow: optimise pte dirty bit handling in fork
  mm: optimise pte dirty/accessed bit setting by demand based pte
insertion

 arch/nios2/mm/cacheflush.c |  2 ++
 mm/huge_memory.c   | 24 
 mm/hugetlb.c   |  2 +-
 mm/memory.c| 19 +++
 mm/vmscan.c|  8 
 5 files changed, 38 insertions(+), 17 deletions(-)

-- 
2.18.0



Re: [PATCH] powerpc/io: remove old GCC version implementation

2018-10-16 Thread Joakim Tjernlund
On Tue, 2018-10-16 at 12:33 +, Christophe Leroy wrote:
> 
> 
> GCC 4.6 is the minimum supported now.

Ouch, from kernel 4.19 or earlier even ?

 Jocke


[PATCH] powerpc/io: remove old GCC version implementation

2018-10-16 Thread Christophe Leroy
GCC 4.6 is the minimum supported now.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/io.h | 20 
 1 file changed, 20 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 0a034519957d..3ef40b703c4a 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -111,25 +111,6 @@ extern bool isa_io_special;
 #define IO_SET_SYNC_FLAG()
 #endif
 
-/* gcc 4.0 and older doesn't have 'Z' constraint */
-#if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ == 0)
-#define DEF_MMIO_IN_X(name, size, insn)\
-static inline u##size name(const volatile u##size __iomem *addr)   \
-{  \
-   u##size ret;\
-   __asm__ __volatile__("sync;"#insn" %0,0,%1;twi 0,%0,0;isync"\
-   : "=r" (ret) : "r" (addr), "m" (*addr) : "memory"); \
-   return ret; \
-}
-
-#define DEF_MMIO_OUT_X(name, size, insn)   \
-static inline void name(volatile u##size __iomem *addr, u##size val)   \
-{  \
-   __asm__ __volatile__("sync;"#insn" %1,0,%2" \
-   : "=m" (*addr) : "r" (val), "r" (addr) : "memory"); \
-   IO_SET_SYNC_FLAG(); \
-}
-#else /* newer gcc */
 #define DEF_MMIO_IN_X(name, size, insn)\
 static inline u##size name(const volatile u##size __iomem *addr)   \
 {  \
@@ -146,7 +127,6 @@ static inline void name(volatile u##size __iomem *addr, 
u##size val)\
: "=Z" (*addr) : "r" (val) : "memory"); \
IO_SET_SYNC_FLAG(); \
 }
-#endif
 
 #define DEF_MMIO_IN_D(name, size, insn)\
 static inline u##size name(const volatile u##size __iomem *addr)   \
-- 
2.13.3



Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-16 Thread Vlastimil Babka
On 10/16/18 12:33 AM, Joel Fernandes wrote:
> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>> Android needs to mremap large regions of memory during memory management
>>> related operations.
>>
>> Just curious: why?
> 
> In Android we have a requirement of moving a large (up to a GB now, but may
> grow bigger in future) memory range from one location to another.

I think Christoph's "why?" was about the requirement, not why it hurts
applications. I admit I'm now also curious :)

> This move
> operation has to happen when the application threads are paused for this
> operation. Therefore, an inefficient move like it is now (for example 250ms
> on arm64) will cause response time issues for applications, which is not
> acceptable. Huge pages cannot be used in such memory ranges to avoid this
> inefficiency as (when the application threads are running) our fault handlers
> are designed to process 4KB pages at a time, to keep response times low. So
> using huge pages in this context can, again, cause response time issues.
> 
> Also, the mremap syscall waiting for quarter of a second for a large mremap
> is quite weird and we ought to improve it where possible.


[PATCH v3] powerpc/pseries: Export raw per-CPU VPA data via debugfs

2018-10-16 Thread Aravinda Prasad
This patch exports the raw per-CPU VPA data via debugfs.
A per-CPU file is created which exports the VPA data of
that CPU to help debug some of the VPA related issues or
to analyze the per-CPU VPA related statistics.

v3: Removed offline CPU check.

v2: Included offline CPU check and other review comments.

Signed-off-by: Aravinda Prasad 
---
 arch/powerpc/platforms/pseries/lpar.c |   54 +
 1 file changed, 54 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index d3992ce..1c757a4 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -1028,3 +1029,56 @@ static int __init reserve_vrma_context_id(void)
return 0;
 }
 machine_device_initcall(pseries, reserve_vrma_context_id);
+
+#ifdef CONFIG_DEBUG_FS
+/* debugfs file interface for vpa data */
+static ssize_t vpa_file_read(struct file *filp, char __user *buf, size_t len,
+ loff_t *pos)
+{
+   int cpu = (long)filp->private_data;
+   struct lppaca *lppaca = &lppaca_of(cpu);
+
+   return simple_read_from_buffer(buf, len, pos, lppaca,
+   sizeof(struct lppaca));
+}
+
+static const struct file_operations vpa_fops = {
+   .open   = simple_open,
+   .read   = vpa_file_read,
+   .llseek = default_llseek,
+};
+
+static int __init vpa_debugfs_init(void)
+{
+   char name[16];
+   long i;
+   static struct dentry *vpa_dir;
+
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
+   vpa_dir = debugfs_create_dir("vpa", powerpc_debugfs_root);
+   if (!vpa_dir) {
+   pr_warn("%s: can't create vpa root dir\n", __func__);
+   return -ENOMEM;
+   }
+
+   /* set up the per-cpu vpa file*/
+   for_each_possible_cpu(i) {
+   struct dentry *d;
+
+   sprintf(name, "cpu-%ld", i);
+
+   d = debugfs_create_file(name, 0400, vpa_dir, (void *)i,
+   &vpa_fops);
+   if (!d) {
+   pr_warn("%s: can't create per-cpu vpa file\n",
+   __func__);
+   return -ENOMEM;
+   }
+   }
+
+   return 0;
+}
+machine_arch_initcall(pseries, vpa_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
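
Once merged, the per-CPU files can be consumed from userspace along these lines.
This is a minimal sketch: the debugfs mount point, the "powerpc/vpa/cpu-N" path
and the 640-byte lppaca size are assumptions based on the directory and file
names created above.

#include <stdio.h>

int main(void)
{
	unsigned char buf[640];		/* assumed sizeof(struct lppaca) */
	FILE *f = fopen("/sys/kernel/debug/powerpc/vpa/cpu-0", "rb");
	size_t n;

	if (!f)
		return 1;
	n = fread(buf, 1, sizeof(buf), f);	/* raw VPA bytes for cpu 0 */
	fclose(f);
	printf("read %zu bytes of raw VPA data for cpu 0\n", n);
	return 0;
}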



Re: [PATCH 1/3] powerpc: Split user/kernel definitions of struct pt_regs

2018-10-16 Thread Michael Ellerman
Madhavan Srinivasan  writes:
> On Monday 15 October 2018 04:38 PM, Michael Ellerman wrote:
>> Madhavan Srinivasan  writes:
>>
>>> On Saturday 13 October 2018 04:26 PM, Michael Ellerman wrote:
...
 At the moment they're still identical, and we check that at build
 time. That's because we have code (in ptrace etc.) that assumes that
 they are the same. We will fix that code in future patches, and then
 we can break the strict symmetry between the two structs.
>>> Nice and awesome. But just trying to understand. What will
>>> *regs will point to in the "struct sigcontext".
>>
>> It should always point to a user_pt_regs.
...
>>
>> I think it's not actually broken at the moment, because it's just a
>> pointer, and we don't do anything based on the sizeof() the type.
>
> yes. This clarifies. But still perf/perf_regs.c needs changes.
> Because perf supports dumping user-space regs and interrupt regs.
> Once again, we don't use any sizeof(), but need to handle the
> user_pt_regs changes.
>
> I will have a look at that in the morning.

I did look at that and convinced myself that it was OK, but maybe I'm
wrong :D

My reasoning was that the regs we're using there are always the
in-kernel regs for the process at the point it took the PMU interrupt.
And the regs values aren't exported directly as a struct but rather via
regs_get_register().

But we may still want to change it to make things clearer.

cheers
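
For context, the access pattern being referred to looks roughly like this
(paraphrased from memory of arch/powerpc/perf/perf_regs.c, so treat it as a
sketch rather than the exact code):

u64 perf_reg_value(struct pt_regs *regs, int idx)
{
	if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
		return 0;

	/*
	 * Registers are read one at a time via an offset table rather than
	 * by copying the struct wholesale, so a user/kernel pt_regs split
	 * does not change what gets exported.
	 */
	return regs_get_register(regs, pt_regs_offset[idx]);
}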


Re: [PATCH] driver core: device: add BUS_ATTR_WO macro

2018-10-16 Thread Greg KH
On Tue, Oct 02, 2018 at 09:43:35AM +, Ioana Ciornei wrote:
> > > Add BUS_ATTR_WO macro to make it easier to add attributes without
> > > auditing the mode settings. Also, use the newly added macro where
> > > appropriate.
> > >
> > > Signed-off-by: Ioana Ciornei 
> > > ---
> > >  arch/powerpc/platforms/pseries/ibmebus.c | 12 
> > >  drivers/block/rbd.c  | 48 
> > > 
> > >  drivers/scsi/fcoe/fcoe_sysfs.c   |  4 +--
> > >  drivers/scsi/fcoe/fcoe_transport.c   | 10 +++
> > >  include/linux/device.h   |  2 ++
> > >  include/scsi/libfcoe.h   |  8 +++---
> > >  6 files changed, 43 insertions(+), 41 deletions(-)
> > 
> > Nice!  This duplicates a lot of the work I did back in July but have not 
> > pushed out
> > very far due to the other things that ended up happening around that time:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/log/?h=bus_cleanup
> >  
> > 
> > As the patch series seen at that link shows, you can do this in more places 
> > than
> > just what you did here.
> > 
> > Either way, you should break this up into the individual patches, like I 
> > did or you
> > can take my patches if you want.  Getting the BUS_ATTR_WO() macro added is
> > good to do now, and then you can go and hit up all of the different 
> > subsystems
> > that should be converted over to it.
> 
> I can of course split my patch into individual ones and resubmit them, but as 
> you already have the entire patch set ready, I feel like we can just push 
> those. I looked through your changes and it seems like you covered all the 
> subsystems. Please let me know if there is something else I should do.
> 
> My intention here was to first add the _WO attribute so that afterwards I can 
> add a new bus attribute in the fsl-mc bus.

Ok, I've queued up the patch that adds the _WO attribute now, so that
you should be able to use this after 4.20-rc1.  I'll work on getting my
other patches merged in that series at that time as well.

thanks,

greg k-h
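
For readers who have not seen the macro in question, the pattern follows the
existing BUS_ATTR_RW/BUS_ATTR_RO helpers in include/linux/device.h. The sketch
below shows the shape of the macro and an illustrative (made-up) user; the
"rescan" attribute is not from any real driver.

/* Declares a write-only bus attribute from a plain _store handler. */
#define BUS_ATTR_WO(_name) \
	struct bus_attribute bus_attr_##_name = __ATTR_WO(_name)

static ssize_t rescan_store(struct bus_type *bus, const char *buf, size_t count)
{
	/* ... kick off the bus rescan ... */
	return count;
}
static BUS_ATTR_WO(rescan);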


Re: [PATCH v4 00/18] of: overlay: validation checks, subsequent fixes

2018-10-16 Thread Michael Ellerman
frowand.l...@gmail.com writes:

> From: Frank Rowand 
>
> Add checks to (1) overlay apply process and (2) memory freeing
> triggered by overlay release.  The checks are intended to detect
> possible memory leaks and invalid overlays.
>
> The checks revealed bugs in existing code.  Fixed the bugs.
>
> While fixing bugs, noted other issues, which are fixed in
> separate patches.
>
> *  Powerpc folks: I was not able to test the patches that
> *  directly impact Powerpc systems that use dynamic
> *  devicetree.  Please review that code carefully and
> *  test.  The specific patches are: 03/16, 04/16, 07/16

Hi Frank,

Do you have this series in a git tree somewhere?

I tried applying it on top of linux-next but hit some conflicts which I
couldn't easily resolve.

cheers


Re: linux-next: qemu boot failures with today's linux-next

2018-10-16 Thread Michael Ellerman
Stephen Rothwell  writes:
>> > Booting Linux via __start() @ 0x0040 ...
>> 
>> If you hit Ctrl-a-c you should get the qemu prompt. Then you can run
>> 'info registers' to print the regs and maybe see where it's stuck.
>> 
>> And/or build with EARLY_DEBUG_LPAR to get early console output.
>
> That gave one more line:
>
> [0.00] printk: bootconsole [udbg0] enabled

Yeah I tried it too, it must have crashed *really* early :)

cheers


Patch "KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size" has been added to the 4.18-stable tree

2018-10-16 Thread gregkh


This is a note to let you know that I've just added the patch titled

KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size

to the 4.18-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 
kvm-ppc-book3s-hv-don-t-use-compound_order-to-determine-host-mapping-size.patch
and it can be found in the queue-4.18 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Tue Oct 16 11:10:21 CEST 2018
From: Nicholas Piggin 
Date: Tue, 11 Sep 2018 20:48:34 +1000
Subject: KVM: PPC: Book3S HV: Don't use compound_order to determine host 
mapping size

From: Nicholas Piggin 

[ Upstream commit 71d29f43b6332badc5598c656616a62575e83342 ]

THP paths can defer splitting compound pages until after the actual
remap and TLB flushes to split a huge PMD/PUD. This causes radix
partition scope page table mappings to get out of synch with the host
qemu page table mappings.

This results in random memory corruption in the guest when running
with THP. The easiest way to reproduce is use KVM balloon to free up
a lot of memory in the guest and then shrink the balloon to give the
memory back, while some work is being done in the guest.

Cc: David Gibson 
Cc: "Aneesh Kumar K.V" 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   91 +
 1 file changed, 37 insertions(+), 54 deletions(-)

--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -538,8 +538,8 @@ int kvmppc_book3s_radix_page_fault(struc
   unsigned long ea, unsigned long dsisr)
 {
struct kvm *kvm = vcpu->kvm;
-   unsigned long mmu_seq, pte_size;
-   unsigned long gpa, gfn, hva, pfn;
+   unsigned long mmu_seq;
+   unsigned long gpa, gfn, hva;
struct kvm_memory_slot *memslot;
struct page *page = NULL;
long ret;
@@ -636,9 +636,10 @@ int kvmppc_book3s_radix_page_fault(struc
 */
hva = gfn_to_hva_memslot(memslot, gfn);
	if (upgrade_p && __get_user_pages_fast(hva, 1, 1, &page) == 1) {
-   pfn = page_to_pfn(page);
upgrade_write = true;
} else {
+   unsigned long pfn;
+
/* Call KVM generic code to do the slow-path check */
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
   writing, upgrade_p);
@@ -652,63 +653,45 @@ int kvmppc_book3s_radix_page_fault(struc
}
}
 
-   /* See if we can insert a 1GB or 2MB large PTE here */
-   level = 0;
-   if (page && PageCompound(page)) {
-   pte_size = PAGE_SIZE << compound_order(compound_head(page));
-   if (pte_size >= PUD_SIZE &&
-   (gpa & (PUD_SIZE - PAGE_SIZE)) ==
-   (hva & (PUD_SIZE - PAGE_SIZE))) {
-   level = 2;
-   pfn &= ~((PUD_SIZE >> PAGE_SHIFT) - 1);
-   } else if (pte_size >= PMD_SIZE &&
-  (gpa & (PMD_SIZE - PAGE_SIZE)) ==
-  (hva & (PMD_SIZE - PAGE_SIZE))) {
-   level = 1;
-   pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1);
-   }
-   }
-
/*
-* Compute the PTE value that we need to insert.
+* Read the PTE from the process' radix tree and use that
+* so we get the shift and attribute bits.
 */
-   if (page) {
-   pgflags = _PAGE_READ | _PAGE_EXEC | _PAGE_PRESENT | _PAGE_PTE |
-   _PAGE_ACCESSED;
-   if (writing || upgrade_write)
-   pgflags |= _PAGE_WRITE | _PAGE_DIRTY;
-   pte = pfn_pte(pfn, __pgprot(pgflags));
+   local_irq_disable();
+   ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
+   pte = *ptep;
+   local_irq_enable();
+
+   /* Get pte level from shift/size */
+   if (shift == PUD_SHIFT &&
+   (gpa & (PUD_SIZE - PAGE_SIZE)) ==
+   (hva & (PUD_SIZE - PAGE_SIZE))) {
+   level = 2;
+   } else if (shift == PMD_SHIFT &&
+  (gpa & (PMD_SIZE - PAGE_SIZE)) ==
+  (hva & (PMD_SIZE - PAGE_SIZE))) {
+   level = 1;
} else {
-   /*
-* Read the PTE from the process' radix tree and use that
-* so we get the attribute bits.
-*/
-   local_irq_disable();
-   ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-   pte = *ptep;
-   local_irq_enable();
-   if (shift == PUD_SHIFT &&
-

Re: [PATCH V3 1/2] mm: Add get_user_pages_cma_migrate

2018-10-16 Thread Aneesh Kumar K.V
Alexey Kardashevskiy  writes:

> On 16/10/2018 18:16, Aneesh Kumar K.V wrote:
>> Alexey Kardashevskiy  writes:
>> 
>>> +   }
 +  }
 +  }
 +  if (!list_empty(_page_list)) {
 +  /*
 +   * drop the above get_user_pages reference.
 +   */
>
>
> btw, can these pages be used by somebody else in this short window
> before we migrated and pinned them?

isolate_lru_page() makes sure we remove the pages from the LRU list, so LRU
walkers won't find them. If somebody happens to increment the page
reference count in that window, migrate_pages() will fail. That is
handled via migrate_page_move_mapping() returning -EAGAIN

>
>
 +  for (i = 0; i < ret; ++i)
 +  put_page(pages[i]);
 +
 +  if (migrate_pages(_page_list, new_non_cma_page,
 +NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
 +  /*
 +   * some of the pages failed migration. Do get_user_pages
 +   * without migration.
 +   */
 +  migrate_allow = false;
>>>
>>>
>>> migrate_allow seems useless, simply calling get_user_pages_fast() should
>>> make the code easier to read imho. And the comment says
>>> get_user_pages(), where does this guy hide?
>> 
>> I didn't get that suggestion. What we want to do here is try to migrate 
>> pages as
>> long as we find CMA pages in the result of get_user_pages_fast. If we
>> failed any migration attempt, don't try to migrate again.
>
>
> Setting migrate_allow to false here means you jump up, call
> get_user_pages_fast() and then run the loop which will do nothing just
> because if(...migrate_allow) is false. Instead of jumping up you could
> just call get_user_pages_fast().

ok, that is coding preference I guess, I prefer to avoid multiple
get_user_pages_fast() calls there. Since we dropped the page reference, we need
to _go back_ and take the page reference again without attempting to migrate. That
is the way I was looking at this.

>
> btw what is migrate_pages() leaves something in cma_page_list (I cannot
> see it removing pages)? Won't it loop indefinitely?
>

putback_movable_pages take care of that. The below hunk.

if (!list_empty(&cma_page_list))
	putback_movable_pages(&cma_page_list);

-aneesh
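
To make the flow being discussed easier to follow, here is a condensed sketch of
the helper under review. It is simplified from the patch: the THP/hugetlb
head-page handling and most error paths are elided, so treat it as an outline
rather than the submitted code.

int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int write,
			       struct page **pages)
{
	LIST_HEAD(cma_page_list);
	bool migrate_allow = true;
	int i, ret;

get_user_again:
	ret = get_user_pages_fast(start, nr_pages, write, pages);
	if (ret <= 0)
		return ret;

	/* Collect any pinned pages that live in the CMA area. */
	for (i = 0; i < ret; i++) {
		if (migrate_allow && is_migrate_cma_page(pages[i]) &&
		    !isolate_lru_page(pages[i]))
			list_add_tail(&pages[i]->lru, &cma_page_list);
	}

	if (!list_empty(&cma_page_list)) {
		/* Drop the gup references before migrating. */
		for (i = 0; i < ret; i++)
			put_page(pages[i]);

		if (migrate_pages(&cma_page_list, new_non_cma_page, NULL, 0,
				  MIGRATE_SYNC, MR_CONTIG_RANGE)) {
			/* Some pages failed to move; pin again without migrating. */
			migrate_allow = false;
		}
		/* Pages that were not migrated go back on the LRU. */
		if (!list_empty(&cma_page_list))
			putback_movable_pages(&cma_page_list);

		goto get_user_again;
	}

	return ret;
}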



Re: [PATCH kernel v2] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2

2018-10-16 Thread Alexey Kardashevskiy



On 16/10/2018 18:32, Alistair Popple wrote:
> On Tuesday, 16 October 2018 1:22:53 PM AEDT Alexey Kardashevskiy wrote:
>>
>> On 16/10/2018 13:19, Alistair Popple wrote:
 reset_ntl() does what npu2_dev_procedure_reset() does plus more stuff,
 there nothing really in npu2_dev_procedure_reset() which reset_ntl()
 does not do already from the hardware standpoint. And it did stop HMIs
 for me though.

 but ok, what will be sufficient then if not reset_ntl()?
>>>
>>> Argh, yes you are correct. Specifically both npu2_dev_procedure_reset() and
>>> reset_ntl() contain:
>>>
>>> /* NTL Reset */
>>> val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
>>> val |= PPC_BIT(8) | PPC_BIT(9);
>>> npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
>>>
>>> Which should fence the brick. However from what I recall there was more to
>>> reliably preventing HMIs than merely fencing the brick. It invovled a 
>>> sequence
>>> of fencing and flushing the cache with dcbf instructions at the right time 
>>> which
>>> is why we also have the FLR. Unfortunately I don't know the precise details,
>>> perhaps if we send enough coffee Balbir's way he might be able remind us?
>>
>>
>> He suggested and ack'ed that skiboot patch, I can repeat beers^wcoffee
>> but it won't change much ;)
> 
> Ha. Couldn't hurt ;)
> 
> I was pretty sure flushing the caches was an important part of the sequence to
> avoid HMI's. I believe you are trying to deal with unexpected guest 
> terminations
> which means the driver won't have a chance to flush the caches prior to
> termination so

Correct.

> wouldn't you also need to do that somewhere? Unless the driver
> does it at startup?


VFIO performs a GPU reset so I'd expect the GPUs to flush their caches
without any software interaction. Am I hoping for too much here?




> 
> - Alistair
> 
>>>
>>> - Alistair
>>>


>
> - Alistair
>
>>
>>
>>
>>>
>>> - Alistair
>>>
>>> On Monday, 15 October 2018 6:17:51 PM AEDT Alexey Kardashevskiy wrote:
 Ping?


 On 02/10/2018 13:20, Alexey Kardashevskiy wrote:
> The skiboot firmware has a hot reset handler which fences the NVIDIA 
> V100
> GPU RAM on Witherspoons and makes accesses no-op instead of throwing 
> HMIs:
> https://github.com/open-power/skiboot/commit/fca2b2b839a67
>
> Now we are going to pass V100 via VFIO which most certainly involves
> KVM guests which are often terminated without getting a chance to 
> offline
> GPU RAM so we end up with a running machine with misconfigured memory.
> Accessing this memory produces hardware management interrupts (HMI)
> which bring the host down.
>
> To suppress HMIs, this wires up this hot reset hook to 
> vfio_pci_disable()
> via pci_disable_device() which switches NPU2 to a safe mode and 
> prevents
> HMIs.
>
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * updated the commit log
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index cde7102..e37b9cc 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3688,6 +3688,15 @@ static void pnv_pci_release_device(struct 
> pci_dev *pdev)
>   pnv_ioda_release_pe(pe);
>  }
>  
> +static void pnv_npu_disable_device(struct pci_dev *pdev)
> +{
> + struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> + struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> +
> + if (eehpe && eeh_ops && eeh_ops->reset)
> + eeh_ops->reset(eehpe, EEH_RESET_HOT);
> +}
> +
>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>  {
>   struct pnv_phb *phb = hose->private_data;
> @@ -3732,6 +3741,7 @@ static const struct pci_controller_ops 
> pnv_npu_ioda_controller_ops = {
>   .reset_secondary_bus= pnv_pci_reset_secondary_bus,
>   .dma_set_mask   = pnv_npu_dma_set_mask,
>   .shutdown   = pnv_pci_ioda_shutdown,
> + .disable_device = pnv_npu_disable_device,
>  };
>  
>  static const struct pci_controller_ops 
> pnv_npu_ocapi_ioda_controller_ops = {
>


>>>
>>>
>>
>>
>
>


>>>
>>>
>>
>>
> 
> 

-- 
Alexey


Re: [PATCH V3 1/2] mm: Add get_user_pages_cma_migrate

2018-10-16 Thread Alexey Kardashevskiy



On 16/10/2018 18:16, Aneesh Kumar K.V wrote:
> Alexey Kardashevskiy  writes:
> 
>> On 18/09/2018 21:58, Aneesh Kumar K.V wrote:
>>> This helper does a get_user_pages_fast and if it find pages in the CMA area
>>> it will try to migrate them before taking page reference. This makes sure 
>>> that
>>> we don't keep non-movable pages (due to page reference count) in the CMA 
>>> area.
>>> Not able to move pages out of CMA area result in CMA allocation failures.
>>>
>>> Signed-off-by: Aneesh Kumar K.V 
>>> ---
>>>  include/linux/hugetlb.h |   2 +
>>>  include/linux/migrate.h |   3 +
>>>  mm/hugetlb.c|   4 +-
>>>  mm/migrate.c| 132 
>>>  4 files changed, 139 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 6b68e345f0ca..1abccb1a1ecc 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -357,6 +357,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, 
>>> int preferred_nid,
>>> nodemask_t *nmask);
>>>  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct 
>>> *vma,
>>> unsigned long address);
>>> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
>>> +int nid, nodemask_t *nmask);
>>>  int huge_add_to_page_cache(struct page *page, struct address_space 
>>> *mapping,
>>> pgoff_t idx);
>>>  
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index f2b4abbca55e..d82b35afd2eb 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -286,6 +286,9 @@ static inline int migrate_vma(const struct 
>>> migrate_vma_ops *ops,
>>>  }
>>>  #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
>>>  
>>> +extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, 
>>> int write,
>>> + struct page **pages);
>>> +
>>>  #endif /* CONFIG_MIGRATION */
>>>  
>>>  #endif /* _LINUX_MIGRATE_H */
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 3c21775f196b..1abbfcb84f66 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1585,8 +1585,8 @@ static struct page *alloc_surplus_huge_page(struct 
>>> hstate *h, gfp_t gfp_mask,
>>> return page;
>>>  }
>>>  
>>> -static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t 
>>> gfp_mask,
>>> -   int nid, nodemask_t *nmask)
>>> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
>>> +int nid, nodemask_t *nmask)
>>>  {
>>> struct page *page;
>>>  
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index d6a2e89b086a..2f92534ea7a1 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -3006,3 +3006,135 @@ int migrate_vma(const struct migrate_vma_ops *ops,
>>>  }
>>>  EXPORT_SYMBOL(migrate_vma);
>>>  #endif /* defined(MIGRATE_VMA_HELPER) */
>>> +
>>> +static struct page *new_non_cma_page(struct page *page, unsigned long 
>>> private)
>>> +{
>>> +   /*
>>> +* We want to make sure we allocate the new page from the same node
>>> +* as the source page.
>>> +*/
>>> +   int nid = page_to_nid(page);
>>> +   gfp_t gfp_mask = GFP_USER | __GFP_THISNODE;
>>> +
>>> +   if (PageHighMem(page))
>>> +   gfp_mask |= __GFP_HIGHMEM;
>>> +
>>> +   if (PageTransHuge(page)) {
>>> +   struct page *thp;
>>> +   gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_THISNODE;
>>> +
>>> +   /*
>>> +* Remove the movable mask so that we don't allocate from
>>> +* CMA area again.
>>> +*/
>>> +   thp_gfpmask &= ~__GFP_MOVABLE;
>>> +   thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
>>
>>
>> HPAGE_PMD_ORDER is 2MB or 1GB? THP are always that PMD order?
> 
> 2M or 16M. THP is at PMD level.
> 
>>
>>
>>> +   if (!thp)
>>> +   return NULL;
>>> +   prep_transhuge_page(thp);
>>> +   return thp;
>>> +
>>> +#ifdef  CONFIG_HUGETLB_PAGE
>>> +   } else if (PageHuge(page)) {
>>> +
>>> +   struct hstate *h = page_hstate(page);
>>> +   /*
>>> +* We don't want to dequeue from the pool because pool pages 
>>> will
>>> +* mostly be from the CMA region.
>>> +*/
>>> +   return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
>>> +#endif
>>> +   }
>>> +
>>> +   return __alloc_pages_node(nid, gfp_mask, 0);
>>> +}
>>> +
>>> +/**
>>> + * get_user_pages_cma_migrate() - pin user pages in memory by migrating 
>>> pages in CMA region
>>> + * @start: starting user address
>>> + * @nr_pages:  number of pages from start to pin
>>> + * @write: whether pages will be written to
>>> + * @pages: array that receives pointers to the pages pinned.
>>> + * Should be at least nr_pages long.
>>> + *
>>> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
>>> + * If 

Re: [PATCH] powerpc/pseries: Export maximum memory value

2018-10-16 Thread Aravinda Prasad



On Wednesday 10 October 2018 10:19 PM, Naveen N. Rao wrote:
> Nathan Fontenot wrote:
>> On 10/10/2018 05:22 AM, Aravinda Prasad wrote:
>>> This patch exports the maximum possible amount of memory
>>> configured on the system via /proc/powerpc/lparcfg.
>>>
>>> Signed-off-by: Aravinda Prasad 
>>> ---
>>>  arch/powerpc/platforms/pseries/lparcfg.c |   13 +
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/lparcfg.c
>>> b/arch/powerpc/platforms/pseries/lparcfg.c
>>> index 7c872dc..aa82f55 100644
>>> --- a/arch/powerpc/platforms/pseries/lparcfg.c
>>> +++ b/arch/powerpc/platforms/pseries/lparcfg.c
>>> @@ -26,6 +26,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -36,6 +37,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>
>>>  #include "pseries.h"
>>>
>>> @@ -433,6 +435,16 @@ static void parse_em_data(struct seq_file *m)
>>>  seq_printf(m, "power_mode_data=%016lx\n", retbuf[0]);
>>>  }
>>>
>>> +static void maxmem_data(struct seq_file *m)
>>> +{
>>> +    unsigned long maxmem = 0;
>>> +
>>> +    maxmem += drmem_info->n_lmbs * drmem_info->lmb_size;
>>> +    maxmem += hugetlb_total_pages() * PAGE_SIZE;
>>> +
>>> +    seq_printf(m, "MaxMem=%ld\n", maxmem);
>>
>> Should this be MaxPossibleMem?
>>
>> At least for the drmem memory the value calculated is the maximum
>> possible
>> memory. I wonder if calling it MaxMem would lead users to think they have
>> that much memory available to them.
> 
> That's a good point. This seems to be referred to as just 'maximum
> memory' in the LPAR configuration as well as in the lparstat
> documentation, but it shouldn't hurt to rename it here.

As Naveen mentioned, I am using the term referred to in the LPAR
configuration. But I can rename it.

Regards,
Aravinda

> 
> - Naveen
> 

-- 
Regards,
Aravinda
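
For clarity, the rename being discussed above would amount to something like the
following (the final name is still up to review, so this is illustrative only):

	seq_printf(m, "MaxPossibleMem=%ld\n", maxmem);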



Re: [PATCH kernel v2] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2

2018-10-16 Thread Alistair Popple
On Tuesday, 16 October 2018 1:22:53 PM AEDT Alexey Kardashevskiy wrote:
> 
> On 16/10/2018 13:19, Alistair Popple wrote:
> >> reset_ntl() does what npu2_dev_procedure_reset() does plus more stuff,
> >> there nothing really in npu2_dev_procedure_reset() which reset_ntl()
> >> does not do already from the hardware standpoint. And it did stop HMIs
> >> for me though.
> >>
> >> but ok, what will be sufficient then if not reset_ntl()?
> > 
> > Argh, yes you are correct. Specifically both npu2_dev_procedure_reset() and
> > reset_ntl() contain:
> > 
> > /* NTL Reset */
> > val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> > val |= PPC_BIT(8) | PPC_BIT(9);
> > npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> > 
> > Which should fence the brick. However from what I recall there was more to
> > reliably preventing HMIs than merely fencing the brick. It invovled a 
> > sequence
> > of fencing and flushing the cache with dcbf instructions at the right time 
> > which
> > is why we also have the FLR. Unfortunately I don't know the precise details,
> > perhaps if we send enough coffee Balbir's way he might be able remind us?
> 
> 
> He suggested and ack'ed that skiboot patch, I can repeat beers^wcoffee
> but it won't change much ;)

Ha. Couldn't hurt ;)

I was pretty sure flushing the caches was an important part of the sequence to
avoid HMI's. I believe you are trying to deal with unexpected guest terminations
which means the driver won't have a chance to flush the caches prior to
termination so wouldn't you also need to do that somewhere? Unless the driver
does it at startup?

- Alistair

> >
> > - Alistair
> > 
> >>
> >>
> >>>
> >>> - Alistair
> >>>
> 
> 
> 
> >
> > - Alistair
> >
> > On Monday, 15 October 2018 6:17:51 PM AEDT Alexey Kardashevskiy wrote:
> >> Ping?
> >>
> >>
> >> On 02/10/2018 13:20, Alexey Kardashevskiy wrote:
> >>> The skiboot firmware has a hot reset handler which fences the NVIDIA 
> >>> V100
> >>> GPU RAM on Witherspoons and makes accesses no-op instead of throwing 
> >>> HMIs:
> >>> https://github.com/open-power/skiboot/commit/fca2b2b839a67
> >>>
> >>> Now we are going to pass V100 via VFIO which most certainly involves
> >>> KVM guests which are often terminated without getting a chance to 
> >>> offline
> >>> GPU RAM so we end up with a running machine with misconfigured memory.
> >>> Accessing this memory produces hardware management interrupts (HMI)
> >>> which bring the host down.
> >>>
> >>> To suppress HMIs, this wires up this hot reset hook to 
> >>> vfio_pci_disable()
> >>> via pci_disable_device() which switches NPU2 to a safe mode and 
> >>> prevents
> >>> HMIs.
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy 
> >>> ---
> >>> Changes:
> >>> v2:
> >>> * updated the commit log
> >>> ---
> >>>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++
> >>>  1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> >>> b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>> index cde7102..e37b9cc 100644
> >>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>> @@ -3688,6 +3688,15 @@ static void pnv_pci_release_device(struct 
> >>> pci_dev *pdev)
> >>>   pnv_ioda_release_pe(pe);
> >>>  }
> >>>  
> >>> +static void pnv_npu_disable_device(struct pci_dev *pdev)
> >>> +{
> >>> + struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> >>> + struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> >>> +
> >>> + if (eehpe && eeh_ops && eeh_ops->reset)
> >>> + eeh_ops->reset(eehpe, EEH_RESET_HOT);
> >>> +}
> >>> +
> >>>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
> >>>  {
> >>>   struct pnv_phb *phb = hose->private_data;
> >>> @@ -3732,6 +3741,7 @@ static const struct pci_controller_ops 
> >>> pnv_npu_ioda_controller_ops = {
> >>>   .reset_secondary_bus= pnv_pci_reset_secondary_bus,
> >>>   .dma_set_mask   = pnv_npu_dma_set_mask,
> >>>   .shutdown   = pnv_pci_ioda_shutdown,
> >>> + .disable_device = pnv_npu_disable_device,
> >>>  };
> >>>  
> >>>  static const struct pci_controller_ops 
> >>> pnv_npu_ocapi_ioda_controller_ops = {
> >>>
> >>
> >>
> >
> >
> 
> 
> >>>
> >>>
> >>
> >>
> > 
> > 
> 
> 




Re: [PATCH V3 1/2] mm: Add get_user_pages_cma_migrate

2018-10-16 Thread Aneesh Kumar K.V
Alexey Kardashevskiy  writes:

> On 18/09/2018 21:58, Aneesh Kumar K.V wrote:
>> This helper does a get_user_pages_fast and if it find pages in the CMA area
>> it will try to migrate them before taking page reference. This makes sure 
>> that
>> we don't keep non-movable pages (due to page reference count) in the CMA 
>> area.
>> Not able to move pages out of CMA area result in CMA allocation failures.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  include/linux/hugetlb.h |   2 +
>>  include/linux/migrate.h |   3 +
>>  mm/hugetlb.c|   4 +-
>>  mm/migrate.c| 132 
>>  4 files changed, 139 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 6b68e345f0ca..1abccb1a1ecc 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -357,6 +357,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, 
>> int preferred_nid,
>>  nodemask_t *nmask);
>>  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct 
>> *vma,
>>  unsigned long address);
>> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
>> + int nid, nodemask_t *nmask);
>>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>>  pgoff_t idx);
>>  
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index f2b4abbca55e..d82b35afd2eb 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -286,6 +286,9 @@ static inline int migrate_vma(const struct 
>> migrate_vma_ops *ops,
>>  }
>>  #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
>>  
>> +extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, 
>> int write,
>> +  struct page **pages);
>> +
>>  #endif /* CONFIG_MIGRATION */
>>  
>>  #endif /* _LINUX_MIGRATE_H */
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 3c21775f196b..1abbfcb84f66 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1585,8 +1585,8 @@ static struct page *alloc_surplus_huge_page(struct 
>> hstate *h, gfp_t gfp_mask,
>>  return page;
>>  }
>>  
>> -static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t 
>> gfp_mask,
>> -int nid, nodemask_t *nmask)
>> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
>> + int nid, nodemask_t *nmask)
>>  {
>>  struct page *page;
>>  
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index d6a2e89b086a..2f92534ea7a1 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -3006,3 +3006,135 @@ int migrate_vma(const struct migrate_vma_ops *ops,
>>  }
>>  EXPORT_SYMBOL(migrate_vma);
>>  #endif /* defined(MIGRATE_VMA_HELPER) */
>> +
>> +static struct page *new_non_cma_page(struct page *page, unsigned long 
>> private)
>> +{
>> +/*
>> + * We want to make sure we allocate the new page from the same node
>> + * as the source page.
>> + */
>> +int nid = page_to_nid(page);
>> +gfp_t gfp_mask = GFP_USER | __GFP_THISNODE;
>> +
>> +if (PageHighMem(page))
>> +gfp_mask |= __GFP_HIGHMEM;
>> +
>> +if (PageTransHuge(page)) {
>> +struct page *thp;
>> +gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_THISNODE;
>> +
>> +/*
>> + * Remove the movable mask so that we don't allocate from
>> + * CMA area again.
>> + */
>> +thp_gfpmask &= ~__GFP_MOVABLE;
>> +thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
>
>
> HPAGE_PMD_ORDER is 2MB or 1GB? THP are always that PMD order?

2M or 16M. THP is at PMD level.

>
>
>> +if (!thp)
>> +return NULL;
>> +prep_transhuge_page(thp);
>> +return thp;
>> +
>> +#ifdef  CONFIG_HUGETLB_PAGE
>> +} else if (PageHuge(page)) {
>> +
>> +struct hstate *h = page_hstate(page);
>> +/*
>> + * We don't want to dequeue from the pool because pool pages 
>> will
>> + * mostly be from the CMA region.
>> + */
>> +return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
>> +#endif
>> +}
>> +
>> +return __alloc_pages_node(nid, gfp_mask, 0);
>> +}
>> +
>> +/**
>> + * get_user_pages_cma_migrate() - pin user pages in memory by migrating 
>> pages in CMA region
>> + * @start:  starting user address
>> + * @nr_pages:   number of pages from start to pin
>> + * @write:  whether pages will be written to
>> + * @pages:  array that receives pointers to the pages pinned.
>> + *  Should be at least nr_pages long.
>> + *
>> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
>> + * If not successful, it will fall back to taking the lock and
>> + * calling get_user_pages().
>
>
> I do not see any locking or get_user_pages(), hidden