Re: [PATCH kernel v3 0/6] powerpc/powernv/iommu: Optimize memory use

2018-07-12 Thread Alexey Kardashevskiy
On Wed,  4 Jul 2018 16:13:43 +1000
Alexey Kardashevskiy  wrote:

> This patchset aims to reduce actual memory use for guests with
> sparse memory. The pseries guest uses dynamic DMA windows to map
> the entire guest RAM but it only actually maps onlined memory,
> which may not be contiguous. I hit this when I tried passing
> through the NVLink2-connected GPU RAM of an NVIDIA V100: having to
> map this RAM at the same offset as in the real hardware
> forced me to rework how I handle these windows.
> 
> This moves the userspace-to-host-physical translation table
> (iommu_table::it_userspace) from the VFIO TCE IOMMU subdriver to
> the platform code and reuses the already existing multilevel
> TCE table code which we have for the hardware tables.
> Finally, in 6/6 I switch to on-demand allocation so we do not
> allocate huge chunks of the table if we do not have to;
> there is some math in 6/6.
> 
> Changes:
> v3:
> * rebased on v4.18-rc3 and fixed compile error in 6/6
> 
> v2:
> * bugfix and error handling in 6/6
> 
> 
> This is based on sha1
> 021c917 Linus Torvalds "Linux 4.18-rc3".
> 
> Please comment. Thanks.


Ping?

> 
> 
> 
> Alexey Kardashevskiy (6):
>   powerpc/powernv: Remove useless wrapper
>   powerpc/powernv: Move TCE manipulation code to its own file
>   KVM: PPC: Make iommu_table::it_userspace big endian
>   powerpc/powernv: Add indirect levels to it_userspace
>   powerpc/powernv: Rework TCE level allocation
>   powerpc/powernv/ioda: Allocate indirect TCE levels on demand
> 
>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>  arch/powerpc/include/asm/iommu.h  |  11 +-
>  arch/powerpc/platforms/powernv/pci.h  |  44 ++-
>  arch/powerpc/kvm/book3s_64_vio.c  |  11 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c   |  18 +-
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 399 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 184 ++--
>  arch/powerpc/platforms/powernv/pci.c  | 158 --
>  drivers/vfio/vfio_iommu_spapr_tce.c   |  65 +
>  9 files changed, 478 insertions(+), 414 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c



--
Alexey


[PATCH 13/18] ibmvscsi: change strncpy+truncation to strlcpy

2018-07-12 Thread Dominique Martinet
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet 
---

Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the
first patch of the series) for the motivation behind this patch

 drivers/scsi/ibmvscsi/ibmvscsi.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 17df76f0be3c..79eb8af03a19 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -1274,14 +1274,12 @@ static void send_mad_capabilities(struct ibmvscsi_host_data *hostdata)
if (hostdata->client_migrated)
hostdata->caps.flags |= cpu_to_be32(CLIENT_MIGRATED);
 
-   strncpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
+   strlcpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
sizeof(hostdata->caps.name));
-   hostdata->caps.name[sizeof(hostdata->caps.name) - 1] = '\0';
 
location = of_get_property(of_node, "ibm,loc-code", NULL);
location = location ? location : dev_name(hostdata->dev);
-   strncpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
-   hostdata->caps.loc[sizeof(hostdata->caps.loc) - 1] = '\0';
+   strlcpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
 
req->common.type = cpu_to_be32(VIOSRP_CAPABILITIES_TYPE);
req->buffer = cpu_to_be64(hostdata->caps_addr);
-- 
2.17.1



Re: [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-07-12 Thread Nicholas Piggin
On Wed, 11 Jul 2018 21:00:44 +1000
Alexey Kardashevskiy  wrote:

> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
> and have not hit this yet: the very first time the mapping happens we
> do not have tbl::it_userspace allocated yet and fall back to userspace,
> which in turn calls the VFIO IOMMU driver; this fails and the guest
> does not retry.
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
> 
> This calculates maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v6:
> * replaced hugetlbfs with pageshift from find_linux_pte()
> 
> v5:
> * only consider compound pages from hugetlbfs
> 
> v4:
> * reimplemented max pageshift calculation
> 
> v3:
> * fixed upper limit for the page size
> * added checks that we don't register parts of a huge page
> 
> v2:
> * explicitly check for compound pages before calling compound_order()
> 
> ---
> The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> for IOMMU pages without checking the mmu pagesize and this will fail
> at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> 
> With the change, mapping will fail in KVM and the guest will print:
> 
> mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 18 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> mlx5_core :00:00.0: created tce table LIOBN 0x8001 for /pci@8002000/ethernet@0
> mlx5_core :00:00.0: failed to map direct window for /pci@8002000/ethernet@0: -1
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++--
>  arch/powerpc/kvm/book3s_64_vio.c   |  2 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c|  6 --
>  arch/powerpc/mm/mmu_context_iommu.c| 39 --
>  drivers/vfio/vfio_iommu_spapr_tce.c|  2 +-
>  5 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 896efa5..79d570c 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>   unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> - unsigned long ua, unsigned long *hpa);
> + unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> - unsigned long ua, unsigned long *hpa);
> + unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index d066e37..8c456fa 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>   /* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>   return H_TOO_HARD;
>  
> - if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> + if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
>   return H_HARDWARE;
>  
>   if 

[PATCH] powerpc/Makefile: Assemble with -me500 when building for E500

2018-07-12 Thread James Clarke
Some of the assembly files use instructions specific to BookE or E500,
which are rejected with the now-default -mcpu=powerpc, so we must pass
-me500 to the assembler just as we pass -me200 for E200.

Fixes: 4bf4f42a2feb ("powerpc/kbuild: Set default generic machine type for 32-bit compile")
Signed-off-by: James Clarke 
---
 arch/powerpc/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 2ea575cb..fb96206d 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -243,6 +243,7 @@ endif
 cpu-as-$(CONFIG_4xx)   += -Wa,-m405
 cpu-as-$(CONFIG_ALTIVEC)   += $(call as-option,-Wa$(comma)-maltivec)
 cpu-as-$(CONFIG_E200)  += -Wa,-me200
+cpu-as-$(CONFIG_E500)  += -Wa,-me500
 cpu-as-$(CONFIG_PPC_BOOK3S_64) += -Wa,-mpower4
 cpu-as-$(CONFIG_PPC_E500MC)+= $(call as-option,-Wa$(comma)-me500mc)
 
-- 
2.18.0



Re: [next-20180711][Oops] linux-next kernel boot is broken on powerpc

2018-07-12 Thread Pavel Tatashin
> Could the related commit be one of the below? I see lots of mm-related
> patches and could not bisect
>
> 5479976fda7d3ab23ba0a4eb4d60b296eb88b866 mm: page_alloc: restore memblock_next_valid_pfn() on arm/arm64
> 41619b27b5696e7e5ef76d9c692dd7342c1ad7eb mm-drop-vm_bug_on-from-__get_free_pages-fix
> 531bbe6bd2721f4b66cdb0f5cf5ac14612fa1419 mm: drop VM_BUG_ON from __get_free_pages
> 479350dd1a35f8bfb2534697e5ca68ee8a6e8dea mm, page_alloc: actually ignore mempolicies for high priority allocations
> 088018f6fe571444caaeb16e84c9f24f22dfc8b0 mm: skip invalid pages block at a time in zero_resv_unresv()

Looks like:
0ba29a108979 mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER

This patch is going to be reverted from linux-next. Abdul, please
verify that the issue is gone once you revert this patch.

Thank you,
Pavel


[PATCH] powerpc/prom_init: remove linux,stdout-package property

2018-07-12 Thread Murilo Opsfelder Araujo
This property was added in 2004 by


https://github.com/mpe/linux-fullhistory/commit/689fe5072fe9a0dec914bfa4fa60aed1e54563e6

and the only use of it, which was already inside `#if 0`, was removed a month
later by


https://github.com/mpe/linux-fullhistory/commit/1fbe5a6d90f6cd4ea610737ef488719d1a875de7

Fixes: https://github.com/linuxppc/linux/issues/125
Signed-off-by: Murilo Opsfelder Araujo 
---
 arch/powerpc/kernel/prom_init.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5425dd3d6a9f..c45fb463c9e5 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2102,8 +2102,6 @@ static void __init prom_init_stdout(void)
stdout_node = call_prom("instance-to-package", 1, 1, prom.stdout);
if (stdout_node != PROM_ERROR) {
val = cpu_to_be32(stdout_node);
-   prom_setprop(prom.chosen, "/chosen", "linux,stdout-package",
-, sizeof(val));
 
/* If it's a display, note it */
memset(type, 0, sizeof(type));
-- 
2.17.1



Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-12 Thread Pavel Tatashin
On Thu, Jul 12, 2018 at 5:50 AM Oscar Salvador
 wrote:
>
> > > I just roughly check, but if I checked the right place,
> > > vmemmap_populated() checks for the section to contain the flags we are
> > > setting in sparse_init_one_section().
> >
> > Yes.
> >
> > > But with this patch, we populate first everything, and then we call
> > > sparse_init_one_section() in sparse_init().
> > > As I said I could be mistaken because I just checked the surface.

Yes, this is right: sparse_init_one_section() is needed after every
populate call on ppc64. I am adding this to my sparse_init rewrite;
it actually simplifies the code, as it avoids one extra loop, and
makes ppc64 work.

Pavel


Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-07-12 Thread Michal Suchánek
On Tue, 3 Jul 2018 08:08:14 +1000
"Nicholas Piggin"  wrote:

> On Mon, 02 Jul 2018 11:17:06 +0530
> Mahesh J Salgaonkar  wrote:
> 
> > From: Mahesh Salgaonkar 
> > 
> > On pseries, as of today the system crashes if we get machine check
> > exceptions due to SLB errors. These are soft errors and can be
> > fixed by flushing the SLBs so the kernel can continue to function
> > instead of crashing. We do this in real mode before turning on the
> > MMU. Otherwise we would run into nested machine checks. This patch
> > now fetches the rtas error log in real mode and flushes the SLBs on
> > SLB errors.
> > 
> > Signed-off-by: Mahesh Salgaonkar 
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
> >  arch/powerpc/include/asm/machdep.h|1 
> >  arch/powerpc/kernel/exceptions-64s.S  |   42 +
> >  arch/powerpc/kernel/mce.c |   16 +++-
> >  arch/powerpc/mm/slb.c |6 +++
> >  arch/powerpc/platforms/powernv/opal.c |1 
> >  arch/powerpc/platforms/pseries/pseries.h  |1 
> >  arch/powerpc/platforms/pseries/ras.c  |   51 +
> >  arch/powerpc/platforms/pseries/setup.c|1 
> >  9 files changed, 116 insertions(+), 4 deletions(-)
> 
> 
> > +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
> > +BEGIN_FTR_SECTION
> > +   EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> > +   mr  r10,r1  /* Save r1 */
> > +   ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
> > +   subi    r1,r1,INT_FRAME_SIZE    /* alloc stack frame */
> > +   mfspr   r11,SPRN_SRR0   /* Save SRR0 */
> > +   mfspr   r12,SPRN_SRR1   /* Save SRR1 */
> > +   EXCEPTION_PROLOG_COMMON_1()
> > +   EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
> > +   EXCEPTION_PROLOG_COMMON_3(0x200)
> > +   addi    r3,r1,STACK_FRAME_OVERHEAD
> > +   BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */  
> 
> Is there any reason you can't use the existing
> machine_check_powernv_early code to do all this?
> 
> > diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> > index efdd16a79075..221271c96a57 100644
> > --- a/arch/powerpc/kernel/mce.c
> > +++ b/arch/powerpc/kernel/mce.c
> > @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
> >  {
> > long handled = 0;
> >  
> > -   __this_cpu_inc(irq_stat.mce_exceptions);
> > +   /*
> > +    * For pSeries we count the mce when we go into the virtual mode
> > +    * machine check handler. Hence skip it here. Also, we can't
> > +    * access per-cpu variables in real mode for LPAR.
> > +    */
> > +   if (early_cpu_has_feature(CPU_FTR_HVMODE))
> > +   __this_cpu_inc(irq_stat.mce_exceptions);
> >  
> > -   if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
> > +   /*
> > +    * See if the platform is capable of handling the machine check.
> > +    * Otherwise fall through and allow the CPU to handle this
> > +    * machine check.
> > +    */
> > +   if (ppc_md.machine_check_early)
> > +   handled = ppc_md.machine_check_early(regs);
> > +   else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
> > +   handled = cur_cpu_spec->machine_check_early(regs);  
> 
> Would be good to add a powernv ppc_md handler which does the
> cur_cpu_spec->machine_check_early() call now that other platforms are
> calling this code. Because those aren't valid as a fallback call, but
> specific to powernv.
> 

Something like this (untested)?

Subject: [PATCH] powerpc/powernv: define platform MCE handler.

---
 arch/powerpc/kernel/mce.c  |  3 ---
 arch/powerpc/platforms/powernv/setup.c | 11 +++
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 221271c96a57..ae17d8aa60c4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -498,12 +498,9 @@ long machine_check_early(struct pt_regs *regs)
 
/*
 * See if platform is capable of handling machine check.
-* Otherwise fallthrough and allow CPU to handle this machine check.
 */
if (ppc_md.machine_check_early)
handled = ppc_md.machine_check_early(regs);
-   else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
-   handled = cur_cpu_spec->machine_check_early(regs);
return handled;
 }
 
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index f96df0a25d05..b74c93bc2e55 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -431,6 +431,16 @@ static unsigned long pnv_get_proc_freq(unsigned int cpu)
return ret_freq;
 }
 
+static long pnv_machine_check_early(struct pt_regs *regs)
+{
+   long handled = 0;
+
+   if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
+   handled = cur_cpu_spec->machine_check_early(regs);
+
+   return handled;
+}
+
 

Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-12 Thread Oscar Salvador
> > I just roughly check, but if I checked the right place,
> > vmemmap_populated() checks for the section to contain the flags we are
> > setting in sparse_init_one_section().
> 
> Yes.
> 
> > But with this patch, we populate first everything, and then we call
> > sparse_init_one_section() in sparse_init().
> > As I said I could be mistaken because I just checked the surface.
> 
> Yeah I think that's correct.
> 
> This might just be a bug in our code, let me look at it a bit.

I wonder if something like this could do the trick:

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 51ce091914f9..e281651f50cd 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -177,6 +177,8 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
vmemmap_list = vmem_back;
 }
 
+static unsigned long last_addr_populated = 0;
+
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
struct vmem_altmap *altmap)
 {
@@ -191,7 +193,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
void *p;
int rc;
 
-   if (vmemmap_populated(start, page_size))
+   if (start + page_size <= last_addr_populated)
continue;
 
if (altmap)
@@ -212,6 +214,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
__func__, rc);
return -EFAULT;
}
+   last_addr_populated = start + page_size;
}

I know it looks hacky, and chances are it is wrong, but could you give it a try?
I will try to grab a ppc server and try it out too.
 
Thanks
-- 
Oscar Salvador
SUSE L3


[RFC] macintosh: Use common code to access RTC

2018-07-12 Thread Finn Thain
Once the 68k Mac port adopts the via-pmu driver, it must access
the PMU RTC using the appropriate command format. The same code can
then be used for both m68k and powerpc.

Replace the RTC code that's duplicated in arch/powerpc and arch/m68k
with common RTC accessors for Cuda and PMU devices.

While we're at it, drop the problematic WARN_ON that was introduced in
commit 22db552b50fa ("powerpc/powermac: Fix rtc read/write functions").

---
The patch below hasn't been tested yet and would need to be applied
after my PMU patch series (v4). So I'll probably append it to v5.

The patch below is an alternative both to the patch Arnd posted here,
https://lore.kernel.org/lkml/20180619140229.3615110-2-a...@arndb.de/
as well as another future patch to remove the WARN_ON from 
arch/powerpc/platforms/powermac/time.c.

---
 arch/m68k/mac/misc.c   |  75 +++
 arch/powerpc/platforms/powermac/time.c | 130 -
 drivers/macintosh/via-cuda.c   |  35 +
 drivers/macintosh/via-pmu.c|  33 +
 include/linux/cuda.h   |   4 +
 include/linux/pmu.h|   4 +
 6 files changed, 99 insertions(+), 182 deletions(-)

diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index 28090a44fa09..21e3afa48de9 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -33,34 +33,6 @@
 static void (*rom_reset)(void);
 
 #ifdef CONFIG_ADB_CUDA
-static long cuda_read_time(void)
-{
-   struct adb_request req;
-   long time;
-
-   if (cuda_request(&req, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
-   return 0;
-   while (!req.complete)
-   cuda_poll();
-
-   time = (req.reply[3] << 24) | (req.reply[4] << 16) |
-  (req.reply[5] << 8) | req.reply[6];
-   return time - RTC_OFFSET;
-}
-
-static void cuda_write_time(long data)
-{
-   struct adb_request req;
-
-   data += RTC_OFFSET;
-   if (cuda_request(&req, NULL, 6, CUDA_PACKET, CUDA_SET_TIME,
-(data >> 24) & 0xFF, (data >> 16) & 0xFF,
-(data >> 8) & 0xFF, data & 0xFF) < 0)
-   return;
-   while (!req.complete)
-   cuda_poll();
-}
-
 static __u8 cuda_read_pram(int offset)
 {
struct adb_request req;
@@ -86,34 +58,6 @@ static void cuda_write_pram(int offset, __u8 data)
 #endif /* CONFIG_ADB_CUDA */
 
 #ifdef CONFIG_ADB_PMU
-static long pmu_read_time(void)
-{
-   struct adb_request req;
-   long time;
-
-   if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0)
-   return 0;
-   while (!req.complete)
-   pmu_poll();
-
-   time = (req.reply[1] << 24) | (req.reply[2] << 16) |
-  (req.reply[3] << 8) | req.reply[4];
-   return time - RTC_OFFSET;
-}
-
-static void pmu_write_time(long data)
-{
-   struct adb_request req;
-
-   data += RTC_OFFSET;
-   if (pmu_request(&req, NULL, 5, PMU_SET_RTC,
-   (data >> 24) & 0xFF, (data >> 16) & 0xFF,
-   (data >> 8) & 0xFF, data & 0xFF) < 0)
-   return;
-   while (!req.complete)
-   pmu_poll();
-}
-
 static __u8 pmu_read_pram(int offset)
 {
struct adb_request req;
@@ -291,13 +235,17 @@ static long via_read_time(void)
  * is basically any machine with Mac II-style ADB.
  */
 
-static void via_write_time(long time)
+static void via_set_rtc_time(struct rtc_time *tm)
 {
union {
__u8 cdata[4];
long idata;
} data;
__u8 temp;
+   unsigned long time;
+
+   time = mktime(tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min, tm->tm_sec);
 
/* Clear the write protect bit */
 
@@ -635,12 +583,12 @@ int mac_hwclk(int op, struct rtc_time *t)
 #ifdef CONFIG_ADB_CUDA
case MAC_ADB_EGRET:
case MAC_ADB_CUDA:
-   now = cuda_read_time();
+   now = cuda_get_time();
break;
 #endif
 #ifdef CONFIG_ADB_PMU
case MAC_ADB_PB2:
-   now = pmu_read_time();
+   now = pmu_get_time();
break;
 #endif
default:
@@ -659,24 +607,21 @@ int mac_hwclk(int op, struct rtc_time *t)
 __func__, t->tm_year + 1900, t->tm_mon + 1, t->tm_mday,
 t->tm_hour, t->tm_min, t->tm_sec);
 
-   now = mktime(t->tm_year + 1900, t->tm_mon + 1, t->tm_mday,
-t->tm_hour, t->tm_min, t->tm_sec);
-
switch (macintosh_config->adb_type) {
case MAC_ADB_IOP:
case MAC_ADB_II:
case MAC_ADB_PB1:
-   via_write_time(now);
+   via_set_rtc_time(t);
break;
 #ifdef CONFIG_ADB_CUDA
case MAC_ADB_EGRET:
case 

Re: [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-07-12 Thread Nicholas Piggin
On Wed, 11 Jul 2018 21:00:44 +1000
Alexey Kardashevskiy  wrote:

> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
> and have not hit this yet: the very first time the mapping happens we
> do not have tbl::it_userspace allocated yet and fall back to userspace,
> which in turn calls the VFIO IOMMU driver; this fails and the guest
> does not retry.
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
> 
> This calculates maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy 


> @@ -199,6 +209,25 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>   }
>   }
>  populate:
> + pageshift = PAGE_SHIFT;
> + if (PageCompound(page)) {
> + pte_t *pte;
> + struct page *head = compound_head(page);
> + unsigned int compshift = compound_order(head);
> +
> + local_irq_save(flags); /* disables as well */
> + pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
> + local_irq_restore(flags);
> + if (!pte) {
> + ret = -EFAULT;
> + goto unlock_exit;
> + }
> + /* Double check it is still the same pinned page */
> + if (pte_page(*pte) == head && pageshift == compshift)
> + pageshift = max_t(unsigned int, pageshift, PAGE_SHIFT);

I don't understand this logic. If the page was different, the shift
would be wrong. You're not retrying but instead ignoring it in that
case.

I think I would be slightly happier with the definitely-not-racy
get_user_pages slow approach. Anything lock-less like this would be
a premature optimisation without performance numbers...

Thanks,
Nick


Re: [PATCH] powerpc: Replaced msleep(x) with msleep(OPAL_BUSY_DELAY_MS)

2018-07-12 Thread Nicholas Piggin
On Thu, 12 Jul 2018 15:46:06 +1000
Michael Ellerman  wrote:

> Daniel Klamt  writes:
> 
> > Replaced msleep(x) with with msleep(OPAL_BUSY_DELAY_MS)
> > to diocument these sleep is to wait for opal.
> >
> > Signed-off-by: Daniel Klamt 
> > Signed-off-by: Bjoern Noetel   
> 
> Thanks.
> 
> Your change log should be in the imperative mood, see:
> 
>   
> https://git.kernel.org/pub/scm/git/git.git/tree/Documentation/SubmittingPatches?id=HEAD#n133
> 
> 
> In this case that just means saying "Replace" rather than "Replaced".
> 
> Also the prefix should be "powerpc/xive". You can guess that by doing:
> 
>   $ git log --oneline arch/powerpc/sysdev/xive/native.c
> 
> And notice that the majority of commits use that prefix.
> 
> 
> I've fixed both of those things up for you.

Sorry, just noticed this. I've got a patch which changes the xive stuff
to the "standard" format, and this will clash with it.

   if (rc == OPAL_BUSY_EVENT) {
   msleep(OPAL_BUSY_DELAY_MS);
   opal_poll_events(NULL);
   } else if (rc == OPAL_BUSY) {
   msleep(OPAL_BUSY_DELAY_MS);
   }

If it's already merged that's fine, I can rebase.

Thanks,
Nick


[RFC PATCH] mm: optimise pte dirty/accessed bits handling in fork

2018-07-12 Thread Nicholas Piggin
fork clears dirty/accessed bits from new ptes in the child, even
though the mapping allows such accesses. This logic has existed
forever, certainly from well before physical page reclaim and
cleaning became strongly tied to pte access state, as they are
today. Now that they are, this access bit clearing logic does not
do much.

Other than this case, Linux is "eager" to set dirty/accessed bits
when setting up mappings, which avoids micro-faults (and page
faults on CPUs that implement these bits in software). With this
patch, there are no cases I could instrument where dirty/accessed
bits do not match the access permissions without memory pressure
(and without more exotic things like migration).

This speeds up a fork/exit microbenchmark by about 5% on POWER9
(which uses a software fault fallback mechanism to set these bits).
I expect it to be barely noticeable on x86 CPUs, but it would be
interesting to see. Other archs might care more, and anyway it's
always good if we can remove code and make things a bit faster.

I don't *think* I'm missing anything fundamental, but would be good
to be sure. Comments?

Thanks,
Nick

---

 mm/huge_memory.c |  4 ++--
 mm/memory.c  | 10 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..c1d41cad9aad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -974,7 +974,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
pmdp_set_wrprotect(src_mm, addr, src_pmd);
-   pmd = pmd_mkold(pmd_wrprotect(pmd));
+   pmd = pmd_wrprotect(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
ret = 0;
@@ -1065,7 +1065,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
 
pudp_set_wrprotect(src_mm, addr, src_pud);
-   pud = pud_mkold(pud_wrprotect(pud));
+   pud = pud_wrprotect(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
 
ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..3fea40da3a58 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1023,12 +1023,12 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
 
/*
-* If it's a shared mapping, mark it clean in
-* the child
+* Child inherits dirty and young bits from parent. There is no
+* point clearing them because any cleaning or aging has to walk
+* all ptes anyway, and it will notice the bits set in the parent.
+* Leaving them set avoids stalls and even page faults on CPUs that
+* handle these bits in software.
 */
-   if (vm_flags & VM_SHARED)
-   pte = pte_mkclean(pte);
-   pte = pte_mkold(pte);
 
page = vm_normal_page(vma, addr, pte);
if (page) {
-- 
2.17.0



Re: [PATCH kernel] powerpc/powernv/ioda2: Add 256M IOMMU page size to the default POWER8 case

2018-07-12 Thread Russell Currey
On Mon, 2018-07-02 at 17:42 +1000, Alexey Kardashevskiy wrote:
> The sketchy bypass uses 256M pages so add this page size as well.
> 
> This should cause no behavioral change but will be used later.
> 
> Fixes: 477afd6ea6 "powerpc/ioda: Use ibm,supported-tce-sizes for IOMMU page size mask"
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Russell Currey 

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6..557c11d 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2925,7 +2925,7 @@ static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
>   /* Add 16M for POWER8 by default */
>   if (cpu_has_feature(CPU_FTR_ARCH_207S) &&
>   !cpu_has_feature(CPU_FTR_ARCH_300))
> - mask |= SZ_16M;
> + mask |= SZ_16M | SZ_256M;
>   return mask;
>   }
>  


[PATCH kernel] KVM: PPC: Expose userspace mm context id via debugfs

2018-07-12 Thread Alexey Kardashevskiy
This adds a debugfs entry with the mm context id of a process which is
using KVM. This id is an index in the process table, so userspace can
dump that tree provided it is granted access to /dev/mem.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 58 +
 2 files changed, 59 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index fa4efa7..bb72667 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -284,6 +284,7 @@ struct kvm_arch {
u64 process_table;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
+   struct dentry *mm_ctxid_dentry;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f3a8cf..3b9eb17 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -2138,11 +2138,69 @@ static const struct file_operations debugfs_htab_fops = {
.llseek  = generic_file_llseek,
 };
 
+static int debugfs_mm_ctxid_open(struct inode *inode, struct file *file)
+{
+   struct kvm *kvm = inode->i_private;
+
+   kvm_get_kvm(kvm);
+   file->private_data = kvm;
+
+   return nonseekable_open(inode, file);
+}
+
+static int debugfs_mm_ctxid_release(struct inode *inode, struct file *file)
+{
+   struct kvm *kvm = file->private_data;
+
+   kvm_put_kvm(kvm);
+   return 0;
+}
+
+static ssize_t debugfs_mm_ctxid_read(struct file *file, char __user *buf,
+size_t len, loff_t *ppos)
+{
+   struct kvm *kvm = file->private_data;
+   ssize_t n, left, ret;
+   char tmp[64];
+
+   if (!kvm_is_radix(kvm))
+   return 0;
+
+   ret = snprintf(tmp, sizeof(tmp) - 1, "%lu\n", kvm->mm->context.id);
+   if (*ppos >= ret)
+   return 0;
+
+   left = min_t(ssize_t, ret - *ppos, len);
+   n = copy_to_user(buf, tmp + *ppos, left);
+   ret = left - n;
+   *ppos += ret;
+
+   return ret;
+}
+
+static ssize_t debugfs_mm_ctxid_write(struct file *file, const char __user *buf,
+  size_t len, loff_t *ppos)
+{
+   return -EACCES;
+}
+
+static const struct file_operations debugfs_mm_ctxid_fops = {
+   .owner   = THIS_MODULE,
+   .open= debugfs_mm_ctxid_open,
+   .release = debugfs_mm_ctxid_release,
+   .read= debugfs_mm_ctxid_read,
+   .write   = debugfs_mm_ctxid_write,
+   .llseek  = generic_file_llseek,
+};
+
 void kvmppc_mmu_debugfs_init(struct kvm *kvm)
 {
kvm->arch.htab_dentry = debugfs_create_file("htab", 0400,
kvm->arch.debugfs_dir, kvm,
			&debugfs_htab_fops);
+   kvm->arch.mm_ctxid_dentry = debugfs_create_file("mm_ctxid", 0400,
+   kvm->arch.debugfs_dir, kvm,
+   &debugfs_mm_ctxid_fops);
 }
 
 void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
-- 
2.11.0



Re: [PATCH 1/2] mm/cma: remove unsupported gfp_mask parameter from cma_alloc()

2018-07-12 Thread Christoph Hellwig
On Thu, Jul 12, 2018 at 11:48:47AM +0900, Joonsoo Kim wrote:
> One of the existing users is the general DMA layer, and it takes gfp
> flags provided by the user. I haven't checked all the DMA allocation
> sites, but how do you convince yourself that none of them try to use
> anything other than GFP_KERNEL [|__GFP_NOWARN]?

They still use a few other things like __GFP_COMP, __GFP_DMA or
GFP_HUGEPAGE.  But all these are bogus as we have various implementations
that can't respect them.  I plan to get rid of the gfp_t argument
in the dma_map_ops alloc method in a few merge windows because of that,
but it needs further implementation consolidation first.


Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-12 Thread Stephen Rothwell
Hi all,

On Thu, 12 Jul 2018 09:47:29 +1000 Stephen Rothwell  
wrote:
>
> On Wed, 11 Jul 2018 14:13:44 -0700 Andrew Morton  
> wrote:
> >
> > OK, I shall drop
> > mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch for now.  
> 
> I have dropped it from linux-next today (in case you don't get time).

I am certain I did drop it, but somehow it is still in there :-(

I will drop it for tomorrow.
-- 
Cheers,
Stephen Rothwell




Re: Several suspected memory leaks

2018-07-12 Thread Michael Ellerman
Michael Ellerman  writes:
> Hi Paul,
>
> Paul Menzel  writes:
>> Dear Linux folks,
>>
>> On the IBM S822LC (8335-GTA) with Ubuntu 18.04 I built Linux master
>> – 4.18-rc4+, commit 092150a2 (Merge branch 'for-linus'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid) – with
>> kmemleak. Several issues were found.
>
> Is this the first time you've tested it?
> Or did these warnings only show up recently?
>
>> ```
>> $ grep KMEMLEAK /boot/config-4.18.0-rc4+
>> CONFIG_HAVE_DEBUG_KMEMLEAK=y
>> CONFIG_DEBUG_KMEMLEAK=y
>> CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=1
>> # CONFIG_DEBUG_KMEMLEAK_TEST is not set
>> # CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
>
>
> I'm not seeing any warnings on my machine here, maybe it's something
> config related. Can you send your full .config ?

Scratch that. I just didn't wait long enough.

cheers


Re: [PATCH] [RESEND] powerpc: xmon: use ktime_get_coarse_boottime64

2018-07-12 Thread Michael Ellerman
Arnd Bergmann  writes:

> get_monotonic_boottime() is deprecated, and may not be safe to call in
> every context, as it has to read a hardware clocksource.
>
> This changes xmon to print the time using ktime_get_coarse_boottime64()
> instead, which avoids the old timespec type and the HW access.
>
> Acked-by: Balbir Singh 
> Signed-off-by: Arnd Bergmann 
> ---
> Originally sent Jun 18, but this hasn't appeared in linux-next yet.
>
> Resending to make sure this is still on the radar. Please apply
> to the powerpc git for 4.19

I had applied it but forgot to push.

Should be there today.

cheers