Re: generic DMA bypass flag

2019-11-13 Thread Christoph Hellwig
On Wed, Nov 13, 2019 at 02:45:15PM +, Robin Murphy wrote:
> In all honesty, this seems silly. If we can set a per-device flag to say 
> "oh, bypass these ops and use some other ops instead", then we can just as 
> easily simply give the device the appropriate ops in the first place. 
> Because, y'know, the name of the game is "per-device ops".

Except that we can't do it _that_ easily.  The problem is that for both
the powerpc and intel case the selection is dynamic.  If a device is in
the identity domain with intel-iommu (or the equivalent on powerpc, which
doesn't use the normal iommu framework), we still want to be able to map
memory for devices with a too-small dma mask through the iommu instead
of using swiotlb bouncing.  So to "just" use the
per-device dma ops we'd need:

  a) a hook in dma_direct_supported to pick another set of ops for
 small dma masks
  b) a hook in the IOMMU ops to propagate to the direct ops for full
 64-bit masks

I looked into that for powerpc a while ago and it wasn't pretty at all.
Compared to that just checking another flag for the DMA direct calls
is relatively clean and trivial, as seen in the diffstat for this series
alone.
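
For concreteness, a minimal sketch of what the flag check looks like in
the mapping fast path (field and helper names are illustrative, not
necessarily those used in the series):

/*
 * Sketch only: a per-device flag that tells the core mapping code to
 * bypass the installed IOMMU ops and use dma-direct, while keeping the
 * ops around for devices whose mask is too small.
 */
static inline bool dma_map_direct(struct device *dev,
				  const struct dma_map_ops *ops)
{
	return likely(!ops) || dev->dma_ops_bypass;
}

dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
		size_t offset, size_t size, enum dma_data_direction dir,
		unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_map_direct(dev, ops))
		return dma_direct_map_page(dev, page, offset, size, dir, attrs);
	return ops->map_page(dev, page, offset, size, dir, attrs);
}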

> I don't see a great benefit to pulling legacy cruft out into common code 
> instead of just working to get rid of it in-place, when said cruft pulls in 
> the opposite direction to where we're taking the common code (i.e. it's 
> inherently based on the premise of global ops).

I'm not sure what legacy cruft it pulls in.  I think it actually fits very
much into a mental model of "direct mapping is the default, to be overridden
if needed" which is pretty close to what we have at the moment.  Just
with a slightly more complicated processing of the override.


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Christoph Hellwig
On Wed, Nov 13, 2019 at 05:13:39PM +0100, Nicolas Saenz Julienne wrote:
> Using a mask to represent bus DMA constraints has a set of limitations.
> The biggest one being it can only hold a power of two (minus one). The
> DMA mapping code is already aware of this and treats dev->bus_dma_mask
> as a limit. This quirk is already used by some architectures, although
> it is still rare.
> 
> With the introduction of the Raspberry Pi 4 we've found a new contender
> for the use of bus DMA limits, as its PCIe bus can only address the
> lower 3GB of memory (of a total of 4GB). This is impossible to represent
> with a mask. To make things worse the device-tree code rounds non power
> of two bus DMA limits to the next power of two, which is unacceptable in
> this case.
> 
> In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
> over the tree and treat it as such. Note that dev->bus_dma_limit is
> meant to contain the highest accessible DMA address.

This looks sensible modulo the minor comments in this thread.

> 
> Signed-off-by: Nicolas Saenz Julienne 
> 
> ---
> 
> Note this is rebased on top of Christoph's latest DMA series:
> https://www.spinics.net/lists/arm-kernel/msg768600.html

FYI, I'll plan to merge those tonight unless anyone screams.
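
For reference, the semantic change boils down to comparing against a
limit instead of AND-ing with a mask -- roughly this in the dma-direct
addressability check (paraphrased sketch, not the exact patch hunk):

/* Sketch: addressability becomes a range check against the smaller of
 * the device mask and the bus limit, so a 3 GB limit works even though
 * it is not a power of two minus one. */
static inline bool dma_capable(struct device *dev, dma_addr_t addr,
			       size_t size)
{
	dma_addr_t end = addr + size - 1;

	if (!dev->dma_mask)
		return false;
	return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
}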


Re: [PATCH] powerpc/perf: Disable trace_imc pmu

2019-11-13 Thread maddy



On 11/14/19 12:50 PM, Oliver O'Halloran wrote:

On Thu, Nov 14, 2019 at 6:19 PM Madhavan Srinivasan
 wrote:

When a root user or a user with CAP_SYS_ADMIN
privilege uses trace_imc performance monitoring
unit events to monitor application or KVM threads,
it may result in a checkstop (system crash), the
reason being frequent switching of the
"trace/accumulation" mode of the In-Memory
Collection hardware.
This patch disables the trace_imc pmu unit; it will
be re-enabled at a later stage with a fix patchset.
---
  arch/powerpc/platforms/powernv/opal-imc.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index e04b20625cb9..5790f078771f 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -285,7 +285,12 @@ static int opal_imc_counters_probe(struct platform_device 
*pdev)
 domain = IMC_DOMAIN_THREAD;
 break;
 case IMC_TYPE_TRACE:
-   domain = IMC_DOMAIN_TRACE;
+   /* Using trace_imc events to monitor
+* application or KVM thread performances
+* may result in a checkstop (system crash).
+* So disabling it for now.
+*/
+   domain = -1;
 break;
 default:
 pr_warn("IMC Unknown Device type \n");
--
2.21.0


Does this need a Fixes: tag?
I was thinking of referencing this commit in the Fixes: tag of the
upcoming fix patchset. But if that's not right, I can add the Fixes:
tag here, along with a request to send it to stable, and post a v2.


Maddy



Re: [PATCH] powerpc/perf: Disable trace_imc pmu

2019-11-13 Thread Oliver O'Halloran
On Thu, Nov 14, 2019 at 6:19 PM Madhavan Srinivasan
 wrote:
>
> When a root user or a user with CAP_SYS_ADMIN
> privilege uses trace_imc performance monitoring
> unit events to monitor application or KVM threads,
> it may result in a checkstop (system crash), the
> reason being frequent switching of the
> "trace/accumulation" mode of the In-Memory
> Collection hardware.
> This patch disables the trace_imc pmu unit; it will
> be re-enabled at a later stage with a fix patchset.
> ---
>  arch/powerpc/platforms/powernv/opal-imc.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
> b/arch/powerpc/platforms/powernv/opal-imc.c
> index e04b20625cb9..5790f078771f 100644
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -285,7 +285,12 @@ static int opal_imc_counters_probe(struct 
> platform_device *pdev)
> domain = IMC_DOMAIN_THREAD;
> break;
> case IMC_TYPE_TRACE:
> -   domain = IMC_DOMAIN_TRACE;
> +   /* Using trace_imc events to monitor
> +* application or KVM thread performances
> +* may result in a checkstop (system crash).
> +* So disabling it for now.
> +*/
> +   domain = -1;
> break;
> default:
> pr_warn("IMC Unknown Device type \n");
> --
> 2.21.0
>

Does this need a Fixes: tag?


[PATCH] powerpc/perf: Disable trace_imc pmu

2019-11-13 Thread Madhavan Srinivasan
When a root user or a user with CAP_SYS_ADMIN
privilege uses trace_imc performance monitoring
unit events to monitor application or KVM threads,
it may result in a checkstop (system crash), the
reason being frequent switching of the
"trace/accumulation" mode of the In-Memory
Collection hardware.
This patch disables the trace_imc pmu unit; it will
be re-enabled at a later stage with a fix patchset.
---
 arch/powerpc/platforms/powernv/opal-imc.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index e04b20625cb9..5790f078771f 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -285,7 +285,12 @@ static int opal_imc_counters_probe(struct platform_device 
*pdev)
domain = IMC_DOMAIN_THREAD;
break;
case IMC_TYPE_TRACE:
-   domain = IMC_DOMAIN_TRACE;
+   /* Using trace_imc events to monitor
+* application or KVM thread performances
+* may result in a checkstop (system crash).
+* So disabling it for now.
+*/
+   domain = -1;
break;
default:
pr_warn("IMC Unknown Device type \n");
-- 
2.21.0



Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Ram Pai
On Thu, Nov 14, 2019 at 04:08:25PM +1100, Paul Mackerras wrote:
> On Wed, Nov 13, 2019 at 01:50:42PM -0800, Ram Pai wrote:
> > On Thu, Nov 14, 2019 at 08:18:24AM +1100, Paul Mackerras wrote:
> > > On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> > > > On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > > > > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > > > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > > > > There is a subtle problem removing that code from the assembly.
> > > > > > > > 
> > > > > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > > > > clearing
> > > > > > > > kvm->arch.secure_guest, the hypervisor will continue to think 
> > > > > > > > that the
> > > > > > > > VM is a secure VM.   However the primary reason the 
> > > > > > > > H_SVM_INIT_ABORT
> > > > > > > > hcall was invoked, was to inform the Hypervisor that it should 
> > > > > > > > no longer
> > > > > > > > consider the VM as a Secure VM. So there is an inconsistency 
> > > > > > > > there.
> > > > > > > 
> > > > > > > Most of the checks that look at whether a VM is a secure VM use 
> > > > > > > code
> > > > > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that 
> > > > > > > will
> > > > > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact 
> > > > > > > in
> > > > > > > most places we will treat the VM as a normal VM from then on.  If
> > > > > > > there are any places where we still need to treat the VM as a 
> > > > > > > secure
> > > > > > > VM while we are processing the abort we can easily do that too.
> > > > > > 
> > > > > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return 
> > > > > > back
> > > > > > to the Ultravisor?   Because removing that assembly code will NOT 
> > > > > > lead the
> > > > > 
> > > > > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > > > > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> > > > 
> > > > In the fast_guest_return path, if it finds 
> > > > (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it 
> > > > return to
> > > > UV or not?
> > > > 
> > > > Ideally it should return back to the ultravisor the first time
> > > > KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.
> > > 
> > > What is ideal about that behavior?  Why would that be a particularly
> > > good thing to do?
> > 
> > It is following the rule -- "return back to the caller".
> 
> That doesn't address the question of why vcpu->arch.secure_guest
> should be cleared at the point where we do that.

Ok. here is the sequence that I expect to happen.

---
1) VM makes a ESM(Enter Secure Mode) ucall.

  1A) UV does the H_SVM_INIT_START hcall... the chit-chat between UV and HV 
happens
and somewhere in that chit-chat, UV decides that the ESM call
cannot be successfully completed.  Hence it calls
H_SVM_INIT_ABORT to inform the Hypervisor.

1A_a) Hypervisor aborts the VM's transition and returns back to the 
ultravisor.

  1B) UV cleans up the state of the VM on its side and returns
back to the VM, with failure.  VM is still in a normal VM state.
Its MSR(S) is still 0.

2) VM gets a HDEC exception.

   2A) Hypervisor receives that HDEC exception. It handles the exception
and returns back to the VM.
---

If you agree with the above sequence, then the patch needs all the
proposed changes.


At step (1A_a) and step (2A),  the hypervisor is faced with the question
--  Where should the control be returned to?

  for step 1A_a it has to return to the UV, and for step 2A it has to
  return to the VM. In other words the control has to **return back to
  the caller**.


It certainly has to rely on the vcpu->arch.secure_guest flag to do so.
If any flag in vcpu->arch.secure_guest is set, it has to return to
UV.  Otherwise it has to return to the VM.

This is the reason the proposed patch, in the fast_guest_return path,
looks at vcpu->arch.secure_guest. If it finds any flag set, it
returns to UV. However, before returning to UV, it also clears all the
flags if H_SVM_INIT_ABORT is set. This ensures that the HV does not
return to the wrong place (i.e., to UV) at step 2A, but to the VM.
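
Expressed as (hypothetical) C rather than the actual assembly, the
decision amounts to this; return_to_ultravisor()/return_to_vm() are
illustrative names, not real functions:

/* Illustrative only -- the real logic lives in powerpc assembly. */
if (vcpu->arch.secure_guest) {
	/* Step 1A_a: an abort must still return to the UV this once,
	 * but clear the flags so the next exit (step 2A) goes back
	 * to the VM. */
	if (vcpu->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT)
		vcpu->arch.secure_guest = 0;
	return_to_ultravisor();
} else {
	return_to_vm();		/* Step 2A */
}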

RP



Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread John Hubbard
On 11/13/19 3:22 PM, John Hubbard wrote:
> On 11/13/19 2:43 AM, Jan Kara wrote:
> ...
>> How does FOLL_PIN result in grabbing (at least normal, for now) page 
>> reference?
>> I didn't find that anywhere in this patch but it is a prerequisite to
>> converting any user to pin_user_pages() interface, right?
> 
> 
> ohhh, I messed up on this intermediate patch: it doesn't quite stand alone as
> it should, as you noticed. To correct this, I can do one of the following:
> 
> a) move the new pin*() routines into the later patch 16 ("mm/gup:
> track FOLL_PIN pages"), or
> 
> b) do a temporary thing here, such as setting FOLL_GET and adding a TODO,
> within the pin*() implementations, and then switching it over to FOLL_PIN
> in patch 16.
> 
> I'm thinking (a) is less error-prone, so I'm going with that unless someone
> points out that that is stupid. :)
> 

OK, just to save anyone from wasting time reading the above: (a) is, in fact,
stupid, after all. ha. That is because pin_user_pages() is called in the 
intervening patches.
 
So anyway, I'll work out an ordering to fix it up, it's not complicated.


thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH stable 4.4] powerpc/boot: Request no dynamic linker for boot wrapper

2019-11-13 Thread Greg KH
On Tue, Nov 12, 2019 at 05:59:41PM +1100, Andrew Donnellan wrote:
> From: Nicholas Piggin 
> 
> Commit ff45000fcb56b5b0f1a14a865d3541746d838a0a upstream.
> 
> The boot wrapper performs its own relocations and does not require
> PT_INTERP segment. However currently we don't tell the linker that.
> 
> Prior to binutils 2.28 that works OK. But since binutils commit
> 1a9ccd70f9a7 ("Fix the linker so that it will not silently generate ELF
> binaries with invalid program headers. Fix readelf to report such
> invalid binaries.") binutils tries to create a program header segment
> due to PT_INTERP, and the link fails because there is no space for it:
> 
>   ld: arch/powerpc/boot/zImage.pseries: Not enough room for program headers, 
> try linking with -N
>   ld: final link failed: Bad value
> 
> So tell the linker not to do that, by passing --no-dynamic-linker.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Anton Blanchard 
> Signed-off-by: Nicholas Piggin 
> [mpe: Drop dependency on ld-version.sh and massage change log]
> Signed-off-by: Michael Ellerman 
> [ajd: backport to v4.4 (resolve conflict with a comment line)]
> Signed-off-by: Andrew Donnellan 
> ---
>  arch/powerpc/boot/wrapper | 24 +++-
>  1 file changed, 23 insertions(+), 1 deletion(-)

Now queued up, thanks.

greg k-h


Re: [4.4] Backport request: powerpc: Fix compiling a BE kernel with a powerpc64le toolchain

2019-11-13 Thread Greg KH
On Tue, Nov 12, 2019 at 07:52:24PM +1100, Andrew Donnellan wrote:
> Dear stable team
> 
> Please backport the following patches.
> 
> Commits:
> 
> - 164af597ce945751e2dcd53d0a86e84203a6d117
>   ("powerpc/Makefile: Use cflags-y/aflags-y for setting endian options")
> 
> - 4dc831aa88132f835cefe876aa0206977c4d7710
>   ("powerpc: Fix compiling a BE kernel with a powerpc64le toolchain")
> 
> Stable tree targeted: 4.4 (applies cleanly)
> 
> Justification: This fixes the build when attempting to compile a BE powerpc
> kernel using a bi-endian toolchain that defaults to LE, which is a common
> setup.
> 
> I have tested that these patches apply cleanly and appear to rectify the
> build failure on my machine.

Now queued up, thanks.

greg k-h


Re: Pull request: scottwood/linux.git next

2019-11-13 Thread Jason Yan




On 2019/11/13 17:23, Michael Ellerman wrote:

It needs to use PTRRELOC() for the kernstart_virt_addr accesses.

I've made that change and squashed it into the series. I've pushed that
as a branch to here:
   https://github.com/linuxppc/linux/commits/topic/kaslr-book3e32


That boots for me on qemu mac99.

Jason can you please test it on your setup with KASLR enabled to make
sure it still works.



Thanks Michael,

I have confirmed it works with KASLR enabled on my machine.

[0.00] Memory CAM mapping: 256/256/256 Mb, residual: 3328Mb
[0.00] Linux version 5.4.0-rc2-00012-gc2d1a13520ee (yanaijie@138) (gcc version 7.3.0 (Compiler CPU V200R003C00SPC010B006)) #1 SMP Thu Nov 14 14:16:28 CST 2019
[0.00] Using QEMU e500 machine description
[0.00] printk: bootconsole [udbg0] enabled
[0.00] CPU maps initialized for 1 thread per core
[0.00] -
[0.00] phys_mem_size = 0x1
[0.00] dcache_bsize  = 0x40
[0.00] icache_bsize  = 0x40
[0.00] cpu_features  = 0x01b4
[0.00]   possible= 0x010103bc
[0.00]   always  = 0x0020
[0.00] cpu_user_features = 0x8c008000 0x0800
[0.00] mmu_features  = 0x000a0010
[0.00] physical_start= 0x93a4000
[0.00] -
[0.00] barrier-nospec: using isync; sync as speculation barrier
[0.00] Zone ranges:
[0.00]   Normal   [mem 0x-0x2fff]
[0.00]   HighMem  [mem 0x3000-0x]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0x]
[0.00] Initmem setup node 0 [mem 0x-0x]
[0.00] MMU: Allocated 1088 bytes of context maps for 255 contexts
[0.00] percpu: Embedded 19 pages/cpu s47884 r8192 d21748 u77824
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 1046656
[0.00] Kernel command line: console=ttyS0 IP=192.168.25.187 root=/dev/vda rw
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
[0.00] Memory: 4066576K/4194304K available (11852K kernel code, 780K rwdata, 2844K rodata, 3128K init, 265K bss, 127728K reserved, 0K cma-reserved, 3407872K highmem)
[0.00] Kernel virtual memory layout:
[0.00]   * 0xfff5f000..0xf000  : fixmap
[0.00]   * 0xffc0..0xffe0  : highmem PTEs
[0.00]   * 0xffbfe000..0xffc0  : early ioremap
[0.00]   * 0xf100..0xffbfe000  : vmalloc & ioremap
[0.00] rcu: Hierarchical RCU implementation.
[0.00] rcu: RCU event tracing is enabled.
[0.00] rcu: RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=1.
[0.00] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[0.00] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[0.00] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
[0.00] mpic: Setting up MPIC " OpenPIC  " version 1.2 at fe004, max 1 CPUs
[0.00] mpic: ISU size: 256, shift: 8, mask: ff
[0.00] mpic: Initializing for 256 sources
[0.00] random: get_random_u32 called from start_kernel+0x32c/0x4f4 with crng_init=0
[0.000581] clocksource: timebase: mask: 0x max_cycles: 0x5c4093a7d1, max_idle_ns: 440795210635 ns
[0.001136] clocksource: timebase mult[280] shift[24] registered
[0.012660] Console: colour dummy device 80x25
[0.013981] pid_max: default: 32768 minimum: 301
[0.015472] Mount-cache hash table entries: 2048 (order: 1, 8192 bytes, linear)
[0.015838] Mountpoint-cache hash table entries: 2048 (order: 1, 8192 bytes, linear)
[0.018991] *** VALIDATE tmpfs ***
[0.031847] *** VALIDATE proc ***
[0.034801] *** VALIDATE cgroup1 ***
[0.034974] *** VALIDATE cgroup2 ***
[0.053184] e500 family performance monitor hardware support registered
[0.054372] rcu: Hierarchical SRCU implementation.
[0.057294] smp: Bringing up secondary CPUs ...
[0.057549] smp: Brought up 1 node, 1 CPU
[0.057855] Using standard scheduler topology
[0.068552] devtmpfs: initialized
[0.077040] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 764504178510 ns
[0.077544] futex hash table entries: 256 (order: 2, 16384 bytes, linear)
[0.086677] NET: Registered protocol family 16
[0.089312] audit: initializing netlink subsys (disabled)
[0.115531] Found FSL PCI host bridge at 0x000fe0008000. Firmware bus number: 0->255
[0.115920] PCI 

Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-13 Thread Al Viro
On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro  wrote:
> > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > >  void nd_jump_link(struct path *path)
> > >  {
> > >   struct nameidata *nd = current->nameidata;
> > > +
> > > + nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > >   path_put(&nd->path);
> > >  
> > >   nd->path = *path;
> > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > >   if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > >   if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > >   return ERR_PTR(-ELOOP);
> > > + if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > + if (!nd->last_magiclink.same_mnt)
> > > + return ERR_PTR(-EXDEV);
> > > + }
> > >   }
> > 
> > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> 
> This could be done, but the reason for stashing it away in
> last_magiclink is because of the future magic-link re-opening patches
> which can't be implemented like that without putting the open_flags
> inside nameidata (which was decided to be too ugly a while ago).
> 
> My point being that I could implement it this way for this series, but
> I'd have to implement something like last_magiclink when I end up
> re-posting the magic-link stuff in a few weeks.
> 
> Looking at all the nd_jump_link() users, the other option is to just
> disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
thing that allowing them permits is resolving file descriptors that
point to the same procfs mount -- and it's unclear to me how useful
> that really is (apparmorfs and nsfs will always give -EXDEV because
> aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
out of the get_link().  If you want to generate some error if
nd_jump_link() has been called, just do it right there.  The fewer
pieces of state need to be carried around, the better...

And as for opening them...  Why would you need full open_flags in there?
Details, please...
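
For what it's worth, a rough, untested sketch of the nd_jump_link()
variant suggested above, if it were changed to return an error (error
handling illustrative):

/* Sketch: nd_jump_link() rejects the jump itself instead of stashing
 * state for get_link() to inspect later. */
int nd_jump_link(struct path *path)
{
	struct nameidata *nd = current->nameidata;

	if (unlikely(nd->flags & LOOKUP_NO_XDEV) &&
	    nd->path.mnt != path->mnt) {
		path_put(path);
		return -EXDEV;
	}
	path_put(&nd->path);
	nd->path = *path;
	nd->inode = nd->path.dentry->d_inode;
	nd->flags |= LOOKUP_JUMPED;
	return 0;
}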


Re: [PATCH v4 30/47] serial: ucc_uart: factor out soft_uart initialization

2019-11-13 Thread Timur Tabi
On Fri, Nov 8, 2019 at 7:03 AM Rasmus Villemoes
 wrote:
>
> -   /*
> -* Determine if we need Soft-UART mode
> -*/
> if (of_find_property(np, "soft-uart", NULL)) {
> dev_dbg(&ofdev->dev, "using Soft-UART mode\n");
> soft_uart = 1;
> +   } else {
> +   return 0;
> }

How about:

if (!of_find_property(np, "soft-uart", NULL))
return 0;

And I think you should be able to get rid of the "soft_uart" variable.


Re: [PATCH v4 04/47] soc: fsl: qe: introduce qe_io{read,write}* wrappers

2019-11-13 Thread Timur Tabi

On 11/12/19 1:14 AM, Rasmus Villemoes wrote:

but that's because readl and writel by definition work on little-endian
registers. I.e., on a BE platform, the readl and writel implementation
must themselves contain a swab, so the above would end up doing two
swabs on a BE platform.


Do you know whether the compiler optimizes out the double swab?


(On PPC, there's a separate definition of mmio_read32be, namely
writel_be, which in turn does an out_be32, so on PPC that doesn't
actually end up doing two swabs).

So ioread32be etc. have well-defined semantics: access a big-endian
register and return the result in native endianness.


It seems weird that there aren't any cross-arch lightweight 
endian-specific I/O accessors.
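
For context, the wrappers under discussion boil down to roughly the
following (a sketch; the series defines them as inline helpers and the
exact config guards may differ):

/* Sketch: one accessor name that compiles to a native big-endian
 * access on PPC (no double swab) and to the generic ioread32be
 * elsewhere. */
#ifdef CONFIG_PPC32
static inline u32 qe_ioread32be(const u32 __iomem *addr)
{
	return in_be32(addr);
}
static inline void qe_iowrite32be(u32 val, u32 __iomem *addr)
{
	out_be32(addr, val);
}
#else
static inline u32 qe_ioread32be(const u32 __iomem *addr)
{
	return ioread32be(addr);
}
static inline void qe_iowrite32be(u32 val, u32 __iomem *addr)
{
	iowrite32be(val, addr);
}
#endif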


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Paul Mackerras
On Wed, Nov 13, 2019 at 01:50:42PM -0800, Ram Pai wrote:
> On Thu, Nov 14, 2019 at 08:18:24AM +1100, Paul Mackerras wrote:
> > On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> > > On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > > > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > > > There is a subtle problem removing that code from the assembly.
> > > > > > > 
> > > > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > > > clearing
> > > > > > > kvm->arch.secure_guest, the hypervisor will continue to think 
> > > > > > > that the
> > > > > > > VM is a secure VM.   However the primary reason the 
> > > > > > > H_SVM_INIT_ABORT
> > > > > > > hcall was invoked, was to inform the Hypervisor that it should no 
> > > > > > > longer
> > > > > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > > > > 
> > > > > > Most of the checks that look at whether a VM is a secure VM use code
> > > > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that 
> > > > > > will
> > > > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > > > > most places we will treat the VM as a normal VM from then on.  If
> > > > > > there are any places where we still need to treat the VM as a secure
> > > > > > VM while we are processing the abort we can easily do that too.
> > > > > 
> > > > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return 
> > > > > back
> > > > > to the Ultravisor?   Because removing that assembly code will NOT 
> > > > > lead the
> > > > 
> > > > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > > > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> > > 
> > > In the fast_guest_return path, if it finds 
> > > (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it 
> > > return to
> > > UV or not?
> > > 
> > > Ideally it should return back to the ultravisor the first time
> > > KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.
> > 
> > What is ideal about that behavior?  Why would that be a particularly
> > good thing to do?
> 
> It is following the rule -- "return back to the caller".

That doesn't address the question of why vcpu->arch.secure_guest
should be cleared at the point where we do that.

Paul.


Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-13 Thread Paul Mackerras
On Wed, Nov 13, 2019 at 08:59:08PM +0530, Bharata B Rao wrote:
> On Tue, Nov 12, 2019 at 04:34:34PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> > [snip]
> > > @@ -5442,6 +5471,64 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu 
> > > *vcpu, ulong *eaddr, void *ptr,
> > >   return rc;
> > >  }
> > >  
> > > +/*
> > > + *  IOCTL handler to turn off secure mode of guest
> > > + *
> > > + * - Issue ucall to terminate the guest on the UV side
> > > + * - Unpin the VPA pages (Enables these pages to be migrated back
> > > + *   when VM becomes secure again)
> > > + * - Recreate partition table as the guest is transitioning back to
> > > + *   normal mode
> > > + * - Release all device pages
> > > + */
> > > +static int kvmhv_svm_off(struct kvm *kvm)
> > > +{
> > > + struct kvm_vcpu *vcpu;
> > > + int srcu_idx;
> > > + int ret = 0;
> > > + int i;
> > > +
> > > + if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> > > + return ret;
> > > +
> > 
> > A further comment on this code: it should check that no vcpus are
> > running and fail if any are running, and it should prevent any vcpus
> > from running until the function is finished, using code like that in
> > kvmhv_configure_mmu().  That is, it should do something like this:
> > 
> > mutex_lock(&kvm->arch.mmu_setup_lock);
> > mmu_was_ready = kvm->arch.mmu_ready;
> > if (kvm->arch.mmu_ready) {
> > kvm->arch.mmu_ready = 0;
> > /* order mmu_ready vs. vcpus_running */
> > smp_mb();
> > if (atomic_read(&kvm->arch.vcpus_running)) {
> > kvm->arch.mmu_ready = 1;
> > ret = -EBUSY;
> > goto out_unlock;
> > }
> > }
> > 
> > and then after clearing kvm->arch.secure_guest below:
> > 
> > > + srcu_idx = srcu_read_lock(&kvm->srcu);
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + struct kvm_memory_slot *memslot;
> > > + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > > +
> > > + if (!slots)
> > > + continue;
> > > +
> > > + kvm_for_each_memslot(memslot, slots) {
> > > + kvmppc_uvmem_drop_pages(memslot, kvm, true);
> > > + uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
> > > + }
> > > + }
> > > + srcu_read_unlock(&kvm->srcu, srcu_idx);
> > > +
> > > + ret = uv_svm_terminate(kvm->arch.lpid);
> > > + if (ret != U_SUCCESS) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + spin_lock(&vcpu->arch.vpa_update_lock);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.dtl);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.vpa);
> > > + spin_unlock(&vcpu->arch.vpa_update_lock);
> > > + }
> > > +
> > > + ret = kvmppc_reinit_partition_table(kvm);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + kvm->arch.secure_guest = 0;
> > 
> > you need to do:
> > 
> > kvm->arch.mmu_ready = mmu_was_ready;
> >  out_unlock:
> > mutex_unlock(&kvm->arch.mmu_setup_lock);
> > 
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > 
> > With that extra check in place, it should be safe to unpin the vpas if
> > there is a good reason to do so.  ("Userspace has some bug that we
> > haven't found" isn't a good reason to do so.)
> 
> QEMU indeed does set_one_reg to reset the VPAs but that only marks
> the VPA update as pending. The actual unpinning happens when vcpu
> gets to run after reset at which time the VPAs are updated after
> any unpinning (if required)
> 
> When a secure guest reboots, vcpu 0 gets to run and unpins its
> VPA pages and then proceeds with switching to secure. Here UV
> tries to page-in all the guest pages, including the still pinned
> VPA pages corresponding to other vcpus which haven't had a chance
> to run till now. They are all still pinned and hence page-in fails.
> 
> To prevent this, we have to explicitly unpin the VPA pages during
> this svm off ioctl. This will ensure that SMP secure guest is able
> to reboot correctly.

OK, that makes sense.  Please put a comment in the code explaining
this briefly.
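
Something along these lines would do (wording is only a suggestion):

/*
 * When a secure guest reboots, only the vcpus that actually run get to
 * unpin their VPA pages.  The UV then pages in all guest pages during
 * the transition back to secure, and any VPA pages still pinned on
 * behalf of never-run vcpus make that page-in fail.  Hence unpin all
 * VPAs explicitly here.
 */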

> So I will incorporate the code chunk you have shown above to fail
> if any vcpu is running and prevent any vcpu from running when
> we unpin VPAs from this ioctl.

Sounds good.

Paul.


Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution

2019-11-13 Thread Aleksa Sarai
On 2019-11-13, Aleksa Sarai  wrote:
> On 2019-11-13, Al Viro  wrote:
> > Minor nit here - I'd split "move the conditional call of set_root()
> > into nd_jump_root()" into a separate patch before that one.  Makes
> > for fewer distractions in this one.  I'd probably fold "and be
> > ready for errors other than -ECHILD" into the same preliminary
> > patch.
> 
> Will do.
> 
> > > + /* Not currently safe for scoped-lookups. */
> > > + if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > > + return ERR_PTR(-EXDEV);
> > 
> > Also a candidate for doing in nd_jump_link()...
> > 
> > > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > >   struct inode *inode = nd->inode;
> > >  
> > >   while (1) {
> > > - if (path_equal(&nd->path, &nd->root))
> > > + if (path_equal(&nd->path, &nd->root)) {
> > > + if (unlikely(nd->flags & LOOKUP_BENEATH))
> > > + return -EXDEV;
> > 
> > Umm...  Are you sure it's not -ECHILD?
> 
> It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
> a success would be in REF-walk if the parent components didn't already
> trigger an unlazy_walk() in RCU-walk.
> 
> I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
> follow_dotdot_rcu()?

Scratch the last question -- AFAICS we don't need to do that for
LOOKUP_NO_XDEV because we check against mount_lock so it's very unlikely
that -ECHILD will have any benefit.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-13 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> 
> > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> >  void nd_jump_link(struct path *path)
> >  {
> > struct nameidata *nd = current->nameidata;
> > +
> > +   nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > path_put(&nd->path);
> >  
> > nd->path = *path;
> > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > return ERR_PTR(-ELOOP);
> > +   if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > +   if (!nd->last_magiclink.same_mnt)
> > +   return ERR_PTR(-EXDEV);
> > +   }
> > }
> 
> Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> into nd_jump_link()?  Or just have nd_jump_link() return an error...

This could be done, but the reason for stashing it away in
last_magiclink is because of the future magic-link re-opening patches
which can't be implemented like that without putting the open_flags
inside nameidata (which was decided to be too ugly a while ago).

My point being that I could implement it this way for this series, but
I'd have to implement something like last_magiclink when I end up
re-posting the magic-link stuff in a few weeks.

Looking at all the nd_jump_link() users, the other option is to just
disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
thing that allowing them permits is resolving file descriptors that
point to the same procfs mount -- and it's unclear to me how useful
that really is (apparmorfs and nsfs will always give -EXDEV because
aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v4 23/23] mm/gup: remove support for gup(FOLL_LONGTERM)

2019-11-13 Thread John Hubbard

On 11/13/19 11:09 AM, Ira Weiny wrote:
...

diff --git a/mm/gup.c b/mm/gup.c
index 82e7e4ce5027..90f5f95ee7ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1756,11 +1756,11 @@ long get_user_pages(unsigned long start, unsigned long 
nr_pages,
struct vm_area_struct **vmas)
  {
/*
-* FOLL_PIN must only be set internally by the pin_user_page*() and
-* pin_longterm_*() APIs, never directly by the caller, so enforce that
-* with an assertion:
+* FOLL_PIN and FOLL_LONGTERM must only be set internally by the
+* pin_user_page*() and pin_longterm_*() APIs, never directly by the
+* caller, so enforce that with an assertion:
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+   if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))


Don't we want to block FOLL_LONGTERM in get_user_pages_fast() as well after all
this?



Yes. But with the latest idea to restore FOLL_LONGTERM to its original glory,
that won't be an issue in the next version. heh.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread John Hubbard

On 11/13/19 2:43 AM, Jan Kara wrote:
...

How does FOLL_PIN result in grabbing (at least normal, for now) page reference?
I didn't find that anywhere in this patch but it is a prerequisite to
converting any user to pin_user_pages() interface, right?



ohhh, I messed up on this intermediate patch: it doesn't quite stand alone as
it should, as you noticed. To correct this, I can do one of the following:

a) move the new pin*() routines into the later patch 16 ("mm/gup:
track FOLL_PIN pages"), or

b) do a temporary thing here, such as setting FOLL_GET and adding a TODO,
within the pin*() implementations, and then switching it over to FOLL_PIN
in patch 16.

I'm thinking (a) is less error-prone, so I'm going with that unless someone
points out that that is stupid. :)


...

I was somewhat wondering about the number of functions you add here. So we
have:
pin_user_pages()

pin_user_pages_fast()
pin_user_pages_remote()

and then longterm variants:

pin_longterm_pages()
pin_longterm_pages_fast()
pin_longterm_pages_remote()

and obviously we have gup like:
get_user_pages()
get_user_pages_fast()
get_user_pages_remote()
... and some other gup variants ...

I think we really should have pin_* vs get_* variants as they are very
different in terms of guarantees and after conversion, any use of get_*
variant in non-mm code should be closely scrutinized. OTOH pin_longterm_*
don't look *that* useful to me and just using pin_* instead with
FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of
functions which is already large enough? What do people think? I don't feel
too strongly about this but wanted to bring this up.

Honza


Sounds just right to me, and I see that Dan and Ira also like it.
So I'll proceed with that.
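
i.e., long-term-pin callers would end up with something like this
(illustrative call only):

/* Sketch: no separate pin_longterm_*() entry points; the caller ORs in
 * FOLL_LONGTERM, and pin_user_pages() internally sets FOLL_PIN. */
ret = pin_user_pages(start, nr_pages, gup_flags | FOLL_LONGTERM,
		     pages, NULL);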

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 02/23] mm/gup: factor out duplicate code from four routines

2019-11-13 Thread John Hubbard

On 11/13/19 3:15 AM, Jan Kara wrote:

On Tue 12-11-19 20:26:49, John Hubbard wrote:

There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Reviewed-by: Jérôme Glisse 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 



diff --git a/mm/gup.c b/mm/gup.c
index 85caf76b3012..199da99e8ffc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
  }
  #endif
  
+static int __record_subpages(struct page *page, unsigned long addr,

+unsigned long end, struct page **pages, int nr)
+{
+   int nr_recorded_pages = 0;
+
+   do {
+   pages[nr] = page;
+   nr++;
+   page++;
+   nr_recorded_pages++;
+   } while (addr += PAGE_SIZE, addr != end);
+   return nr_recorded_pages;
+}


Why don't you just pass in pages + nr to begin with?


Aha, that does save a function argument. Will do.
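
Roughly, the reworked helper becomes (a sketch; the version that
actually gets merged may differ):

static int __record_subpages(struct page *page, unsigned long addr,
			     unsigned long end, struct page **pages)
{
	int nr;

	/* Caller passes pages + nr, so we can index from zero here. */
	for (nr = 0; addr != end; addr += PAGE_SIZE)
		pages[nr++] = page++;
	return nr;
}

with call sites passing pages + *nr directly.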

...

+static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
+{
+   *nr += nr_recorded_pages;
+   SetPageReferenced(head);
+}


I don't find this last helper very useful. It seems to muddy the waters more
than necessary...


Yes, I suspect it's rather unloved, and the fact that it was hard to accurately
name should have been a big hint to not do it. I'll remove the helper and
put the lines back in directly.


thanks,
--
John Hubbard
NVIDIA



Other than that the cleanup looks nice to me.

Honza



Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Wed, Nov 13, 2019 at 02:00:06PM -0800, Dan Williams wrote:
> On Wed, Nov 13, 2019 at 11:23 AM Dan Williams  
> wrote:
> >
> > On Tue, Nov 12, 2019 at 8:27 PM John Hubbard  wrote:
> > >
> > > An upcoming patch changes and complicates the refcounting and
> > > especially the "put page" aspects of it. In order to keep
> > > everything clean, refactor the devmap page release routines:
> > >
> > > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> > >   and limit the functionality to "read only": return a bool,
> > >   with no side effects.
> > >
> > > * Add a new routine, put_devmap_managed_page(), to handle checking
> > >   what kind of page it is, and what kind of refcount handling it
> > >   requires.
> > >
> > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> > >   and limit the functionality to unconditionally freeing a devmap
> > >   page.
> > >
> > > This is originally based on a separate patch by Ira Weiny, which
> > > applied to an early version of the put_user_page() experiments.
> > > Since then, Jérôme Glisse suggested the refactoring described above.
> > >
> > > Suggested-by: Jérôme Glisse 
> > > Signed-off-by: Ira Weiny 
> > > Signed-off-by: John Hubbard 
> > > ---
> > >  include/linux/mm.h | 27 ---
> > >  mm/memremap.c  | 67 --
> > >  2 files changed, 53 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index a2adf95b3f9c..96228376139c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct 
> > > page *page)
> > >  #endif
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page);
> > > +void free_devmap_managed_page(struct page *page);
> > >  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > > -static inline bool put_devmap_managed_page(struct page *page)
> > > +
> > > +static inline bool page_is_devmap_managed(struct page *page)
> > >  {
> > > if (!static_branch_unlikely(&devmap_managed_key))
> > > return false;
> > > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > switch (page->pgmap->type) {
> > > case MEMORY_DEVICE_PRIVATE:
> > > case MEMORY_DEVICE_FS_DAX:
> > > -   __put_devmap_managed_page(page);
> > > return true;
> > > default:
> > > break;
> > > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > return false;
> > >  }
> > >
> > > +static inline bool put_devmap_managed_page(struct page *page)
> > > +{
> > > +   bool is_devmap = page_is_devmap_managed(page);
> > > +
> > > +   if (is_devmap) {
> > > +   int count = page_ref_dec_return(page);
> > > +
> > > +   /*
> > > +* devmap page refcounts are 1-based, rather than 
> > > 0-based: if
> > > +* refcount is 1, then the page is free and the refcount 
> > > is
> > > +* stable because nobody holds a reference on the page.
> > > +*/
> > > +   if (count == 1)
> > > +   free_devmap_managed_page(page);
> > > +   else if (!count)
> > > +   __put_page(page);
> > > +   }
> > > +
> > > +   return is_devmap;
> > > +}
> > > +
> > >  #else /* CONFIG_DEV_PAGEMAP_OPS */
> > >  static inline bool put_devmap_managed_page(struct page *page)
> > >  {
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03ccbdfeb697..bc7e2a27d025 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long 
> > > pfn,
> > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page)
> > > +void free_devmap_managed_page(struct page *page)
> > >  {
> > > -   int count = page_ref_dec_return(page);
> > > +   /* Clear Active bit in case of parallel mark_page_accessed */
> > > +   __ClearPageActive(page);
> > > +   __ClearPageWaiters(page);
> > > +
> > > +   mem_cgroup_uncharge(page);
> >
> > Ugh, when did all this HMM specific manipulation sneak into the
> > generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> > own put_zone_device_private_page(). For example it's certainly
> > unnecessary and might be broken (would need to check) to call
> > mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> > monolith and the HMM use case leaks pages into code paths that DAX
> > explicitly avoids.
> 
> It's been this way for a while and I did not react previously,
> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> mem_cgroup_uncharge, belong behind a device-private conditional. The
> history here is:
> 
> Move some, 

Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread John Hubbard

On 11/13/19 2:55 PM, Dan Williams wrote:

On Wed, Nov 13, 2019 at 2:49 PM John Hubbard  wrote:


On 11/13/19 2:00 PM, Dan Williams wrote:
...

Ugh, when did all this HMM specific manipulation sneak into the
generic ZONE_DEVICE path? It used to be gated by pgmap type with its
own put_zone_device_private_page(). For example it's certainly
unnecessary and might be broken (would need to check) to call
mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
monolith and the HMM use case leaks pages into code paths that DAX
explicitly avoids.


It's been this way for a while and I did not react previously,
apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
mem_cgroup_uncharge, belong behind a device-private conditional. The
history here is:

Move some, but not all HMM specifics to hmm_devmem_free():
  2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put

Remove the clearing of mapping since no upstream consumers needed it:
  b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free

Add it back in once an upstream consumer arrived:
  7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse

We're now almost entirely free of ->page_free callbacks except for
that weird nouveau case, can that FIXME in nouveau_dmem_page_free()
also result in killing the ->page_free() callback altogether? In the
meantime I'm proposing a cleanup like this:



OK, assuming this is acceptable (no obvious problems jump out at me,
and we can also test it with HMM), then how would you like to proceed, as
far as patches go: add such a patch as part of this series here, or as a
stand-alone patch either before or after this series? Or something else?
And did you plan on sending it out as such?


I think it makes sense to include it in your series since you're
looking to refactor the implementation. I can send you one based on
current linux-next as a lead-in cleanup before the refactor. Does that
work for you?



That would be perfect!



Also, the diffs didn't quite make it through intact to my "git apply", so
I'm re-posting the diff in hopes that this time it survives:


Apologies for that. For quick "how about this" patch examples, I just
copy and paste into gmail and it sometimes clobbers it.



No problem at all, I do the same thing and *usually* it works. ha. And
as you say, it's good enough for discussions.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Dan Williams
On Wed, Nov 13, 2019 at 2:49 PM John Hubbard  wrote:
>
> On 11/13/19 2:00 PM, Dan Williams wrote:
> ...
> >> Ugh, when did all this HMM specific manipulation sneak into the
> >> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> >> own put_zone_device_private_page(). For example it's certainly
> >> unnecessary and might be broken (would need to check) to call
> >> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> >> monolith and the HMM use case leaks pages into code paths that DAX
> >> explicitly avoids.
> >
> > It's been this way for a while and I did not react previously,
> > apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> > mem_cgroup_uncharge, belong behind a device-private conditional. The
> > history here is:
> >
> > Move some, but not all HMM specifics to hmm_devmem_free():
> >  2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put
> >
> > Remove the clearing of mapping since no upstream consumers needed it:
> >  b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free
> >
> > Add it back in once an upstream consumer arrived:
> >  7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse
> >
> > We're now almost entirely free of ->page_free callbacks except for
> > that weird nouveau case, can that FIXME in nouveau_dmem_page_free()
> > also result in killing the ->page_free() callback altogether? In the
> > meantime I'm proposing a cleanup like this:
>
>
> OK, assuming this is acceptable (no obvious problems jump out at me,
> and we can also test it with HMM), then how would you like to proceed, as
> far as patches go: add such a patch as part of this series here, or as a
> stand-alone patch either before or after this series? Or something else?
> And did you plan on sending it out as such?

I think it makes sense to include it in your series since you're
looking to refactor the implementation. I can send you one based on
current linux-next as a lead-in cleanup before the refactor. Does that
work for you?

>
> Also, the diffs didn't quite make it through intact to my "git apply", so
> I'm re-posting the diff in hopes that this time it survives:

Apologies for that. For quick "how about this" patch examples, I just
copy and paste into gmail and it sometimes clobbers it.

>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index f9f76f6ba07b..21db1ce8c0ae 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -338,13 +338,7 @@ static void pmem_release_disk(void *__pmem)
> put_disk(pmem->disk);
>   }
>
> -static void pmem_pagemap_page_free(struct page *page)
> -{
> -   wake_up_var(&page->_refcount);
> -}
> -
>   static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> -   .page_free  = pmem_pagemap_page_free,
> .kill   = pmem_pagemap_kill,
> .cleanup= pmem_pagemap_cleanup,
>   };
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..157edb8f7cf8 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
>  * holds a reference on the page.
>  */
> if (count == 1) {
> -   /* Clear Active bit in case of parallel mark_page_accessed */
> -   __ClearPageActive(page);
> -   __ClearPageWaiters(page);
> -
> -   mem_cgroup_uncharge(page);
> -
> /*
>  * When a device_private page is freed, the page->mapping 
> field
>  * may still contain a (stale) mapping value. For example, the
> @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
>  * handled differently or not done at all, so there is no need
>  * to clear page->mapping.
>  */
> -   if (is_device_private_page(page))
> -   page->mapping = NULL;
> +   if (is_device_private_page(page)) {
> +   /* Clear Active bit in case of parallel 
> mark_page_accessed */
> +   __ClearPageActive(page);
> +   __ClearPageWaiters(page);
>
> -   page->pgmap->ops->page_free(page);
> +   mem_cgroup_uncharge(page);
> +
> +   page->mapping = NULL;
> +   page->pgmap->ops->page_free(page);
> +   } else
> +   wake_up_var(&page->_refcount);
> } else if (!count)
> __put_page(page);
>   }
> --
> 2.24.0
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index ad8e4df1282b..4eae441f86c9 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
> >  put_disk(pmem->disk);
> >   }
> >
> > -static void pmem_pagemap_page_free(struct page *page)
> > -{
> > -

Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread John Hubbard

On 11/13/19 2:00 PM, Dan Williams wrote:
...

Ugh, when did all this HMM specific manipulation sneak into the
generic ZONE_DEVICE path? It used to be gated by pgmap type with its
own put_zone_device_private_page(). For example it's certainly
unnecessary and might be broken (would need to check) to call
mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
monolith and the HMM use case leaks pages into code paths that DAX
explicitly avoids.


It's been this way for a while and I did not react previously,
apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
mem_cgroup_uncharge, belong behind a device-private conditional. The
history here is:

Move some, but not all HMM specifics to hmm_devmem_free():
 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put

Remove the clearing of mapping since no upstream consumers needed it:
 b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free

Add it back in once an upstream consumer arrived:
 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse

We're now almost entirely free of ->page_free callbacks except for
that weird nouveau case, can that FIXME in nouveau_dmem_page_free()
also result in killing the ->page_free() callback altogether? In the
meantime I'm proposing a cleanup like this:



OK, assuming this is acceptable (no obvious problems jump out at me,
and we can also test it with HMM), then how would you like to proceed, as
far as patches go: add such a patch as part of this series here, or as a
stand-alone patch either before or after this series? Or something else?
And did you plan on sending it out as such?

Also, the diffs didn't quite make it through intact to my "git apply", so
I'm re-posting the diff in hopes that this time it survives:

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f9f76f6ba07b..21db1ce8c0ae 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -338,13 +338,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)

-{
-   wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..157edb8f7cf8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
 * holds a reference on the page.
 */
if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
-
-   mem_cgroup_uncharge(page);
-
/*
 * When a device_private page is freed, the page->mapping field
 * may still contain a (stale) mapping value. For example, the
@@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
 * handled differently or not done at all, so there is no need
 * to clear page->mapping.
 */
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   if (is_device_private_page(page)) {
+   /* Clear Active bit in case of parallel 
mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
 
-		page->pgmap->ops->page_free(page);

+   mem_cgroup_uncharge(page);
+
+   page->mapping = NULL;
+   page->pgmap->ops->page_free(page);
+   } else
+   wake_up_var(&page->_refcount);
} else if (!count)
__put_page(page);
 }
--
2.24.0


thanks,
--
John Hubbard
NVIDIA



diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
 put_disk(pmem->disk);
  }

-static void pmem_pagemap_page_free(struct page *page)
-{
-   wake_up_var(&page->_refcount);
-}
-
  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
 .kill   = pmem_pagemap_kill,
 .cleanup= pmem_pagemap_cleanup,
  };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..157edb8f7cf8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
  * holds a reference on the page.
  */
 if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
-
- 

Re: [PATCH 12/23] y2038: syscalls: change remaining timeval to __kernel_old_timeval

2019-11-13 Thread Rafael J. Wysocki
On Friday, November 8, 2019 10:12:11 PM CET Arnd Bergmann wrote:
> All of the remaining syscalls that pass a timeval (gettimeofday, utime,
> futimesat) can trivially be changed to pass a __kernel_old_timeval
> instead, which has a compatible layout, but avoids ambiguity with
> the timeval type in user space.
> 
> Signed-off-by: Arnd Bergmann 

For the change in power/power.h

Acked-by: Rafael J. Wysocki 
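
(For reference, the replacement type keeps the classic layout; from the
uapi headers, roughly:)

struct __kernel_old_timeval {
	__kernel_long_t	tv_sec;		/* seconds */
	__kernel_long_t	tv_usec;	/* microseconds */
};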

> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  3 ++-
>  arch/powerpc/kernel/syscalls.c|  4 ++--
>  fs/select.c   | 10 +-
>  fs/utimes.c   |  8 
>  include/linux/syscalls.h  | 10 +-
>  kernel/power/power.h  |  2 +-
>  kernel/time/time.c|  2 +-
>  7 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
> b/arch/powerpc/include/asm/asm-prototypes.h
> index 8561498e653c..2c25dc079cb9 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -92,7 +92,8 @@ long sys_swapcontext(struct ucontext __user *old_ctx,
>  long sys_debug_setcontext(struct ucontext __user *ctx,
> int ndbg, struct sig_dbg_op __user *dbg);
>  int
> -ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user 
> *exp, struct timeval __user *tvp);
> +ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user 
> *exp,
> +struct __kernel_old_timeval __user *tvp);
>  unsigned long __init early_init(unsigned long dt_ptr);
>  void __init machine_init(u64 dt_ptr);
>  #endif
> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
> index 3bfb3888e897..078608ec2e92 100644
> --- a/arch/powerpc/kernel/syscalls.c
> +++ b/arch/powerpc/kernel/syscalls.c
> @@ -79,7 +79,7 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
>   * sys_select() with the appropriate args. -- Cort
>   */
>  int
> -ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user 
> *exp, struct timeval __user *tvp)
> +ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user 
> *exp, struct __kernel_old_timeval __user *tvp)
>  {
>   if ( (unsigned long)n >= 4096 )
>   {
> @@ -89,7 +89,7 @@ ppc_select(int n, fd_set __user *inp, fd_set __user *outp, 
> fd_set __user *exp, s
>   || __get_user(inp, ((fd_set __user * __user *)(buffer+1)))
>   || __get_user(outp, ((fd_set  __user * __user *)(buffer+2)))
>   || __get_user(exp, ((fd_set  __user * __user *)(buffer+3)))
> - || __get_user(tvp, ((struct timeval  __user * __user 
> *)(buffer+4
> + || __get_user(tvp, ((struct __kernel_old_timeval  __user * 
> __user *)(buffer+4
>   return -EFAULT;
>   }
>   return sys_select(n, inp, outp, exp, tvp);
> diff --git a/fs/select.c b/fs/select.c
> index 53a0c149f528..11d0285d46b7 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -321,7 +321,7 @@ static int poll_select_finish(struct timespec64 *end_time,
>   switch (pt_type) {
>   case PT_TIMEVAL:
>   {
> - struct timeval rtv;
> + struct __kernel_old_timeval rtv;
>  
>   if (sizeof(rtv) > sizeof(rtv.tv_sec) + 
> sizeof(rtv.tv_usec))
>   memset(&rtv, 0, sizeof(rtv));
> @@ -698,10 +698,10 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>  }
>  
>  static int kern_select(int n, fd_set __user *inp, fd_set __user *outp,
> -fd_set __user *exp, struct timeval __user *tvp)
> +fd_set __user *exp, struct __kernel_old_timeval __user 
> *tvp)
>  {
>   struct timespec64 end_time, *to = NULL;
> - struct timeval tv;
> + struct __kernel_old_timeval tv;
>   int ret;
>  
>   if (tvp) {
> @@ -720,7 +720,7 @@ static int kern_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>  }
>  
>  SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
> - fd_set __user *, exp, struct timeval __user *, tvp)
> + fd_set __user *, exp, struct __kernel_old_timeval __user *, tvp)
>  {
>   return kern_select(n, inp, outp, exp, tvp);
>  }
> @@ -810,7 +810,7 @@ SYSCALL_DEFINE6(pselect6_time32, int, n, fd_set __user *, 
> inp, fd_set __user *,
>  struct sel_arg_struct {
>   unsigned long n;
>   fd_set __user *inp, *outp, *exp;
> - struct timeval __user *tvp;
> + struct __kernel_old_timeval __user *tvp;
>  };
>  
>  SYSCALL_DEFINE1(old_select, struct sel_arg_struct __user *, arg)
> diff --git a/fs/utimes.c b/fs/utimes.c
> index 1ba3f7883870..c952b6b3d8a0 100644
> --- a/fs/utimes.c
> +++ b/fs/utimes.c
> @@ -161,9 +161,9 @@ SYSCALL_DEFINE4(utimensat, int, dfd, const char __user *, 
> filename,
>   * utimensat() instead.
>   */
>  stat

linux-next: build warning after merge of the powerpc tree

2019-11-13 Thread Stephen Rothwell
Hi all,

After merging the powerpc tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

security/integrity/platform_certs/load_uefi.c:17:19: warning: 'efi_cert_sha256_guid' defined but not used [-Wunused-variable]
   17 | static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID;
      |                   ^~~~~~~~~~~~~~~~~~~~
security/integrity/platform_certs/load_uefi.c:15:19: warning: 'efi_cert_x509_sha256_guid' defined but not used [-Wunused-variable]
   15 | static efi_guid_t efi_cert_x509_sha256_guid __initdata =
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~
security/integrity/platform_certs/load_uefi.c:14:19: warning: 'efi_cert_x509_guid' defined but not used [-Wunused-variable]
   14 | static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID;
      |                   ^~~~~~~~~~~~~~~~~~

Introduced by commit

  ad723674d675 ("x86/efi: move common keyring handler functions to new file")

-- 
Cheers,
Stephen Rothwell
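
Warnings like these mean the definitions are no longer referenced from
this file in an x86_64 allmodconfig build. As a generic illustration
only, and not necessarily the fix that was applied for this report,
marking a definition __maybe_unused silences -Wunused-variable:

	static efi_guid_t efi_cert_x509_guid __initdata __maybe_unused =
							EFI_CERT_X509_GUID;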




Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Dan Williams
On Wed, Nov 13, 2019 at 11:23 AM Dan Williams  wrote:
>
> On Tue, Nov 12, 2019 at 8:27 PM John Hubbard  wrote:
> >
> > An upcoming patch changes and complicates the refcounting and
> > especially the "put page" aspects of it. In order to keep
> > everything clean, refactor the devmap page release routines:
> >
> > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> >   and limit the functionality to "read only": return a bool,
> >   with no side effects.
> >
> > * Add a new routine, put_devmap_managed_page(), to handle checking
> >   what kind of page it is, and what kind of refcount handling it
> >   requires.
> >
> > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> >   and limit the functionality to unconditionally freeing a devmap
> >   page.
> >
> > This is originally based on a separate patch by Ira Weiny, which
> > applied to an early version of the put_user_page() experiments.
> > Since then, Jérôme Glisse suggested the refactoring described above.
> >
> > Suggested-by: Jérôme Glisse 
> > Signed-off-by: Ira Weiny 
> > Signed-off-by: John Hubbard 
> > ---
> >  include/linux/mm.h | 27 ---
> >  mm/memremap.c  | 67 --
> >  2 files changed, 53 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a2adf95b3f9c..96228376139c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct 
> > page *page)
> >  #endif
> >
> >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > -void __put_devmap_managed_page(struct page *page);
> > +void free_devmap_managed_page(struct page *page);
> >  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > -static inline bool put_devmap_managed_page(struct page *page)
> > +
> > +static inline bool page_is_devmap_managed(struct page *page)
> >  {
> > if (!static_branch_unlikely(&devmap_managed_key))
> > return false;
> > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> > *page)
> > switch (page->pgmap->type) {
> > case MEMORY_DEVICE_PRIVATE:
> > case MEMORY_DEVICE_FS_DAX:
> > -   __put_devmap_managed_page(page);
> > return true;
> > default:
> > break;
> > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> > *page)
> > return false;
> >  }
> >
> > +static inline bool put_devmap_managed_page(struct page *page)
> > +{
> > +   bool is_devmap = page_is_devmap_managed(page);
> > +
> > +   if (is_devmap) {
> > +   int count = page_ref_dec_return(page);
> > +
> > +   /*
> > +* devmap page refcounts are 1-based, rather than 0-based: 
> > if
> > +* refcount is 1, then the page is free and the refcount is
> > +* stable because nobody holds a reference on the page.
> > +*/
> > +   if (count == 1)
> > +   free_devmap_managed_page(page);
> > +   else if (!count)
> > +   __put_page(page);
> > +   }
> > +
> > +   return is_devmap;
> > +}
> > +
> >  #else /* CONFIG_DEV_PAGEMAP_OPS */
> >  static inline bool put_devmap_managed_page(struct page *page)
> >  {
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 03ccbdfeb697..bc7e2a27d025 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> >
> >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > -void __put_devmap_managed_page(struct page *page)
> > +void free_devmap_managed_page(struct page *page)
> >  {
> > -   int count = page_ref_dec_return(page);
> > +   /* Clear Active bit in case of parallel mark_page_accessed */
> > +   __ClearPageActive(page);
> > +   __ClearPageWaiters(page);
> > +
> > +   mem_cgroup_uncharge(page);
>
> Ugh, when did all this HMM specific manipulation sneak into the
> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> own put_zone_device_private_page(). For example it's certainly
> unnecessary and might be broken (would need to check) to call
> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> monolith and the HMM use case leaks pages into code paths that DAX
> explicitly avoids.

It's been this way for a while and I did not react previously,
apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
mem_cgroup_uncharge belong behind a device-private conditional. The
history here is:

Move some, but not all HMM specifics to hmm_devmem_free():
2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put

Remove the clearing of mapping since no upstream consumers needed it:
b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free

Add it back in once an upstream consumer arrived:
7ab0ad0e74f8 mm

Re: [PATCH 03/23] y2038: vdso: change timeval to __kernel_old_timeval

2019-11-13 Thread Thomas Gleixner
On Fri, 8 Nov 2019, Arnd Bergmann wrote:

> The gettimeofday() function in vdso uses the traditional 'timeval'
> structure layout, which will be incompatible with future versions of
> glibc on 32-bit architectures that use a 64-bit time_t.
> 
> This interface is problematic for y2038, when time_t overflows on 32-bit
> architectures, but the plan so far is that a libc with 64-bit time_t
> will not call into the gettimeofday() vdso helper at all, and only
> have a method for entering clock_gettime().  This means we don't have
> to fix it here, though we probably want to add a new clock_gettime()
> entry point using a 64-bit version of 'struct timespec' at some point.
> 
> Changing the vdso code to use __kernel_old_timeval helps isolate
> this usage from the other ones that still need to be fixed properly,
> and it gets us closer to removing the 'timeval' definition from the
> kernel sources.
> 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Thomas Gleixner 
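
For reference, the layout being relied on is the plain two-field
structure from the kernel UAPI headers (quoted from memory; check
include/uapi/linux/time_types.h):

	struct __kernel_old_timeval {
		__kernel_long_t tv_sec;		/* seconds */
		__kernel_long_t tv_usec;	/* microseconds */
	};

It matches the traditional 'struct timeval' layout on each ABI while
avoiding the user-space type name.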


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Ram Pai
On Thu, Nov 14, 2019 at 08:18:24AM +1100, Paul Mackerras wrote:
> On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> > On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > > There is a subtle problem with removing that code from the assembly.
> > > > > > 
> > > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > > clearing
> > > > > > kvm->arch.secure_guest, the hypervisor will continue to think that 
> > > > > > the
> > > > > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > > > > hcall was invoked, was to inform the Hypervisor that it should no 
> > > > > > longer
> > > > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > > > 
> > > > > Most of the checks that look at whether a VM is a secure VM use code
> > > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > > > most places we will treat the VM as a normal VM from then on.  If
> > > > > there are any places where we still need to treat the VM as a secure
> > > > > VM while we are processing the abort we can easily do that too.
> > > > 
> > > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> > > > to the Ultravisor?   Because removing that assembly code will NOT lead 
> > > > the
> > > 
> > > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> > 
> > In the fast_guest_return path, if it finds 
> > (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it 
> > return to
> > UV or not?
> > 
> > Ideally it should return back to the ultravisor the first time
> > KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.
> 
> What is ideal about that behavior?  Why would that be a particularly
> good thing to do?

It is following the rule -- "return back to the caller".

RP

-- 
Ram Pai
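
A minimal sketch of the flag arithmetic behind the discussion above. The
ABORT value is confirmed in the thread (4); the START/DONE values are
assumed here for illustration:

	#define KVMPPC_SECURE_INIT_START	0x1	/* assumed */
	#define KVMPPC_SECURE_INIT_DONE		0x2	/* assumed */
	#define KVMPPC_SECURE_INIT_ABORT	0x4	/* per the thread */

	/* after kvmppc_h_svm_init_abort(): */
	kvm->arch.secure_guest = KVMPPC_SECURE_INIT_ABORT;

	/* ...so the common check takes the false branch, and most code
	 * treats the VM as a normal VM from then on: */
	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
		/* not reached */;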



Re: [PATCH 00/23] y2038 cleanups

2019-11-13 Thread Arnd Bergmann
On Fri, Nov 8, 2019 at 10:04 PM Arnd Bergmann  wrote:
>
> This is a series of cleanups for the y2038 work, mostly intended
> for namespace cleaning: the kernel defines the traditional
> time_t, timeval and timespec types that often lead to y2038-unsafe
> code. Even though the unsafe usage is mostly gone from the kernel,
> having the types and associated functions around means that we
> can still grow new users, and that we may be missing conversions
> to safe types that actually matter.
>
> As there is no rush on any of these patches, I would either
> queue them up in linux-next through my y2038 branch, or
> Thomas could add them to the tip tree if he wants.
>
> As mentioned in another series, this is part of a larger
> effort to fix all the remaining bits and pieces that are
> not completed yet from the y2038 conversion, and the full
> set can be found at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
>
> Maintainers, please review and provide Acks.
>
> Let me know if you have any opinion on whether we should
> include the last two patches of this series or not.
>
>  Arnd
>
> Arnd Bergmann (23):
>   y2038: remove CONFIG_64BIT_TIME
>   y2038: add __kernel_old_timespec and __kernel_old_time_t
>   y2038: vdso: change timeval to __kernel_old_timeval
>   y2038: vdso: change timespec to __kernel_old_timespec
>   y2038: vdso: change time_t to __kernel_old_time_t
>   y2038: vdso: nds32: open-code timespec_add_ns()
>   y2038: vdso: powerpc: avoid timespec references
>   y2038: ipc: remove __kernel_time_t reference from headers
>   y2038: stat: avoid 'time_t' in 'struct stat'
>   y2038: uapi: change __kernel_time_t to __kernel_old_time_t
>   y2038: rusage: use __kernel_old_timeval
>   y2038: syscalls: change remaining timeval to __kernel_old_timeval
>   y2038: socket: remove timespec reference in timestamping
>   y2038: make ns_to_compat_timeval use __kernel_old_timeval
>   y2038: elfcore: Use __kernel_old_timeval for process times
>   y2038: timerfd: Use timespec64 internally
>   y2038: time: avoid timespec usage in settimeofday()
>   y2038: itimer: compat handling to itimer.c
>   y2038: use compat_{get,set}_itimer on alpha
>   y2038: move itimer reset into itimer.c
>   y2038: itimer: change implementation to timespec64
>   [RFC] y2038: itimer: use ktime_t internally
>   y2038: allow disabling time32 system calls

I've dropped the "[RFC] y2038: itimer: use ktime_t internally" patch
for the moment,
and added two other patches from other series:

y2038: remove CONFIG_64BIT_TIME
y2038: socket: use __kernel_old_timespec instead of timespec

Tentatively pushed out the patches with the Acks I have received so
far to my y2038 branch on git.kernel.org so it gets included in linux-next.

If I hear no complaints, I'll send a pull request for the merge window,
along with the compat-ioctl series I have already queued up in the same
branch.

Arnd


Re: [PATCH V3 2/2] ASoC: fsl_asrc: Add support for imx8qm

2019-11-13 Thread Nicolin Chen
On Mon, Nov 11, 2019 at 05:18:23PM +0800, Shengjiu Wang wrote:
> There are two asrc modules in imx8qm; each module has a different
> clock configuration, and the DMA type is EDMA.
> 
> So in this patch, we define the new clocks, refine the clock map,
> and include struct fsl_asrc_soc_data for different soc usage.
> 
> The EDMA channel is fixed for each dma request, with one dma request
> corresponding to one dma channel. So we need to request the dma
> channel with the dma request of the asrc module.
> 
> Signed-off-by: Shengjiu Wang 

Two small comments inline. Once they are addressed,

Acked-by: Nicolin Chen 

> ---
> changes in v2
> - use !use_edma to wrap code in fsl_asrc_dma
> - add Acked-by: Nicolin Chen
> 
> changes in v3
> - remove the acked-by for commit is updated
> - read "fsl,asrc-clk-map" property, and update table
>   clk_map_imx8qm.
> 
>  sound/soc/fsl/fsl_asrc.c | 112 ---
>  sound/soc/fsl/fsl_asrc.h |  64 +++-
>  sound/soc/fsl/fsl_asrc_dma.c |  42 +
>  3 files changed, 183 insertions(+), 35 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
> index a3cfceea7d2f..03de33de8633 100644
> --- a/sound/soc/fsl/fsl_asrc.c
> +++ b/sound/soc/fsl/fsl_asrc.c
> @@ -964,14 +1001,33 @@ static int fsl_asrc_probe(struct platform_device *pdev)

> + } else if (of_device_is_compatible(np, "fsl,imx8qm-asrc")) {

> + ret = of_property_read_u32(np, "fsl,asrc-clk-map",
> +&map_idx);

This seems to fit on a single line?

> diff --git a/sound/soc/fsl/fsl_asrc_dma.c b/sound/soc/fsl/fsl_asrc_dma.c
> index d6146de9acd2..f871fdb9d1c6 100644
> --- a/sound/soc/fsl/fsl_asrc_dma.c
> +++ b/sound/soc/fsl/fsl_asrc_dma.c
> @@ -197,21 +197,37 @@ static int fsl_asrc_dma_hw_params(struct 
> snd_soc_component *component,

> + /*
> +  * For EDMA DEV_TO_DEV channel, we don't need to configure
> +  * dma_request and dma_request2, we can get dma channel through
> +  * dma_request_slave_channel directly.
> +  * Compare with SDMA channel, EDMA channel is bound with dma
> +  * request event of each peripheral, and it is fixed. Not like SDMA,
> +  * the channel is allocated dynamically. So when DMA is EDMA, we
> +  * can only get EDMA channel through dma-names of Front-End device.
> +  */

Just trying to make it concise :)

+   /*
+* An EDMA DEV_TO_DEV channel is fixed and bound with DMA event of each
+* peripheral, unlike SDMA channel that is allocated dynamically. So no
+* need to configure dma_request and dma_request2, but get dma_chan via
+* dma_request_slave_channel directly with dma name of Front-End device
+*/
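
For context, the Front-End lookup that comment describes boils down to a
single call; a rough sketch, with the device and channel name assumed:

	/* EDMA: the channel is fixed per dma request, so simply look it
	 * up by dma-names on the Front-End device ("rx" as an example) */
	tmp_chan = dma_request_slave_channel(dev, "rx");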


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Robin Murphy

On 2019-11-13 8:41 pm, Florian Fainelli wrote:

On 11/13/19 12:34 PM, Robin Murphy wrote:

On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit is
meant to contain the highest accessible DMA address.


Neat, you win a "why didn't I do it that way in the first place?" :)

Looking at it without all the history of previous attempts, this looks
entirely reasonable, and definitely a step in the right direction.


And while you are changing those, would it make sense to not only rename
the structure member but introduce a getter and setter in order to ease
future work where this would no longer be a scalar?


I doubt it - once we get as far as supporting multiple DMA ranges, 
there will be a whole load of infrastructure churn anyway if only to 
replace dma_pfn_offset, and I'm not sure a simple get/set paradigm would 
even be viable, so it's probably better to save that until clearly 
necessary.


Robin.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Paul Mackerras
On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > There is a subtle problem with removing that code from the assembly.
> > > > > 
> > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > clearing
> > > > > kvm->arch.secure_guest, the hypervisor will continue to think that the
> > > > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > > > hcall was invoked, was to inform the Hypervisor that it should no 
> > > > > longer
> > > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > > 
> > > > Most of the checks that look at whether a VM is a secure VM use code
> > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > > most places we will treat the VM as a normal VM from then on.  If
> > > > there are any places where we still need to treat the VM as a secure
> > > > VM while we are processing the abort we can easily do that too.
> > > 
> > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> > > to the Ultravisor?   Because removing that assembly code will NOT lead the
> > 
> > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> 
> In the fast_guest_return path, if it finds 
> (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it return 
> to
> UV or not?
> 
> Ideally it should return back to the ultravisor the first time
> KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.

What is ideal about that behavior?  Why would that be a particularly
good thing to do?

Paul.


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Florian Fainelli
On 11/13/19 12:34 PM, Robin Murphy wrote:
> On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote:
>> Using a mask to represent bus DMA constraints has a set of limitations.
>> The biggest one being it can only hold a power of two (minus one). The
>> DMA mapping code is already aware of this and treats dev->bus_dma_mask
>> as a limit. This quirk is already used by some architectures although
>> still rare.
>>
>> With the introduction of the Raspberry Pi 4 we've found a new contender
>> for the use of bus DMA limits, as its PCIe bus can only address the
>> lower 3GB of memory (of a total of 4GB). This is impossible to represent
>> with a mask. To make things worse the device-tree code rounds non power
>> of two bus DMA limits to the next power of two, which is unacceptable in
>> this case.
>>
>> In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
>> over the tree and treat it as such. Note that dev->bus_dma_limit is
>> meant to contain the highest accessible DMA address.
> 
> Neat, you win a "why didn't I do it that way in the first place?" :)
> 
> Looking at it without all the history of previous attempts, this looks
> entirely reasonable, and definitely a step in the right direction.

And while you are changing those, would it make sense to not only rename
the structure member but introduce a getter and setter in order to ease
future work where this would no longer be a scalar?
-- 
Florian


Re: [PATCH] powerpc/powernv: Disable native PCIe port management

2019-11-13 Thread Tyrel Datwyler
Nothing but pedantic spelling and grammar nits of the commit log follow.

-Tyrel

On 11/13/19 1:40 AM, Oliver O'Halloran wrote:
> On PowerNV the PCIe topology is (currently) managed the powernv platform

s/the powernv/by the powernv

> code in cooperation with firmware. The PCIe-native service drivers bypass
> both and this can cause problems.
> 
> Historically this hasn't been a big deal since the only port service
> driver that saw much use was the AER driver. The AER driver relies
> a kernel service to report when errors occur rather than acting autonmously

s/a kernel/on a kernel
autonomously

> so it's fairly easy to ignore. On PowerNV (and pseries) AER events are
> handled through EEH, which ignores the AER service, so it's never been
> an issue.
> 
> Unfortunately, the hotplug port service driver (pciehp) does act
> autonomously and conflicts with the platform specific hotplug
> driver (pnv_php). The main issue is that pciehp claims the interrupt
> associated with the PCIe capability which in turn prevents pnv_php from
> claiming it.
> 
> This results in hotplug events being handled by pciehp which does not
> notify firmware when the PCIe topology changes, and does not setup/teardown
> the arch specific PCI device structures (pci_dn) when the topology changes.
> The end result is that hot-added devices cannot be enabled and hot-removed
> devices may not be fully torn-down on removal.
> 
> We can fix these problems by setting the "pcie_ports_disabled" flag during
> platform initialisation. The flag indicates the platform owns the PCIe
> ports which stops the portbus driver being registered.

s/being/from being

> 
> Cc: Sergey Miroshnichenko 
> Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver")
> Signed-off-by: Oliver O'Halloran 
> ---
> Sergey, just FYI. I'll try to sort out the rest of the hotplug
> trainwreck in 5.6.
> 
> The Fixes: here is for the patch that added pnv_php in 4.8. It's been
> a problem since then, but wasn't noticed until people started testing
> it after the EEH fixes in commit 799abe283e51 ("powerpc/eeh: Clean up
> EEH PEs after recovery finishes") went in earlier in the 5.4 cycle.
> ---
>  arch/powerpc/platforms/powernv/pci.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index 2825d00..ae62583 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -941,6 +941,23 @@ void __init pnv_pci_init(void)
> 
>   pci_add_flags(PCI_CAN_SKIP_ISA_ALIGN);
> 
> +#ifdef CONFIG_PCIEPORTBUS
> + /*
> +  * On PowerNV PCIe devices are (currently) managed in cooperation
> +  * with firmware. This isn't *strictly* required, but there's enough
> +  * assumptions baked into both firmware and the platform code that
> +  * it's unwise to allow the portbus services to be used.
> +  *
> +  * We need to fix this eventually, but for now set this flag to disable
> +  * the portbus driver. The AER service isn't required since AER
> +  * events are handled via EEH. The pciehp hotplug driver can't work
> +  * without kernel changes (and portbus binding breaks pnv_php). The
> +  * other services also require some thinking about how we're going
> +  * to integrate them.
> +  */
> + pcie_ports_disabled = true;
> +#endif
> +
>   /* If we don't have OPAL, eg. in sim, just skip PCI probe */
>   if (!firmware_has_feature(FW_FEATURE_OPAL))
>   return;
> 



Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Robin Murphy

On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit is
meant to contain the highest accessible DMA address.


Neat, you win a "why didn't I do it that way in the first place?" :)

Looking at it without all the history of previous attempts, this looks 
entirely reasonable, and definitely a step in the right direction.


[...]

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 5a7551d060f2..f18827cf96df 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1097,7 +1097,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, 
u64 *dma_size)
 * Limit coherent and dma mask based on size
 * retrieved from firmware.
 */
-   dev->bus_dma_mask = mask;
+   dev->bus_dma_limit = mask;


Although this preserves the existing behaviour, as in of_dma_configure() 
we can do better here since we have the original address range to hand. 
I think it's worth keeping the ACPI and OF paths in sync for minor 
tweaks like this, rather than letting them diverge unnecessarily.


Otherwise, the rest looks OK to me - in principle we could store it as 
an exclusive limit such that we could then streamline the min_not_zero() 
tests to just min(mask, limit - 1), but that's probably too clever for 
its own good.
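
A quick userspace sketch of both points, relying only on the arithmetic
in the thread: why a 3GB limit cannot be a DMA mask, and how an
exclusive end would collapse the min_not_zero() pattern into a plain
min():

	#include <stdint.h>
	#include <stdio.h>

	#define MIN(a, b) ((a) < (b) ? (a) : (b))

	/* inclusive limit, 0 == "no limit": needs the min_not_zero() dance */
	static uint64_t cap_inclusive(uint64_t mask, uint64_t limit)
	{
		return limit ? MIN(mask, limit) : mask;
	}

	/* exclusive end, 0 == "no limit" for free: 0 - 1 wraps to ~0ULL */
	static uint64_t cap_exclusive(uint64_t mask, uint64_t end)
	{
		return MIN(mask, end - 1);
	}

	int main(void)
	{
		uint64_t limit = 0xBFFFFFFFULL;	/* 3GB - 1: the RPi4 PCIe case */

		/* a valid DMA mask must be 2^n - 1, which 0xBFFFFFFF is not */
		printf("valid mask? %s\n",
		       ((limit + 1) & limit) == 0 ? "yes" : "no"); /* "no" */

		/* both forms cap a full 64-bit mask to the 3GB boundary */
		printf("%llx\n", (unsigned long long)cap_inclusive(~0ULL, limit));
		printf("%llx\n", (unsigned long long)cap_exclusive(~0ULL, limit + 1));
		return 0;
	}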


Robin.


dev->coherent_dma_mask = mask;
*dev->dma_mask = mask;
}


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread John Hubbard

On 11/13/19 3:43 AM, Daniel Vetter wrote:
...

Can't we call this unpin_user_page then, for some symmetry? Or is that
even more churn?

Looking from afar the naming here seems really confusing.



That look from afar is valuable, because I'm too close to the problem to see
how the naming looks. :)

unpin_user_page() sounds symmetrical. It's true that it would cause more
churn (which is why I started off with a proposal that avoids changing the
names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
to the change in direction here, and it's really only 10 or 20 lines changed,
in the end.

So I'm open to changing to that naming. It would be nice to hear what others
prefer, too...


FWIW I'd find unpin_user_page() also better than put_user_page() as a
counterpart to pin_user_pages().


One more point from afar on pin/unpin: We use that a lot in graphics for
permanently pinned graphics buffer objects, which really only should be
used for scanout. So at least graphics folks should have an appropriate
mindset and try to make sure we don't overuse this stuff.
-Daniel



OK, Ira also likes "unpin", and so far no one has said *anything* in favor
of the "put_user_page" names, so I think we have a winner! I'll change the
names to unpin_user_page*().


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread John Hubbard

On 11/13/19 10:59 AM, Ira Weiny wrote:

On Tue, Nov 12, 2019 at 08:26:56PM -0800, John Hubbard wrote:

Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

These variants all set FOLL_PIN, which is also introduced, and
thoroughly documented.

The pin_longterm*() variants also set FOLL_LONGTERM, in addition
to FOLL_PIN:

 pin_user_pages()
 pin_user_pages_remote()
 pin_user_pages_fast()

 pin_longterm_pages()
 pin_longterm_pages_remote()
 pin_longterm_pages_fast()


At some point in this conversation I thought we were going to put in "unpin_*"
versions of these.

Is that still in the plans?



Why yes it is! :)  Daniel Vetter and Jan Kara both already weighed in [1],
in favor of "unpin_user_page*()", rather than "put_user_page*()".

I'll change those names.

[1] https://lore.kernel.org/r/20191113101210.gd6...@quack2.suse.cz


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-13 Thread John Hubbard

On 11/13/19 11:17 AM, Ira Weiny wrote:
...

@@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
  
  	down_read(&mm->mmap_sem);

-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(&mm->mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;


Mind the return with the lock held; this needs some goto unwind


Ah yea...  retract my reviewed by...  :-(



ooops, embarrassed that I missed that, good catch. Will repost with it fixed.



thanks,
--
John Hubbard
NVIDIA



Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Dmitry Safonov
Hi Sergey,

On 11/13/19 6:33 AM, Sergey Senozhatsky wrote:
[..]
> Well, here we go. There are a number of generally useful functions that
> print nice data and where people might want to have better loglevel control
> (for debugging purposes). show_stack() is just one of them. Patching all
> those functions, which you have mentioned above, is hardly a fun task to do.
> Hence the printk() per-CPU per-context loglevel proposition. The code there
> is not clever or complicated and is meant for debugging purposes only, but
> with just 3 lines of code one can do some stuff:
> 
>   /* @BTW you can't do this with "%s" KERN_FOO ;P */
> + printk_emergency_enter(LOGLEVEL_SCHED);
> + debug_show_all_locks();
> + printk_emergency_exit();

Not that I want to criticize your proposal more, but just to clarify what
I meant by "clever and complicated":

I don't think semantically there is any difference between:

printk_emergency_enter(LOGLEVEL_SCHED);
printk();
printk_emergency_exit();
vs
printk("%s ...", KERN_SHED, ...);

I was speaking about complexity not of usage, but of the realization
inside printk_emergency_enter(): there will be some business logic that
decides "is it NMI? Use the given loglevel. Is it KERN_ALERT? Don't
downgrade the loglevel..." and other smart details; that is real logic
which one has to think about and later maintain.

There will also be minor issues, like people inserting a print with one
log level and seeing it in dmesg with another, but I presume those
printk_enter() and printk_exit() calls will cover a small amount of code
without much nesting [as long as they are not overused]. It can also be
documented, and people will learn about another feature of printk().

And this year I've seen the presentation "Why printk() is so
complicated?" and I presumed that the approach is to keep things as
simple as possible.

In conclusion:
I see that your proposal also solves the problem [except preemption and
some pr_debug() that shouldn't be printed]. And I think your approach is
better in the short term (we won't have any missed "%s" KERN_ in
linux-next), but in the long term it adds some amount of business logic
to printk and another feature.

Just in case: again, I don't mind, it's up to you, maintainers of
printk. It's also not very fun for me to create those patches.
But they fix console_loglevel issues (I hope we could un-export it in
the end) and I also need it for my other patches, which will produce
warnings with debug loglevel when configured through sysctl.

Thanks,
  Dmitry


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Steven Rostedt
On Wed, 13 Nov 2019 09:47:22 +0100
Petr Mladek  wrote:

> At the moment, I am in favor of this patchset. It is huge and
> needed a lot of manual work. But the result is straightforward and
> easy to understand.

I'm in favor of this patchset too. If there are other areas that need to
adjust the current loglevel (say per task context), then we can cross
that bridge when the need arises. But I don't want to over engineer
this as the stack trace logic should have a way to explicitly state how
important this stack trace really is (or better yet, we should be
removing stack traces that are not important!)

-- Steve


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Steven Rostedt
On Wed, 6 Nov 2019 23:25:13 +
Russell King - ARM Linux admin  wrote:

> On Wed, Nov 06, 2019 at 09:34:40PM +0100, Peter Zijlstra wrote:
> > I suppose I'm surprised there are backtraces that are not important.
> > Either badness happened and it needs printing, or the user asked for it
> > and it needs printing.  
> 
> Or utterly meaningless.
> 
> > Perhaps we should be removing backtraces if they're not important
> > instead of allowing to print them as lower loglevels?  
> 
> Definitely!  WARN_ON() is well overused - and as is typical, used
> without much thought.  Bound to happen after Linus got shirty about
> BUG_ON() being over used.  Everyone just grabbed the next nearest thing
> to assert().
> 
> As a kind of example, I've recently come across one WARN_ON() in a
> driver subsystem (that shall remain nameless at the moment) which very
> likely has multiple different devices on a platform.  The WARN_ON()
> triggers as a result of a problem with the hardware, but because it's a
> WARN_ON(), you've no idea which device has a problem.  The backtrace is
> mostly meaningless.  So you know that a problem has occurred, but the
> kernel prints *useless* backtrace to let you know, and totally omits
> the *useful* information.
> 

I would like to bring up a topic for the next maintainers summit
(although I may not even be there), that we define a clear use of
WARN_ON(). I use it only if the code does something I do not expect it
to do, and is considered a bug in the code if it triggers. But it
appears that some drivers use it for "oh, I didn't realize this hardware
does something I didn't expect". And it is ignored when the WARN_ON() is
triggered and reported, with "you have buggy hardware", even though my
hardware appears to work just fine!

-- Steve
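
To make the quoted complaint concrete, here is the pattern being
criticized versus a report that actually names the failing device (a
generic sketch, not the unnamed driver in question):

	/* backtrace, but no clue which of several devices failed */
	WARN_ON(err);

	/* names the device and skips the meaningless backtrace */
	dev_warn(dev, "hardware error %d\n", err);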


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Steven Rostedt
On Wed, 6 Nov 2019 21:34:40 +0100
Peter Zijlstra  wrote:

> I suppose I'm surprised there are backtraces that are not important.
> Either badness happened and it needs printing, or the user asked for it
> and it needs printing.

Unfortunately that is the case, as my tests will fail if a backtrace is
detected.

> 
> Perhaps we should be removing backtraces if they're not important
> instead of allowing to print them as lower loglevels?

I usually end up removing backtraces for my tests, so I'm for this.
Specifically this happens in the drm and i915 drivers :-p

-- Steve



Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Steven Rostedt
On Tue, 12 Nov 2019 11:17:47 +0900
Sergey Senozhatsky  wrote:

> void show_stack(struct task_struct *task, unsigned long *sp, int log_level)
> {
>   printk_emergency_enter(log_level);
>   __show_stack(task, sp);
>   printk_emergency_exit();
> }
> // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - //
> 
> show_stack() never schedules, disabling preemption around it should
> not change anything. Should it be interrupted, we will handle it via
> preempt count.

Please no! The whole point of the printk rewrite was to allow for
printk to be preemptible and used in more contexts. The show_stack() can
be all over the place and is not a fast function. Let's not disable
preemption for it.

-- Steve


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Steven Rostedt
On Tue, 12 Nov 2019 13:44:47 +0900
Sergey Senozhatsky  wrote:

> > > I do recall that we talked about per-CPU printk state bit which would
> > > start/end "just print it" section. We probably can extend it to "just
> > > log_store" type of functionality. Doesn't look like a very bad idea.  
> > 
> > The problem with per-CPU printk is that we would need to disable
> > interrupts.  
> 
> Or disable preemption and have loglevel per-CPU and per-context.
> preempt_count can navigate us to the right context loglevel on
> particular CPU. I'm talking here only about backtrace (error)
> reporting contexts. Those can be atomic perfectly fine.

With my real-time hat on, I'm totally against disabling of preemption
for this purpose.

-- Steve
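
For the record, the per-CPU, per-context variant Sergey describes would
look roughly like the sketch below; every name in it is hypothetical,
and the preempt_disable()/preempt_enable() pair is exactly what the
real-time objection is about:

	/* one loglevel slot per context: task, softirq, hardirq, NMI */
	enum printk_ctx { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, CTX_MAX };

	static DEFINE_PER_CPU(int, emergency_loglevel[CTX_MAX]);

	static enum printk_ctx printk_current_ctx(void)
	{
		if (in_nmi())
			return CTX_NMI;
		if (in_irq())
			return CTX_HARDIRQ;
		if (in_serving_softirq())
			return CTX_SOFTIRQ;
		return CTX_TASK;
	}

	void printk_emergency_enter(int loglevel)
	{
		preempt_disable();	/* pin the CPU for the section */
		this_cpu_write(emergency_loglevel[printk_current_ctx()], loglevel);
	}

	void printk_emergency_exit(void)
	{
		this_cpu_write(emergency_loglevel[printk_current_ctx()], 0);
		preempt_enable();
	}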


Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Dan Williams
On Tue, Nov 12, 2019 at 8:27 PM John Hubbard  wrote:
>
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
>
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
>
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
>
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
>
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
>
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
> if (!static_branch_unlikely(&devmap_managed_key))
> return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
> switch (page->pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> case MEMORY_DEVICE_FS_DAX:
> -   __put_devmap_managed_page(page);
> return true;
> default:
> break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
> return false;
>  }
>
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> +   bool is_devmap = page_is_devmap_managed(page);
> +
> +   if (is_devmap) {
> +   int count = page_ref_dec_return(page);
> +
> +   /*
> +* devmap page refcounts are 1-based, rather than 0-based: if
> +* refcount is 1, then the page is free and the refcount is
> +* stable because nobody holds a reference on the page.
> +*/
> +   if (count == 1)
> +   free_devmap_managed_page(page);
> +   else if (!count)
> +   __put_page(page);
> +   }
> +
> +   return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> -   int count = page_ref_dec_return(page);
> +   /* Clear Active bit in case of parallel mark_page_accessed */
> +   __ClearPageActive(page);
> +   __ClearPageWaiters(page);
> +
> +   mem_cgroup_uncharge(page);

Ugh, when did all this HMM specific manipulation sneak into the
generic ZONE_DEVICE path? It used to be gated by pgmap type with its
own put_zone_device_private_page(). For example it's certainly
unnecessary and might be broken (would need to check) to call
mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
monolith and the HMM use case leaks pages into code paths that DAX
explicitly avoids.


Re: [PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-13 Thread Ira Weiny
On Wed, Nov 13, 2019 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 12, 2019 at 08:26:55PM -0800, John Hubbard wrote:
> > As it says in the updated comment in gup.c: current FOLL_LONGTERM
> > behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> > FS DAX check requirement on vmas.
> > 
> > However, the corresponding restriction in get_user_pages_remote() was
> > slightly stricter than is actually required: it forbade all
> > FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> > that do not set the "locked" arg.
> > 
> > Update the code and comments accordingly, and update the VFIO caller
> > to take advantage of this, fixing a bug as a result: the VFIO caller
> > is logically a FOLL_LONGTERM user.
> > 
> > Also, remove an unnecessary pair of calls that were releasing and
> > reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> > just in order to call page_to_pfn().
> > 
> > Also, move the DAX check ("if a VMA is DAX, don't allow long term
> > pinning") from the VFIO call site, all the way into the internals
> > of get_user_pages_remote() and __gup_longterm_locked(). That is:
> > get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> > calls check_dax_vmas(). It's lightly explained in the comments as well.
> > 
> > Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> > and to Dan Williams for helping clarify the DAX refactoring.
> > 
> > Suggested-by: Jason Gunthorpe 
> > Cc: Dan Williams 
> > Cc: Jerome Glisse 
> > Cc: Ira Weiny 
> > Signed-off-by: John Hubbard 
> >  drivers/vfio/vfio_iommu_type1.c | 25 ++---
> >  mm/gup.c| 27 ++-
> >  2 files changed, 24 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index d864277ea16f..7301b710c9a4 100644
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> > long vaddr,
> >  {
> > struct page *page[1];
> > struct vm_area_struct *vma;
> > -   struct vm_area_struct *vmas[1];
> > unsigned int flags = 0;
> > int ret;
> >  
> > @@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, 
> > unsigned long vaddr,
> > flags |= FOLL_WRITE;
> >  
> > down_read(&mm->mmap_sem);
> > -   if (mm == current->mm) {
> > -   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> > -vmas);
> > -   } else {
> > -   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> > -   vmas, NULL);
> > -   /*
> > -* The lifetime of a vaddr_get_pfn() page pin is
> > -* userspace-controlled. In the fs-dax case this could
> > -* lead to indefinite stalls in filesystem operations.
> > -* Disallow attempts to pin fs-dax pages via this
> > -* interface.
> > -*/
> > -   if (ret > 0 && vma_is_fsdax(vmas[0])) {
> > -   ret = -EOPNOTSUPP;
> > -   put_page(page[0]);
> > -   }
> > -   }
> > -   up_read(&mm->mmap_sem);
> > -
> > +   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> > +   page, NULL, NULL);
> > if (ret == 1) {
> > *pfn = page_to_pfn(page[0]);
> > return 0;
> 
> Mind the return with the lock held; this needs some goto unwind

Ah yea...  retract my reviewed by...  :-(

Ira
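
For what it's worth, the goto unwind being asked for would look roughly
like this (a sketch against the quoted hunk, not the actual respin):

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
				    page, NULL, NULL);
	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		ret = 0;
		goto out;
	}
	/* ... vaddr_get_pfn()'s fallback PFN lookup elided ... */
out:
	up_read(&mm->mmap_sem);
	return ret;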



Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Tue, Nov 12, 2019 at 08:26:51PM -0800, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);
>  
>   /*
> -  * If refcount is 1 then page is freed and refcount is stable as nobody
> -  * holds a reference on the page.
> +  * When a device_private page is freed, the page->mapping field
> +  * may still contain a (stale) mapping value. For example, the
> +  * lower bits of page->mapping may still identify the page as
> +  * an anonymous page. Ultimately, this entire field is just
> +  * stale and wrong, and it will cause errors if not cleared.
> +  * One example is:
> +  *
> +  *  migrate_vma_pages()
> +  *migrate_vma_insert_page()
> +  *  page_add_new_anon_rmap()
> +  *__page_set_anon_rmap()
> +  *  ...checks page->mapping, via PageAnon(page) call,
> +  *and incorrectly concludes that the page is an
> +  *anonymous page. Therefore, it incorrectly,
> +  *silently fails to set up the new anon rmap.
> +  *
> +  * For other types of ZONE_DEVICE pages, migration is either
> +  * handled differently or not done at all, so there is no need
> +  * to clear page->mapping.
>*/
> - if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> -  

Re: [PATCH v4 23/23] mm/gup: remove support for gup(FOLL_LONGTERM)

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:27:10PM -0800, John Hubbard wrote:
> Now that all other kernel callers of get_user_pages(FOLL_LONGTERM)
> have been converted to pin_longterm_pages(), lock it down:
> 
> 1) Add an assertion to get_user_pages(), preventing callers from
>passing FOLL_LONGTERM (in addition to the existing assertion that
>prevents FOLL_PIN).
> 
> 2) Remove the associated GUP_LONGTERM_BENCHMARK test.
> 
> Signed-off-by: John Hubbard 
> ---
>  mm/gup.c   | 8 
>  mm/gup_benchmark.c | 9 +
>  tools/testing/selftests/vm/gup_benchmark.c | 7 ++-
>  3 files changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 82e7e4ce5027..90f5f95ee7ac 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1756,11 +1756,11 @@ long get_user_pages(unsigned long start, unsigned 
> long nr_pages,
>   struct vm_area_struct **vmas)
>  {
>   /*
> -  * FOLL_PIN must only be set internally by the pin_user_page*() and
> -  * pin_longterm_*() APIs, never directly by the caller, so enforce that
> -  * with an assertion:
> +  * FOLL_PIN and FOLL_LONGTERM must only be set internally by the
> +  * pin_user_page*() and pin_longterm_*() APIs, never directly by the
> +  * caller, so enforce that with an assertion:
>*/
> - if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> + if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))

Don't we want to block FOLL_LONGTERM in get_user_pages_fast() as well after all
this?

Ira

>   return -EINVAL;
>  
>   return __gup_longterm_locked(current, current->mm, start, nr_pages,
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 8f980d91dbf5..679f0e6a0adb 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -6,7 +6,7 @@
>  #include 
>  
>  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
> +/* Command 2 has been deleted. */
>  #define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
>  #define PIN_FAST_BENCHMARK   _IOWR('g', 4, struct gup_benchmark)
>  #define PIN_LONGTERM_BENCHMARK   _IOWR('g', 5, struct gup_benchmark)
> @@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, 
> unsigned long nr_pages)
>  
>   switch (cmd) {
>   case GUP_FAST_BENCHMARK:
> - case GUP_LONGTERM_BENCHMARK:
>   case GUP_BENCHMARK:
>   for (i = 0; i < nr_pages; i++)
>   put_page(pages[i]);
> @@ -97,11 +96,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>   nr = get_user_pages_fast(addr, nr, gup->flags,
>pages + i);
>   break;
> - case GUP_LONGTERM_BENCHMARK:
> - nr = get_user_pages(addr, nr,
> - gup->flags | FOLL_LONGTERM,
> - pages + i, NULL);
> - break;
>   case GUP_BENCHMARK:
>   nr = get_user_pages(addr, nr, gup->flags, pages + i,
>   NULL);
> @@ -159,7 +153,6 @@ static long gup_benchmark_ioctl(struct file *filep, 
> unsigned int cmd,
>  
>   switch (cmd) {
>   case GUP_FAST_BENCHMARK:
> - case GUP_LONGTERM_BENCHMARK:
>   case GUP_BENCHMARK:
>   case PIN_FAST_BENCHMARK:
>   case PIN_LONGTERM_BENCHMARK:
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
> b/tools/testing/selftests/vm/gup_benchmark.c
> index 03928e47a86f..836b7082db13 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -15,7 +15,7 @@
>  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
>  
>  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
> +/* Command 2 has been deleted. */
>  #define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
>  
>  /*
> @@ -49,7 +49,7 @@ int main(int argc, char **argv)
>   char *file = "/dev/zero";
>   char *p;
>  
> - while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) {
>   switch (opt) {
>   case 'a':
>   cmd = PIN_FAST_BENCHMARK;
> @@ -75,9 +75,6 @@ int main(int argc, char **argv)
>   case 'T':
>   thp = 0;
>   break;
> - case 'L':
> - cmd = GUP_LONGTERM_BENCHMARK;
> - break;
>   case 'U':
>   cmd = GUP_BENCHMARK;
>   break;
> -- 
> 2.24.0
> 


Re: [PATCH v4 21/23] mm/gup_benchmark: support pin_user_pages() and related calls

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:27:08PM -0800, John Hubbard wrote:
> Up until now, gup_benchmark supported testing of the
> following kernel functions:
> 
> * get_user_pages(): via the '-U' command line option
> * get_user_pages_longterm(): via the '-L' command line option
> * get_user_pages_fast(): as the default (no options required)
> 
> Add test coverage for the new corresponding pin_*() functions:
> 
> * pin_user_pages(): via the '-c' command line option
> * pin_longterm_pages(): via the '-b' command line option
> * pin_user_pages_fast(): via the '-a' command line option
> 
> Also, add an option for clarity: '-u' for what is now (still) the
> default choice: get_user_pages_fast().
> 
> Also, for the three commands that set FOLL_PIN, verify that the pages
> really are dma-pinned, via the new page_dma_pinned() routine.
> Those commands are:
> 
> PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
> PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages()
> PIN_BENCHMARK  : calls pin_user_pages()
> 
> In between the calls to pin_*() and put_user_pages(),
> check each page: if page_dma_pinned() returns false, then
> WARN and return.
> 
> Do this outside of the benchmark timestamps, so that it doesn't
> affect reported times.
> 
> Signed-off-by: John Hubbard 

Reviewed-by: Ira Weiny 

> ---
>  mm/gup_benchmark.c | 73 --
>  tools/testing/selftests/vm/gup_benchmark.c | 23 ++-
>  2 files changed, 90 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7fc44d25eca7..8f980d91dbf5 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -8,6 +8,9 @@
>  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
>  #define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
>  #define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
> +#define PIN_FAST_BENCHMARK   _IOWR('g', 4, struct gup_benchmark)
> +#define PIN_LONGTERM_BENCHMARK   _IOWR('g', 5, struct gup_benchmark)
> +#define PIN_BENCHMARK_IOWR('g', 6, struct gup_benchmark)
>  
>  struct gup_benchmark {
>   __u64 get_delta_usec;
> @@ -19,6 +22,44 @@ struct gup_benchmark {
>   __u64 expansion[10];/* For future use */
>  };
>  
> +static void put_back_pages(int cmd, struct page **pages, unsigned long 
> nr_pages)
> +{
> + int i;
> +
> + switch (cmd) {
> + case GUP_FAST_BENCHMARK:
> + case GUP_LONGTERM_BENCHMARK:
> + case GUP_BENCHMARK:
> + for (i = 0; i < nr_pages; i++)
> + put_page(pages[i]);
> + break;
> +
> + case PIN_FAST_BENCHMARK:
> + case PIN_LONGTERM_BENCHMARK:
> + case PIN_BENCHMARK:
> + put_user_pages(pages, nr_pages);
> + break;
> + }
> +}
> +
> +static void verify_dma_pinned(int cmd, struct page **pages,
> +   unsigned long nr_pages)
> +{
> + int i;
> +
> + switch (cmd) {
> + case PIN_FAST_BENCHMARK:
> + case PIN_LONGTERM_BENCHMARK:
> + case PIN_BENCHMARK:
> + for (i = 0; i < nr_pages; i++) {
> + if (WARN(!page_dma_pinned(pages[i]),
> +  "pages[%d] is NOT dma-pinned\n", i))
> + break;
> + }
> + break;
> + }
> +}
> +
>  static int __gup_benchmark_ioctl(unsigned int cmd,
>   struct gup_benchmark *gup)
>  {
> @@ -65,6 +106,18 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>   nr = get_user_pages(addr, nr, gup->flags, pages + i,
>   NULL);
>   break;
> + case PIN_FAST_BENCHMARK:
> + nr = pin_user_pages_fast(addr, nr, gup->flags,
> +  pages + i);
> + break;
> + case PIN_LONGTERM_BENCHMARK:
> + nr = pin_longterm_pages(addr, nr, gup->flags, pages + i,
> + NULL);
> + break;
> + case PIN_BENCHMARK:
> + nr = pin_user_pages(addr, nr, gup->flags, pages + i,
> + NULL);
> + break;
>   default:
>   return -1;
>   }
> @@ -75,15 +128,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>   }
>   end_time = ktime_get();
>  
> + /* Shifting the meaning of nr_pages: now it is actual number pinned: */
> + nr_pages = i;
> +
>   gup->get_delta_usec = ktime_us_delta(end_time, start_time);
>   gup->size = addr - gup->addr;
>  
> + /*
> +  * Take an un-benchmark-timed moment to verify DMA pinned
> +  * state: print a warning if any non-dma-pinned pages are found:
> +  */
> + verify_dma_pinned(cmd, pages, nr_pages);
> +
>   start_time = ktime_get();
> - for (i = 0; i < n

Re: [PATCH v4 22/23] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:27:09PM -0800, John Hubbard wrote:
> It's good to have basic unit test coverage of the new FOLL_PIN
> behavior. Fortunately, the gup_benchmark unit test is extremely
> fast (a few milliseconds), so adding it to the run_vmtests suite
> is going to cause no noticeable change in running time.
> 
> So, add two new invocations to run_vmtests:
> 
> 1) Run gup_benchmark with normal get_user_pages().
> 
> 2) Run gup_benchmark with pin_user_pages(). This is much like
> the first call, except that it sets FOLL_PIN.
> 
> Running these two in quick succession also provides a visual
> comparison of the running times, which is convenient.
> 
> The new invocations are fairly early in the run_vmtests script,
> because with test suites, it's usually preferable to put the
> shorter, faster tests first, all other things being equal.
> 
> Signed-off-by: John Hubbard 

Reviewed-by: Ira Weiny 

> ---
>  tools/testing/selftests/vm/run_vmtests | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/tools/testing/selftests/vm/run_vmtests 
> b/tools/testing/selftests/vm/run_vmtests
> index 951c507a27f7..93e8dc9a7cad 100755
> --- a/tools/testing/selftests/vm/run_vmtests
> +++ b/tools/testing/selftests/vm/run_vmtests
> @@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
> coverage.  Use"
>  echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
>  echo "  hugetlb regression testing."
>  
> +echo ""
> +echo "running 'gup_benchmark -U' (normal/slow gup)"
> +echo ""
> +./gup_benchmark -U
> +if [ $? -ne 0 ]; then
> + echo "[FAIL]"
> + exitcode=1
> +else
> + echo "[PASS]"
> +fi
> +
> +echo "--"
> +echo "running gup_benchmark -c (pin_user_pages)"
> +echo "--"
> +./gup_benchmark -c
> +if [ $? -ne 0 ]; then
> + echo "[FAIL]"
> + exitcode=1
> +else
> + echo "[PASS]"
> +fi
> +
>  echo "---"
>  echo "running userfaultfd"
>  echo "---"
> -- 
> 2.24.0
> 


Re: [PATCH v4 20/23] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:27:07PM -0800, John Hubbard wrote:
> Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
> instead of a hard-coded "1" value.
> 
> Also, clean up the filtering of gup flags a little, by just doing
> it once before issuing any of the get_user_pages*() calls. This
> makes it harder to overlook, instead of having little "gup_flags & 1"
> phrases in the function calls.
> 
> Signed-off-by: John Hubbard 

Reviewed-by: Ira Weiny 

> ---
>  mm/gup_benchmark.c | 9 ++---
>  tools/testing/selftests/vm/gup_benchmark.c | 6 +-
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d7f8db..7fc44d25eca7 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>   nr = (next - addr) / PAGE_SIZE;
>   }
>  
> + /* Filter out most gup flags: only allow a tiny subset here: */
> + gup->flags &= FOLL_WRITE;
> +
>   switch (cmd) {
>   case GUP_FAST_BENCHMARK:
> - nr = get_user_pages_fast(addr, nr, gup->flags & 1,
> + nr = get_user_pages_fast(addr, nr, gup->flags,
>pages + i);
>   break;
>   case GUP_LONGTERM_BENCHMARK:
>   nr = get_user_pages(addr, nr,
> - (gup->flags & 1) | FOLL_LONGTERM,
> + gup->flags | FOLL_LONGTERM,
>   pages + i, NULL);
>   break;
>   case GUP_BENCHMARK:
> - nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
> + nr = get_user_pages(addr, nr, gup->flags, pages + i,
>   NULL);
>   break;
>   default:
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
> b/tools/testing/selftests/vm/gup_benchmark.c
> index 485cf06ef013..389327e9b30a 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -18,6 +18,9 @@
>  #define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
>  #define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
>  
> +/* Just the flags we need, copied from mm.h: */
> +#define FOLL_WRITE   0x01/* check pte is writable */
> +
>  struct gup_benchmark {
>   __u64 get_delta_usec;
>   __u64 put_delta_usec;
> @@ -85,7 +88,8 @@ int main(int argc, char **argv)
>   }
>  
>   gup.nr_pages_per_call = nr_pages;
> - gup.flags = write;
> + if (write)
> + gup.flags |= FOLL_WRITE;
>  
>   fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
>   if (fd == -1)
> -- 
> 2.24.0
> 


Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:26:56PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()

At some point in this conversation I thought we were going to put in "unpin_*"
versions of these.

Is that still in the plans?

Ira
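
For context: in this series the release side is spelled put_user_page() /
put_user_pages(). If the "unpin_*" naming Ira asks about were adopted, it
could be a thin wrapper over those calls. A minimal hypothetical sketch,
purely to illustrate the naming symmetry (not code from the series):

/* Hypothetical only: "unpin" counterparts to pin_user_pages*(). */
static inline void unpin_user_page(struct page *page)
{
	put_user_page(page);		/* release name used in this series */
}

static inline void unpin_user_pages(struct page **pages,
				    unsigned long npages)
{
	put_user_pages(pages, npages);	/* release name used in this series */
}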



Re: [PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:26:55PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Also, remove an unnecessary pair of calls that were releasing and
> reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> just in order to call page_to_pfn().
> 
> Also, move the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). It's lightly explained in the comments as well.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 

Reviewed-by: Ira Weiny 

> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 25 ++---
>  mm/gup.c| 27 ++-
>  2 files changed, 24 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..7301b710c9a4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> - struct vm_area_struct *vmas[1];
>   unsigned int flags = 0;
>   int ret;
>  
> @@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   flags |= FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
> - if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> - } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> - /*
> -  * The lifetime of a vaddr_get_pfn() page pin is
> -  * userspace-controlled. In the fs-dax case this could
> -  * lead to indefinite stalls in filesystem operations.
> -  * Disallow attempts to pin fs-dax pages via this
> -  * interface.
> -  */
> - if (ret > 0 && vma_is_fsdax(vmas[0])) {
> - ret = -EOPNOTSUPP;
> - put_page(page[0]);
> - }
> - }
> - up_read(&mm->mmap_sem);
> -
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> + page, NULL, NULL);
>   if (ret == 1) {
>   *pfn = page_to_pfn(page[0]);
>   return 0;
>   }
>  
> - down_read(&mm->mmap_sem);
> -
>   vaddr = untagged_addr(vaddr);
>  
>   vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> diff --git a/mm/gup.c b/mm/gup.c
> index 933524de6249..83702b2e86c8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,13 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
> +   struct mm_struct *mm,
> +   unsigned long start,
> +   unsigned long nr_pages,
> +   struct page **pages,
> +   struct vm_area_struct **vmas,
> +   unsigned int flags);
>  /*
>   * Return the compound head page with ref appropriately incremented,
>   * or NULL if that failed.
> @@ -1167,13 +1174,23 @@ long get_user_pages_remote(struct task_struct *tsk, 
> struct mm_struct *mm,
>   struct vm_area_struct **vmas, int *locked)
>  {
>   /*
> -  * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +  * Parts of FOLL_LONGTERM behavior are incompatible with
>* FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -  * vmas.  As there are no users of this flag in this call we simply
> -  * disallow this option for now.
> +  * vmas. However, this only comes up if locked is set, and there are
> +  * callers that do request FOLL_LONGT

Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread Dan Williams
On Wed, Nov 13, 2019 at 2:43 AM Jan Kara  wrote:
>
> On Tue 12-11-19 20:26:56, John Hubbard wrote:
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> >
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> >
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> >
> > pin_user_pages()
> > pin_user_pages_remote()
> > pin_user_pages_fast()
> >
> > pin_longterm_pages()
> > pin_longterm_pages_remote()
> > pin_longterm_pages_fast()
> >
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> >
> > The underlying rules are:
> >
> > * These are gup-internal flags, so the call sites should not directly
> > set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> > assertions, for the new FOLL_PIN flag. However, for the pre-existing
> > FOLL_LONGTERM flag, which has some call sites that still directly
> > set FOLL_LONGTERM, there is no assertion yet.
> >
> > * Call sites that want to indicate that they are going to do DirectIO
> >   ("DIO") or something with similar characteristics, should call a
> >   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
> >   will:
> > * Start with "pin_user_pages" instead of "get_user_pages". That
> >   makes it easy to find and audit the call sites.
> > * Set FOLL_PIN
> >
> > * For pages that are received via FOLL_PIN, those pages must be returned
> >   via put_user_page().
> >
> > Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> > in this documentation. (I've reworded it and expanded upon it.)
> >
> > Reviewed-by: Mike Rapoport   # Documentation
> > Reviewed-by: Jérôme Glisse 
> > Cc: Jonathan Corbet 
> > Cc: Ira Weiny 
> > Signed-off-by: John Hubbard 
>
> Thanks for the documentation. It looks great!
>
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 83702b2e86c8..4409e84dff51 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -201,6 +201,10 @@ static struct page *follow_page_pte(struct 
> > vm_area_struct *vma,
> >   spinlock_t *ptl;
> >   pte_t *ptep, pte;
> >
> > + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> > + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> > +  (FOLL_PIN | FOLL_GET)))
> > + return ERR_PTR(-EINVAL);
> >  retry:
> >   if (unlikely(pmd_bad(*pmd)))
> >   return no_page_table(vma, flags);
>
> How does FOLL_PIN result in grabbing (at least normal, for now) page 
> reference?
> I didn't find that anywhere in this patch but it is a prerequisite to
> converting any user to pin_user_pages() interface, right?
>
> > +/**
> > + * pin_user_pages_fast() - pin user pages in memory without taking locks
> > + *
> > + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. 
> > See
> > + * get_user_pages_fast() for documentation on the function arguments, 
> > because
> > + * the arguments here are identical.
> > + *
> > + * FOLL_PIN means that the pages must be released via put_user_page(). 
> > Please
> > + * see Documentation/vm/pin_user_pages.rst for further details.
> > + *
> > + * This is intended for Case 1 (DIO) in 
> > Documentation/vm/pin_user_pages.rst. It
> > + * is NOT intended for Case 2 (RDMA: long-term pins).
> > + */
> > +int pin_user_pages_fast(unsigned long start, int nr_pages,
> > + unsigned int gup_flags, struct page **pages)
> > +{
> > + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> > + if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> > + return -EINVAL;
> > +
> > + gup_flags |= FOLL_PIN;
> > + return internal_get_user_pages_fast(start, nr_pages, gup_flags, 
> > pages);
> > +}
> > +EXPORT_SYMBOL_GPL(pin_user_pages_fast);
>
> I was somewhat wondering about the number of functions you add here. So we
> have:
>
> pin_user_pages()
> pin_user_pages_fast()
> pin_user_pages_remote()
>
> and then longterm variants:
>
> pin_longterm_pages()
> pin_longterm_pages_fast()
> pin_longterm_pages_remote()
>
> and obviously we have gup like:
> get_user_pages()
> get_user_pages_fast()
> get_user_pages_remote()
> ... and some other gup variants ...
>
> I think we really should have pin_* vs get_* variants as they are very
> different in terms of guarantees and after conversion, any use of get_*
> variant in non-mm code should be closely scrutinized. OTOH pin_longterm_*
> don't look *that* useful to me and just using pin_* instead with
> FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of
> functions which is already large enough? What do people think? I don't feel
> too strongly about this but wanted to bring this up.

I'd vote that FOLL_LONGTERM should obviate the need for
{get,pin}_user_pages_longterm(). It's a property that is passed by the
call site, not an internal flag.
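
To make the trade-off concrete, here is a hypothetical call-site
comparison, assuming the function signatures posted in this series; the
flag-based form is what Jan and Dan are suggesting, not merged code:

/* Sketch: long-term pinning of user memory for an RDMA-style user. */
static long pin_for_longterm_user(unsigned long start,
				  unsigned long nr_pages,
				  struct page **pages)
{
	/* Posted series: dedicated variant, FOLL_LONGTERM set internally:
	 *
	 *	return pin_longterm_pages(start, nr_pages, FOLL_WRITE,
	 *				  pages, NULL);
	 *
	 * Suggested alternative: one entry point, caller passes the flag:
	 */
	return pin_user_pages(start, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
			      pages, NULL);
}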


Re: [PATCH v4 03/23] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-11-13 Thread Ira Weiny
On Tue, Nov 12, 2019 at 08:26:50PM -0800, John Hubbard wrote:
> An upcoming patch uses try_get_compound_head() more widely,
> so move it to the top of gup.c.
> 
> Also fix a tiny spelling error and a checkpatch.pl warning.
> 
> Signed-off-by: John Hubbard 

Simple enough...

Reviewed-by: Ira Weiny 

> ---
>  mm/gup.c | 29 +++--
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..933524de6249 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,21 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +/*
> + * Return the compound head page with ref appropriately incremented,
> + * or NULL if that failed.
> + */
> +static inline struct page *try_get_compound_head(struct page *page, int refs)
> +{
> + struct page *head = compound_head(page);
> +
> + if (WARN_ON_ONCE(page_ref_count(head) < 0))
> + return NULL;
> + if (unlikely(!page_cache_add_speculative(head, refs)))
> + return NULL;
> + return head;
> +}
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
> pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -1793,20 +1808,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, 
> int nr_start,
>   }
>  }
>  
> -/*
> - * Return the compund head page with ref appropriately incremented,
> - * or NULL if that failed.
> - */
> -static inline struct page *try_get_compound_head(struct page *page, int refs)
> -{
> - struct page *head = compound_head(page);
> - if (WARN_ON_ONCE(page_ref_count(head) < 0))
> - return NULL;
> - if (unlikely(!page_cache_add_speculative(head, refs)))
> - return NULL;
> - return head;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>unsigned int flags, struct page **pages, int *nr)
> -- 
> 2.24.0
> 


Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread Ira Weiny
> > +/**
> > + * pin_user_pages_fast() - pin user pages in memory without taking locks
> > + *
> > + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. 
> > See
> > + * get_user_pages_fast() for documentation on the function arguments, 
> > because
> > + * the arguments here are identical.
> > + *
> > + * FOLL_PIN means that the pages must be released via put_user_page(). 
> > Please
> > + * see Documentation/vm/pin_user_pages.rst for further details.
> > + *
> > + * This is intended for Case 1 (DIO) in 
> > Documentation/vm/pin_user_pages.rst. It
> > + * is NOT intended for Case 2 (RDMA: long-term pins).
> > + */
> > +int pin_user_pages_fast(unsigned long start, int nr_pages,
> > +   unsigned int gup_flags, struct page **pages)
> > +{
> > +   /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> > +   if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> > +   return -EINVAL;
> > +
> > +   gup_flags |= FOLL_PIN;
> > +   return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> > +}
> > +EXPORT_SYMBOL_GPL(pin_user_pages_fast);
> 
> I was somewhat wondering about the number of functions you add here. So we
> have:
> 
> pin_user_pages()
> pin_user_pages_fast()
> pin_user_pages_remote()
> 
> and then longterm variants:
> 
> pin_longterm_pages()
> pin_longterm_pages_fast()
> pin_longterm_pages_remote()
> 
> and obviously we have gup like:
> get_user_pages()
> get_user_pages_fast()
> get_user_pages_remote()
> ... and some other gup variants ...
> 
> I think we really should have pin_* vs get_* variants as they are very
> different in terms of guarantees and after conversion, any use of get_*
> variant in non-mm code should be closely scrutinized. OTOH pin_longterm_*
> don't look *that* useful to me and just using pin_* instead with
> FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of
> functions which is already large enough? What do people think? I don't feel
> too strongly about this but wanted to bring this up.

I'm a bit concerned with the function explosion myself.  I think what you
suggest is a happy medium.  So I'd be ok with that.

Ira



Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path

2019-11-13 Thread Cédric Le Goater
On 13/11/2019 17:46, Greg Kurz wrote:
> We need to check the host page size is big enough to accommodate the
> EQ. Let's do this before taking a reference on the EQ page to avoid
> a potential leak if the check fails.
> 
> Cc: sta...@vger.kernel.org # v5.2
> Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ 
> configuration")
> Signed-off-by: Greg Kurz 


Reviewed-by: Cédric Le Goater 

> ---
>  arch/powerpc/kvm/book3s_xive_native.c |   13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 0e1fc5a16729..d83adb1e1490 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -630,12 +630,6 @@ static int kvmppc_xive_native_set_queue_config(struct 
> kvmppc_xive *xive,
>  
>   srcu_idx = srcu_read_lock(&kvm->srcu);
>   gfn = gpa_to_gfn(kvm_eq.qaddr);
> - page = gfn_to_page(kvm, gfn);
> - if (is_error_page(page)) {
> - srcu_read_unlock(&kvm->srcu, srcu_idx);
> - pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
> - return -EINVAL;
> - }
>  
>   page_size = kvm_host_page_size(kvm, gfn);
>   if (1ull << kvm_eq.qshift > page_size) {
> @@ -644,6 +638,13 @@ static int kvmppc_xive_native_set_queue_config(struct 
> kvmppc_xive *xive,
>   return -EINVAL;
>   }
>  
> + page = gfn_to_page(kvm, gfn);
> + if (is_error_page(page)) {
> + srcu_read_unlock(&kvm->srcu, srcu_idx);
> + pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
> + return -EINVAL;
> + }
> +
>   qaddr = page_to_virt(page) + (kvm_eq.qaddr & ~PAGE_MASK);
>   srcu_read_unlock(&kvm->srcu, srcu_idx);
>  
> 



[PATCH v2 2/2] KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path

2019-11-13 Thread Greg Kurz
We need to check the host page size is big enough to accommodate the
EQ. Let's do this before taking a reference on the EQ page to avoid
a potential leak if the check fails.

Cc: sta...@vger.kernel.org # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ 
configuration")
Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s_xive_native.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 0e1fc5a16729..d83adb1e1490 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -630,12 +630,6 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
 
srcu_idx = srcu_read_lock(&kvm->srcu);
gfn = gpa_to_gfn(kvm_eq.qaddr);
-   page = gfn_to_page(kvm, gfn);
-   if (is_error_page(page)) {
-   srcu_read_unlock(&kvm->srcu, srcu_idx);
-   pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
-   return -EINVAL;
-   }
 
page_size = kvm_host_page_size(kvm, gfn);
if (1ull << kvm_eq.qshift > page_size) {
@@ -644,6 +638,13 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
return -EINVAL;
}
 
+   page = gfn_to_page(kvm, gfn);
+   if (is_error_page(page)) {
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
+   return -EINVAL;
+   }
+
qaddr = page_to_virt(page) + (kvm_eq.qaddr & ~PAGE_MASK);
srcu_read_unlock(&kvm->srcu, srcu_idx);
 



[PATCH v2 1/2] KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one

2019-11-13 Thread Greg Kurz
The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
In both cases the existing EQ page reference is leaked because we simply
overwrite it in the XIVE queue structure without calling put_page().

This is especially visible when the guest memory is backed with huge pages:
start a VM up to the guest userspace, either reboot it or unplug a vCPU,
quit QEMU. The leak is observed by comparing the value of HugePages_Free in
/proc/meminfo before and after the VM is run.

Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
platform level. This isn't the case right now because the various XIVE
drivers have different allocation needs. It might be worth introducing
hooks for this purpose instead of exposing XIVE internals to the drivers,
but this is certainly a sizeable piece of work to be done later.

In the meantime, for easier backport, fix both vCPU unplug and guest reboot
leaks by introducing a wrapper around xive_native_configure_queue() that
does the necessary cleanup.

Reported-by: Satheesh Rajendran 
Cc: sta...@vger.kernel.org # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ 
configuration")
Signed-off-by: Cédric Le Goater 
Signed-off-by: Greg Kurz 
---
v2: use wrapper as suggested by Cedric
---
 arch/powerpc/kvm/book3s_xive_native.c |   31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 34bd123fa024..0e1fc5a16729 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -50,6 +50,24 @@ static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu 
*vcpu, int prio)
}
 }
 
+static int kvmppc_xive_native_configure_queue(u32 vp_id, struct xive_q *q,
+ u8 prio, __be32 *qpage,
+ u32 order, bool can_escalate)
+{
+   int rc;
+   __be32 *qpage_prev = q->qpage;
+
+   rc = xive_native_configure_queue(vp_id, q, prio, qpage, order,
+can_escalate);
+   if (rc)
+   return rc;
+
+   if (qpage_prev)
+   put_page(virt_to_page(qpage_prev));
+
+   return rc;
+}
+
 void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
@@ -575,19 +593,14 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
q->guest_qaddr  = 0;
q->guest_qshift = 0;
 
-   rc = xive_native_configure_queue(xc->vp_id, q, priority,
-NULL, 0, true);
+   rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
+   NULL, 0, true);
if (rc) {
pr_err("Failed to reset queue %d for VCPU %d: %d\n",
   priority, xc->server_num, rc);
return rc;
}
 
-   if (q->qpage) {
-   put_page(virt_to_page(q->qpage));
-   q->qpage = NULL;
-   }
-
return 0;
}
 
@@ -646,8 +659,8 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
  * OPAL level because the use of END ESBs is not supported by
  * Linux.
  */
-   rc = xive_native_configure_queue(xc->vp_id, q, priority,
-(__be32 *) qaddr, kvm_eq.qshift, true);
+   rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
+   (__be32 *) qaddr, kvm_eq.qshift, true);
if (rc) {
pr_err("Failed to configure queue %d for VCPU %d: %d\n",
   priority, xc->server_num, rc);



[PATCH] MAINTAINERS: Add pattern that matches ppc perf to the perf entry.

2019-11-13 Thread Michal Suchanek
Signed-off-by: Michal Suchanek 
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index eb19fad370d7..2e6c187bc98c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12772,6 +12772,8 @@ F:  arch/*/kernel/*/perf_event*.c
 F: arch/*/kernel/*/*/perf_event*.c
 F: arch/*/include/asm/perf_event.h
 F: arch/*/kernel/perf_callchain.c
+F: arch/*/perf/*
+F: arch/*/perf/*/*
 F: arch/*/events/*
 F: arch/*/events/*/*
 F: tools/perf/
-- 
2.23.0



[PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Nicolas Saenz Julienne
Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit is
meant to contain the highest accessible DMA address.
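
As a quick editorial illustration of the arithmetic (not part of the
patch): a mask can only describe a window whose size is a power of two,
while a limit stores the exact top address. A small standalone C
program, assuming the kernel's DMA_BIT_MASK() definition:

#include <stdint.h>
#include <stdio.h>

#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

int main(void)
{
	/* Raspberry Pi 4 PCIe: only the lower 3 GB are addressable. */
	uint64_t bus_dma_limit = 0xBFFFFFFFULL;	/* 3 GB - 1, exact */

	/* The closest masks either undershoot or overshoot the window: */
	printf("DMA_BIT_MASK(31) = %#llx (2 GB - 1, too small)\n",
	       (unsigned long long)DMA_BIT_MASK(31));
	printf("DMA_BIT_MASK(32) = %#llx (4 GB - 1, too large)\n",
	       (unsigned long long)DMA_BIT_MASK(32));
	printf("bus_dma_limit    = %#llx (exact)\n",
	       (unsigned long long)bus_dma_limit);
	return 0;
}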

Signed-off-by: Nicolas Saenz Julienne 

---

Note this is rebased on top of Christoph's latest DMA series:
https://www.spinics.net/lists/arm-kernel/msg768600.html

 arch/mips/pci/fixup-sb1250.c  | 16 
 arch/powerpc/sysdev/fsl_pci.c |  6 +++---
 arch/x86/kernel/pci-dma.c |  2 +-
 arch/x86/mm/mem_encrypt.c |  2 +-
 arch/x86/pci/sta2x11-fixup.c  |  2 +-
 drivers/acpi/arm64/iort.c |  2 +-
 drivers/ata/ahci.c|  2 +-
 drivers/iommu/dma-iommu.c |  3 +--
 drivers/of/device.c   |  9 +
 include/linux/device.h|  6 +++---
 include/linux/dma-direct.h|  2 +-
 include/linux/dma-mapping.h   |  2 +-
 kernel/dma/direct.c   | 27 +--
 13 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/arch/mips/pci/fixup-sb1250.c b/arch/mips/pci/fixup-sb1250.c
index 8a41b359cf90..40efc990cdce 100644
--- a/arch/mips/pci/fixup-sb1250.c
+++ b/arch/mips/pci/fixup-sb1250.c
@@ -21,22 +21,22 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SIBYTE, 
PCI_DEVICE_ID_BCM1250_PCI,
 
 /*
  * The BCM1250, etc. PCI host bridge does not support DAC on its 32-bit
- * bus, so we set the bus's DMA mask accordingly.  However the HT link
+ * bus, so we set the bus's DMA limit accordingly.  However the HT link
  * down the artificial PCI-HT bridge supports 40-bit addressing and the
  * SP1011 HT-PCI bridge downstream supports both DAC and a 64-bit bus
  * width, so we record the PCI-HT bridge's secondary and subordinate bus
- * numbers and do not set the mask for devices present in the inclusive
+ * numbers and do not set the limit for devices present in the inclusive
  * range of those.
  */
-struct sb1250_bus_dma_mask_exclude {
+struct sb1250_bus_dma_limit_exclude {
bool set;
unsigned char start;
unsigned char end;
 };
 
-static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data)
+static int sb1250_bus_dma_limit(struct pci_dev *dev, void *data)
 {
-   struct sb1250_bus_dma_mask_exclude *exclude = data;
+   struct sb1250_bus_dma_limit_exclude *exclude = data;
bool exclude_this;
bool ht_bridge;
 
@@ -55,7 +55,7 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void 
*data)
exclude->start, exclude->end);
} else {
dev_dbg(&dev->dev, "disabling DAC for device");
-   dev->dev.bus_dma_mask = DMA_BIT_MASK(32);
+   dev->dev.bus_dma_limit = DMA_BIT_MASK(32);
}
 
return 0;
@@ -63,9 +63,9 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void 
*data)
 
 static void quirk_sb1250_pci_dac(struct pci_dev *dev)
 {
-   struct sb1250_bus_dma_mask_exclude exclude = { .set = false };
+   struct sb1250_bus_dma_limit_exclude exclude = { .set = false };
 
-   pci_walk_bus(dev->bus, sb1250_bus_dma_mask, &exclude);
+   pci_walk_bus(dev->bus, sb1250_bus_dma_limit, &exclude);
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SIBYTE, PCI_DEVICE_ID_BCM1250_PCI,
quirk_sb1250_pci_dac);
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index ff0e2b156cb5..617a443d673d 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -115,8 +115,8 @@ static void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
 {
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 
-   pdev->dev.bus_dma_mask =
-   hose->dma_window_base_cur + hose->dma_window_size;
+   pdev->dev.bus_dma_limit =
+   hose->dma_window_base_cur + hose->dma_window_size - 1;
 }
 
 static void setup_swiotlb_ops(struct pci_controller *hose)
@@ -135,7 +135,7 @@ static void fsl_pci_dma_set_mask(struct device *dev, u64 
dma_mask)
 * mapping that allows addressing any RAM address from across PCI.
 */
if (dev_is_pci(dev) && dma_mask >= pci64_dma_offset * 2 - 1) {
-   dev->bus_dma_mask = 0;
+   dev->bus_dma_limit = 0;
dev->archdata.dma_offset = pci64_dma_offset;
}
 }
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index fa4352dce491..3a75

[PATCH v2] powerpc/kernel/sysfs: Add PMU_SYSFS config option to enable PMU SPRs sysfs file creation

2019-11-13 Thread Kajol Jain
Many of the performance monitoring unit (PMU) SPRs are
exposed in sysfs. The "perf" API is the primary interface to program
the PMU and collect counter data in the system, so only expose these
PMU SPRs in the absence of CONFIG_PERF_EVENTS.

The patch adds a new CONFIG option 'CONFIG_PMU_SYSFS'. The new config
option is used in kernel/sysfs.c to gate PMU SPR sysfs file creation,
and it is enabled only if the 'CONFIG_PERF_EVENTS' option is
disabled.

Tested this patch with CONFIG_PERF_EVENTS enabled and disabled
on powernv and pseries machines.
Also did compile testing for different architectures including
x86, mips, mips64, alpha and arm, and with the book3s_32.config option.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/kernel/sysfs.c| 21 +
 arch/powerpc/platforms/Kconfig.cputype |  8 
 2 files changed, 29 insertions(+)

---
Changelog:
v1 -> v2
- Added new config option 'PMU_SYSFS' for PMU SPR's creation
  rather than using PERF_EVENTS config option directly and make
  sure SPR's file creation only if 'CONFIG_PERF_EVENTS' disabled.
---
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 80a676da11cb..b7c01f1ef236 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -457,16 +457,21 @@ static ssize_t __used \
 
 #if defined(CONFIG_PPC64)
 #define HAS_PPC_PMC_CLASSIC1
+#ifdef CONFIG_PMU_SYSFS
 #define HAS_PPC_PMC_IBM1
+#endif
 #define HAS_PPC_PMC_PA6T   1
 #elif defined(CONFIG_PPC_BOOK3S_32)
 #define HAS_PPC_PMC_CLASSIC1
+#ifdef CONFIG_PMU_SYSFS
 #define HAS_PPC_PMC_IBM1
 #define HAS_PPC_PMC_G4 1
 #endif
+#endif
 
 
 #ifdef HAS_PPC_PMC_CLASSIC
+#ifdef CONFIG_PMU_SYSFS
 SYSFS_PMCSETUP(mmcr0, SPRN_MMCR0);
 SYSFS_PMCSETUP(mmcr1, SPRN_MMCR1);
 SYSFS_PMCSETUP(pmc1, SPRN_PMC1);
@@ -485,6 +490,10 @@ SYSFS_PMCSETUP(pmc7, SPRN_PMC7);
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
 
 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+#endif /* CONFIG_PPC64 */
+#endif /* CONFIG_PMU_SYSFS */
+
+#ifdef CONFIG_PPC64
 SYSFS_SPRSETUP(purr, SPRN_PURR);
 SYSFS_SPRSETUP(spurr, SPRN_SPURR);
 SYSFS_SPRSETUP(pir, SPRN_PIR);
@@ -495,7 +504,9 @@ SYSFS_SPRSETUP(tscr, SPRN_TSCR);
   enable write when needed with a separate function.
   Lets be conservative and default to pseries.
 */
+#ifdef CONFIG_PMU_SYSFS
 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+#endif /* CONFIG_PMU_SYSFS */
 static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
 static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
 static DEVICE_ATTR(pir, 0400, show_pir, NULL);
@@ -606,12 +617,14 @@ static void sysfs_create_dscr_default(void)
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
+#ifdef CONFIG_PMU_SYSFS
 SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
 SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
 SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
+#endif /* CONFIG_PMU_SYSFS */
 #ifdef CONFIG_DEBUG_MISC
 SYSFS_SPRSETUP(hid0, SPRN_HID0);
 SYSFS_SPRSETUP(hid1, SPRN_HID1);
@@ -644,6 +657,7 @@ SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3);
 #endif /* CONFIG_DEBUG_MISC */
 #endif /* HAS_PPC_PMC_PA6T */
 
+#ifdef CONFIG_PMU_SYSFS
 #ifdef HAS_PPC_PMC_IBM
 static struct device_attribute ibm_common_attrs[] = {
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
@@ -671,9 +685,11 @@ static struct device_attribute classic_pmc_attrs[] = {
__ATTR(pmc8, 0600, show_pmc8, store_pmc8),
 #endif
 };
+#endif /* CONFIG_PMU_SYSFS */
 
 #ifdef HAS_PPC_PMC_PA6T
 static struct device_attribute pa6t_attrs[] = {
+#ifdef CONFIG_PMU_SYSFS
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
__ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
@@ -682,6 +698,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
+#endif /* CONFIG_PMU_SYSFS */
 #ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
@@ -787,8 +804,10 @@ static int register_cpu_online(unsigned int cpu)
device_create_file(s, &pmc_attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, &dev_attr_mmcra);
+#endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_PURR)) {
if (!firmware_has_feature(FW_FEATURE_LPAR))
@@ -876,8 +895,10 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, &pmc_attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, &dev_attr_mmcra);
+#endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_P

Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-13 Thread Bharata B Rao
On Tue, Nov 12, 2019 at 04:34:34PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> [snip]
> > @@ -5442,6 +5471,64 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu 
> > *vcpu, ulong *eaddr, void *ptr,
> > return rc;
> >  }
> >  
> > +/*
> > + *  IOCTL handler to turn off secure mode of guest
> > + *
> > + * - Issue ucall to terminate the guest on the UV side
> > + * - Unpin the VPA pages (Enables these pages to be migrated back
> > + *   when VM becomes secure again)
> > + * - Recreate partition table as the guest is transitioning back to
> > + *   normal mode
> > + * - Release all device pages
> > + */
> > +static int kvmhv_svm_off(struct kvm *kvm)
> > +{
> > +   struct kvm_vcpu *vcpu;
> > +   int srcu_idx;
> > +   int ret = 0;
> > +   int i;
> > +
> > +   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> > +   return ret;
> > +
> 
> A further comment on this code: it should check that no vcpus are
> running and fail if any are running, and it should prevent any vcpus
> from running until the function is finished, using code like that in
> kvmhv_configure_mmu().  That is, it should do something like this:
> 
>   mutex_lock(&kvm->arch.mmu_setup_lock);
>   mmu_was_ready = kvm->arch.mmu_ready;
>   if (kvm->arch.mmu_ready) {
>   kvm->arch.mmu_ready = 0;
>   /* order mmu_ready vs. vcpus_running */
>   smp_mb();
>   if (atomic_read(&kvm->arch.vcpus_running)) {
>   kvm->arch.mmu_ready = 1;
>   ret = -EBUSY;
>   goto out_unlock;
>   }
>   }
> 
> and then after clearing kvm->arch.secure_guest below:
> 
> > +   srcu_idx = srcu_read_lock(&kvm->srcu);
> > +   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +   struct kvm_memory_slot *memslot;
> > +   struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > +   if (!slots)
> > +   continue;
> > +
> > +   kvm_for_each_memslot(memslot, slots) {
> > +   kvmppc_uvmem_drop_pages(memslot, kvm, true);
> > +   uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
> > +   }
> > +   }
> > +   srcu_read_unlock(&kvm->srcu, srcu_idx);
> > +
> > +   ret = uv_svm_terminate(kvm->arch.lpid);
> > +   if (ret != U_SUCCESS) {
> > +   ret = -EINVAL;
> > +   goto out;
> > +   }
> > +
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   spin_lock(&vcpu->arch.vpa_update_lock);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.dtl);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.vpa);
> > +   spin_unlock(&vcpu->arch.vpa_update_lock);
> > +   }
> > +
> > +   ret = kvmppc_reinit_partition_table(kvm);
> > +   if (ret)
> > +   goto out;
> > +
> > +   kvm->arch.secure_guest = 0;
> 
> you need to do:
> 
>   kvm->arch.mmu_ready = mmu_was_ready;
>  out_unlock:
>   mutex_unlock(&kvm->arch.mmu_setup_lock);
> 
> > +out:
> > +   return ret;
> > +}
> > +
> 
> With that extra check in place, it should be safe to unpin the vpas if
> there is a good reason to do so.  ("Userspace has some bug that we
> haven't found" isn't a good reason to do so.)

QEMU indeed does set_one_reg to reset the VPAs, but that only marks
the VPA update as pending. The actual unpinning happens when the vcpu
gets to run after reset, at which time the VPAs are updated after
any unpinning (if required).

When a secure guest reboots, vcpu 0 gets to run, unpins its
VPA pages and then proceeds with switching to secure. Here the UV
tries to page in all the guest pages, including the still-pinned
VPA pages corresponding to the other vcpus which haven't had a chance
to run yet. They are all still pinned and hence page-in fails.

To prevent this, we have to explicitly unpin the VPA pages during
this svm off ioctl. This will ensure that an SMP secure guest is able
to reboot correctly.

So I will incorporate the code chunk you have shown above to fail
if any vcpu is running and prevent any vcpu from running when
we unpin VPAs from this ioctl.

Regards,
Bharata.



Re: generic DMA bypass flag

2019-11-13 Thread Robin Murphy

On 13/11/2019 1:37 pm, Christoph Hellwig wrote:
> Hi all,
>
> I've recently been chatting with Lu about using dma-iommu and
> per-device DMA ops in the intel IOMMU driver, and one missing feature
> in dma-iommu is a bypass mode where the direct mapping is used even
> when an iommu is attached to improve performance.  The powerpc
> code already has a similar mode, so I'd like to move it to the core
> DMA mapping code.  As part of that I noticed that the current
> powerpc code has a little bug in that it used the wrong check in the
> dma_sync_* routines to see if the direct mapping code is used.

In all honesty, this seems silly. If we can set a per-device flag to say 
"oh, bypass these ops and use some other ops instead", then we can just 
as easily simply give the device the appropriate ops in the first place. 
Because, y'know, the name of the game is "per-device ops".

> These two patches just add the generic code and move powerpc over,
> the intel IOMMU bits will require a separate discussion.

The ops need to match the default domain type, so the hard part of the 
problem is choosing when and if to switch that (particularly fun if 
there are multiple devices in the IOMMU group). Flipping between 
iommu-dma and dma-direct at that point will be trivial.

I don't see a great benefit to pulling legacy cruft out into common code 
instead of just working to get rid of it in-place, when said cruft pulls 
in the opposite direction to where we're taking the common code (i.e. 
it's inherently based on the premise of global ops).

Robin.

> The x86 AMD Gart code also has a bypass mode, but it is quite strange,
> so I'm not going to touch it for now.



Re: [PATCH] powerpc/powernv: Disable native PCIe port management

2019-11-13 Thread Bjorn Helgaas
On Wed, Nov 13, 2019 at 08:40:35PM +1100, Oliver O'Halloran wrote:
> On PowerNV the PCIe topology is (currently) managed by the powernv platform
> code in cooperation with firmware. The PCIe-native service drivers bypass
> both and this can cause problems.
> 
> Historically this hasn't been a big deal since the only port service
> driver that saw much use was the AER driver. The AER driver relies on
> a kernel service to report when errors occur rather than acting autonomously,
> so it's fairly easy to ignore. On PowerNV (and pseries) AER events are
> handled through EEH, which ignores the AER service, so it's never been
> an issue.
> 
> Unfortunately, the hotplug port service driver (pciehp) does act
> autonomously and conflicts with the platform specific hotplug
> driver (pnv_php). The main issue is that pciehp claims the interrupt
> associated with the PCIe capability which in turn prevents pnv_php from
> claiming it.
> 
> This results in hotplug events being handled by pciehp which does not
> notify firmware when the PCIe topology changes, and does not setup/teardown
> the arch specific PCI device structures (pci_dn) when the topology changes.
> The end result is that hot-added devices cannot be enabled and hot-removed
> devices may not be fully torn-down on removal.
> 
> We can fix these problems by setting the "pcie_ports_disabled" flag during
> platform initialisation. The flag indicates the platform owns the PCIe
> ports which stops the portbus driver being registered.
> 
> Cc: Sergey Miroshnichenko 
> Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver")
> Signed-off-by: Oliver O'Halloran 
> ---
> Sergey, just FYI. I'll try to sort out the rest of the hotplug
> trainwreck in 5.6.
> 
> The Fixes: here is for the patch that added pnv_php in 4.8. It's been
> a problem since then, but wasn't noticed until people started testing
> it after the EEH fixes in commit 799abe283e51 ("powerpc/eeh: Clean up
> EEH PEs after recovery finishes") went in earlier in the 5.4 cycle.
> ---
>  arch/powerpc/platforms/powernv/pci.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index 2825d00..ae62583 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -941,6 +941,23 @@ void __init pnv_pci_init(void)
>  
>   pci_add_flags(PCI_CAN_SKIP_ISA_ALIGN);
>  
> +#ifdef CONFIG_PCIEPORTBUS
> + /*
> +  * On PowerNV PCIe devices are (currently) managed in cooperation
> +  * with firmware. This isn't *strictly* required, but there's enough
> +  * assumptions baked into both firmware and the platform code that
> +  * it's unwise to allow the portbus services to be used.
> +  *
> +  * We need to fix this eventually, but for now set this flag to disable
> +  * the portbus driver. The AER service isn't required since AER
> +  * events are handled via EEH. The pciehp hotplug driver can't work
> +  * without kernel changes (and portbus binding breaks pnv_php). The
> +  * other services also require some thinking about how we're going
> +  * to integrate them.
> +  */
> + pcie_ports_disabled = true;
> +#endif

This is fine, but it feels like sort of a blunt instrument.  Is there
any practical way to clear pci_host_bridge.native_pcie_hotplug (and
native_aer if appropriate) for the PHBs in question?  That would also
prevent pciehp from binding.
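
For illustration, the narrower approach could look roughly like the
sketch below. pci_find_host_bridge() and the native_* flags are existing
kernel interfaces, but the surrounding function and its call site are
hypothetical and untested:

/* Hypothetical sketch: per-PHB opt-out instead of disabling portdrv. */
static void pnv_pci_claim_native_services(struct pci_bus *bus)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(bus);

	/* The platform (pnv_php / EEH) owns hotplug and AER here. */
	bridge->native_pcie_hotplug = 0;
	bridge->native_aer = 0;
}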

We might someday pull portdrv into the PCI core directly instead of as
a separate driver, and I'm thinking that might be easier if we have
more specific indications of what the core shouldn't use.

>   /* If we don't have OPAL, eg. in sim, just skip PCI probe */
>   if (!firmware_has_feature(FW_FEATURE_OPAL))
>   return;
> -- 
> 2.9.5
> 


[PATCH 2/2] powerpc: use the generic dma_ops_bypass mode

2019-11-13 Thread Christoph Hellwig
Use the DMA API bypass mechanism for direct window mappings.  This uses
common code and speeds up the direct mapping case by avoiding indirect
calls just when not using dma ops at all.  It also fixes a problem where
the sync_* methods were using the bypass check for DMA allocations, but
those are part of the streaming ops.
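
The sync_* problem reads roughly as follows (a reconstruction for
illustration, not a literal quote of the removed code): the streaming
sync hooks gated on the coherent-allocation bypass check rather than
the streaming-mapping one.

/* Reconstructed sketch of the bug being fixed: */
static void dma_iommu_sync_for_cpu(struct device *dev, dma_addr_t addr,
				   size_t size, enum dma_data_direction dir)
{
	/* Wrong check: dma_iommu_alloc_bypass() also tests the coherent
	 * DMA mask, which is irrelevant for streaming operations: */
	if (dma_iommu_alloc_bypass(dev))
		dma_direct_sync_single_for_cpu(dev, addr, size, dir);
}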

Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
has never been well defined, and is only used by a few drivers, which
IIRC never showed up in the typical Cell blade setups that are affected
by the ordering workaround.

Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/device.h |  5 --
 arch/powerpc/kernel/dma-iommu.c   | 90 ---
 2 files changed, 9 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 266542769e4b..452402215e12 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -18,11 +18,6 @@ struct iommu_table;
  * drivers/macintosh/macio_asic.c
  */
 struct dev_archdata {
-   /*
-* Set to %true if the dma_iommu_ops are requested to use a direct
-* window instead of dynamically mapping memory.
-*/
-   booliommu_bypass : 1;
/*
 * These two used to be a union. However, with the hybrid ops we need
 * both so here we store both a DMA offset for direct mappings and
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e486d1d78de2..569fecd7b5b2 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -14,23 +14,6 @@
  * Generic iommu implementation
  */
 
-/*
- * The coherent mask may be smaller than the real mask, check if we can
- * really use a direct window.
- */
-static inline bool dma_iommu_alloc_bypass(struct device *dev)
-{
-   return dev->archdata.iommu_bypass && !iommu_fixed_is_weak &&
-   dma_direct_supported(dev, dev->coherent_dma_mask);
-}
-
-static inline bool dma_iommu_map_bypass(struct device *dev,
-   unsigned long attrs)
-{
-   return dev->archdata.iommu_bypass &&
-   (!iommu_fixed_is_weak || (attrs & DMA_ATTR_WEAK_ORDERING));
-}
-
 /* Allocates a contiguous real buffer and creates mappings over it.
  * Returns the virtual address of the buffer and sets dma_handle
  * to the dma address (mapping) of the first page.
@@ -39,8 +22,6 @@ static void *dma_iommu_alloc_coherent(struct device *dev, 
size_t size,
  dma_addr_t *dma_handle, gfp_t flag,
  unsigned long attrs)
 {
-   if (dma_iommu_alloc_bypass(dev))
-   return dma_direct_alloc(dev, size, dma_handle, flag, attrs);
return iommu_alloc_coherent(dev, get_iommu_table_base(dev), size,
dma_handle, dev->coherent_dma_mask, flag,
dev_to_node(dev));
@@ -50,11 +31,7 @@ static void dma_iommu_free_coherent(struct device *dev, 
size_t size,
void *vaddr, dma_addr_t dma_handle,
unsigned long attrs)
 {
-   if (dma_iommu_alloc_bypass(dev))
-   dma_direct_free(dev, size, vaddr, dma_handle, attrs);
-   else
-   iommu_free_coherent(get_iommu_table_base(dev), size, vaddr,
-   dma_handle);
+   iommu_free_coherent(get_iommu_table_base(dev), size, vaddr, dma_handle);
 }
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
@@ -67,9 +44,6 @@ static dma_addr_t dma_iommu_map_page(struct device *dev, 
struct page *page,
 enum dma_data_direction direction,
 unsigned long attrs)
 {
-   if (dma_iommu_map_bypass(dev, attrs))
-   return dma_direct_map_page(dev, page, offset, size, direction,
-   attrs);
return iommu_map_page(dev, get_iommu_table_base(dev), page, offset,
  size, dma_get_mask(dev), direction, attrs);
 }
@@ -79,11 +53,8 @@ static void dma_iommu_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 size_t size, enum dma_data_direction direction,
 unsigned long attrs)
 {
-   if (!dma_iommu_map_bypass(dev, attrs))
-   iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size,
-   direction,  attrs);
-   else
-   dma_direct_unmap_page(dev, dma_handle, size, direction, attrs);
+   iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size, direction,
+attrs);
 }
 
 
@@ -91,8 +62,6 @@ static int dma_iommu_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nelems, enum dma_data_direction dir

[PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2019-11-13 Thread Christoph Hellwig
Several IOMMU drivers have a bypass mode where they can use a direct
mapping if the device's DMA mask is large enough.  Add generic support
to the core dma-mapping code to do that to switch those drivers to
a common solution.
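
As a sketch of how a driver would manage the flag (driver name and
helper invented for illustration), the ->dma_supported() hook could do
something like:

static int foo_iommu_dma_supported(struct device *dev, u64 mask)
{
	/*
	 * If the device's mask covers all memory the IOMMU would have
	 * to map, let streaming ops fall through to dma-direct;
	 * otherwise keep translating. foo_iommu_can_bypass() is a
	 * hypothetical helper encapsulating that policy.
	 */
	dev->dma_ops_bypass = foo_iommu_can_bypass(dev, mask);
	return 1;
}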

Signed-off-by: Christoph Hellwig 
---
 include/linux/device.h  |  4 
 include/linux/dma-mapping.h | 30 ++
 kernel/dma/mapping.c| 35 ++-
 3 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 297239a08bb7..b8a3b4ec46bd 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1217,6 +1217,9 @@ struct dev_links_info {
  *  device.
  * @dma_coherent: this particular device is dma coherent, even if the
  * architecture supports non-coherent devices.
+ * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
+ * streaming DMA operations (->map_* / ->unmap_* / ->sync_*).
+ * This flag is managed by the dma_ops from ->dma_supported.
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -1316,6 +1319,7 @@ struct device {
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
booldma_coherent:1;
 #endif
+   booldma_ops_bypass : 1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4d450672b7d6..22fe74179e02 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -191,9 +191,15 @@ static inline int dma_mmap_from_global_coherent(struct 
vm_area_struct *vma,
 }
 #endif /* CONFIG_DMA_DECLARE_COHERENT */
 
-static inline bool dma_is_direct(const struct dma_map_ops *ops)
+/*
+ * Check if the device uses a direct mapping for streaming DMA operations.
+ * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
+ * enough.
+ */
+static inline bool dma_map_direct(struct device *dev,
+   const struct dma_map_ops *ops)
 {
-   return likely(!ops);
+   return likely(!ops) || dev->dma_ops_bypass;
 }
 
 /*
@@ -282,7 +288,7 @@ static inline dma_addr_t dma_map_page_attrs(struct device 
*dev,
dma_addr_t addr;
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -297,7 +303,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, 
dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
@@ -316,7 +322,7 @@ static inline int dma_map_sg_attrs(struct device *dev, 
struct scatterlist *sg,
int ents;
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -334,7 +340,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
struct scatterlist *sg
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
else if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
@@ -355,7 +361,7 @@ static inline dma_addr_t dma_map_resource(struct device 
*dev,
if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr
return DMA_MAPPING_ERROR;
 
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
addr = dma_direct_map_resource(dev, phys_addr, size, dir, 
attrs);
else if (ops->map_resource)
addr = ops->map_resource(dev, phys_addr, size, dir, attrs);
@@ -371,7 +377,7 @@ static inline void dma_unmap_resource(struct device *dev, 
dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (!dma_is_direct(ops) && ops->unmap_resource)
+   if (!dma_map_direct(dev, ops) && ops->unmap_resource)
ops->unmap_resource(dev, addr, size, dir, attrs);
debug_dma_unmap_resource(dev, addr, size, dir);
 }
@@ -383,7 +389,7 @@ static inline void dma_sync_single_for_cpu(struct device 
*dev, dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (d

generic DMA bypass flag

2019-11-13 Thread Christoph Hellwig
Hi all,

I've recently been chatting with Lu about using dma-iommu and
per-device DMA ops in the intel IOMMU driver, and one missing feature
in dma-iommu is a bypass mode where the direct mapping is used even
when an iommu is attached to improve performance.  The powerpc
code already has a similar mode, so I'd like to move it to the core
DMA mapping code.  As part of that I noticed that the current
powerpc code has a little bug in that it used the wrong check in the
dma_sync_* routines to see if the direct mapping code is used.

These two patches just add the generic code and move powerpc over,
the intel IOMMU bits will require a separate discussion.

The x86 AMD Gart code also has a bypass mode, but it is a lot
stranger, so I'm not going to touch it for now.
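
As a rough illustration of how a driver is expected to use the new flag
(a sketch only -- foo_iommu_* is a made-up name; just the flag handling
follows the series):

static int foo_iommu_dma_supported(struct device *dev, u64 mask)
{
	/*
	 * A full 64-bit mask can address all memory directly, so tell
	 * the core dma-mapping code to bypass the IOMMU ops ...
	 */
	if (mask == DMA_BIT_MASK(64)) {
		dev->dma_ops_bypass = true;
		return 1;
	}

	/* ... while smaller masks keep going through the IOMMU. */
	dev->dma_ops_bypass = false;
	return 1;
}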


Re: [PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-13 Thread Jason Gunthorpe
On Tue, Nov 12, 2019 at 08:26:55PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
> 
> Also, remove an unnecessary pair of calls that were releasing and
> reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
> just in order to call page_to_pfn().
> 
> Also, move the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). It's lightly explained in the comments as well.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Suggested-by: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
>  drivers/vfio/vfio_iommu_type1.c | 25 ++---
>  mm/gup.c| 27 ++-
>  2 files changed, 24 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..7301b710c9a4 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> - struct vm_area_struct *vmas[1];
>   unsigned int flags = 0;
>   int ret;
>  
> @@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>   flags |= FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
> - if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> - } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> - /*
> -  * The lifetime of a vaddr_get_pfn() page pin is
> -  * userspace-controlled. In the fs-dax case this could
> -  * lead to indefinite stalls in filesystem operations.
> -  * Disallow attempts to pin fs-dax pages via this
> -  * interface.
> -  */
> - if (ret > 0 && vma_is_fsdax(vmas[0])) {
> - ret = -EOPNOTSUPP;
> - put_page(page[0]);
> - }
> - }
> - up_read(&mm->mmap_sem);
> -
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> + page, NULL, NULL);
>   if (ret == 1) {
>   *pfn = page_to_pfn(page[0]);
>   return 0;

Mind the return with the lock held; this needs some goto unwind
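
For illustration, roughly the unwind being asked for (a sketch, not the
actual follow-up patch):

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
				    page, NULL, NULL);
	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		ret = 0;
		goto out;	/* success: drop mmap_sem before returning */
	}

	/* ... the existing PFNMAP fallback goes here ... */
out:
	up_read(&mm->mmap_sem);
	return ret;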

Jason


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread Daniel Vetter
On Wed, Nov 13, 2019 at 11:12:10AM +0100, Jan Kara wrote:
> On Wed 13-11-19 01:02:02, John Hubbard wrote:
> > On 11/13/19 12:22 AM, Daniel Vetter wrote:
> > ...
> > > > > Why are we doing this? I think things got confused here someplace, as
> > > > 
> > > > 
> > > > Because:
> > > > 
> > > > a) These need put_page() calls,  and
> > > > 
> > > > b) there is no put_pages() call, but there is a release_pages() call that
> > > > is, arguably, what put_pages() would be.
> > > > 
> > > > 
> > > > > the comment still says:
> > > > > 
> > > > > /**
> > > > >   * put_user_page() - release a gup-pinned page
> > > > >   * @page:pointer to page to be released
> > > > >   *
> > > > >   * Pages that were pinned via get_user_pages*() must be released via
> > > > >   * either put_user_page(), or one of the put_user_pages*() routines
> > > > >   * below.
> > > > 
> > > > 
> > > > Ohhh, I missed those comments. They need to all be changed over to
> > > > say "pages that were pinned via pin_user_pages*() or
> > > > pin_longterm_pages*() must be released via put_user_page*()."
> > > > 
> > > > The get_user_pages*() pages must still be released via put_page.
> > > > 
> > > > The churn is due to a fairly significant change in strategy, which
> > > > is: instead of changing all get_user_pages*() sites to call
> > > > put_user_page(), change selected sites to call pin_user_pages*() or
> > > > pin_longterm_pages*(), plus put_user_page().
> > > 
> > > Can't we call this unpin_user_page then, for some symmetry? Or is that
> > > even more churn?
> > > 
> > > Looking from afar the naming here seems really confusing.
> > 
> > 
> > That look from afar is valuable, because I'm too close to the problem to see
> > how the naming looks. :)
> > 
> > unpin_user_page() sounds symmetrical. It's true that it would cause more
> > churn (which is why I started off with a proposal that avoids changing the
> > names of put_user_page*() APIs). But OTOH, the amount of churn is 
> > proportional
> > to the change in direction here, and it's really only 10 or 20 lines 
> > changed,
> > in the end.
> > 
> > So I'm open to changing to that naming. It would be nice to hear what others
> > prefer, too...
> 
> FWIW I'd find unpin_user_page() also better than put_user_page() as a
> counterpart to pin_user_pages().

One more point from afar on pin/unpin: We use that a lot in graphics for
permanently pinned graphics buffer objects. Which really only should be
used for scanout. So at least graphics folks should have an appropriate
mindset and try to make sure we don't overuse this stuff.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v4 02/23] mm/gup: factor out duplicate code from four routines

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:49, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly), by waiting until the successful
> end of each routine, to increment *nr.
> 
> Reviewed-by: Jérôme Glisse 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 

> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..199da99e8ffc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages, int nr)
> +{
> + int nr_recorded_pages = 0;
> +
> + do {
> + pages[nr] = page;
> + nr++;
> + page++;
> + nr_recorded_pages++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return nr_recorded_pages;
> +}

Why don't you just pass in pages + nr?
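
I.e., something like this sketch, so the helper fills the array starting
at the right slot and the caller adds the count (illustration only, not a
posted patch):

static int __record_subpages(struct page *page, unsigned long addr,
			     unsigned long end, struct page **pages)
{
	int nr = 0;

	do {
		pages[nr++] = page++;
	} while (addr += PAGE_SIZE, addr != end);
	return nr;
}

	/* a caller then does: */
	*nr += __record_subpages(page, addr, end, pages + *nr);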

> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}
> +
> +static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
> +{
> + *nr += nr_recorded_pages;
> + SetPageReferenced(head);
> +}

I don't find this last helper very useful. It seems to muddy the waters
more than necessary...

Other than that the cleanup looks nice to me.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v1 08/12] powerpc/pseries: CMM: Implement balloon compaction

2019-11-13 Thread Michael Ellerman
David Hildenbrand  writes:
> On 12.11.19 11:46, Michael Ellerman wrote:
>> David Hildenbrand  writes:
>>> On 31.10.19 15:29, David Hildenbrand wrote:
 We can now get rid of the cmm_lock and completely rely on the balloon
 compaction internals, which now also manage the page list and the lock.
>> ...
 +
 +static int cmm_migratepage(struct balloon_dev_info *b_dev_info,
 + struct page *newpage, struct page *page,
 + enum migrate_mode mode)
 +{
 +  unsigned long flags;
 +
 +  /*
 +   * loan/"inflate" the newpage first.
 +   *
 +   * We might race against the cmm_thread who might discover after our
 +   * loan request that another page is to be unloaned. However, once
 +   * the cmm_thread runs again later, this error will automatically
 +   * be corrected.
 +   */
 +  if (plpar_page_set_loaned(newpage)) {
 +  /* Unlikely, but possible. Tell the caller not to retry now. */
 +  pr_err_ratelimited("%s: Cannot set page to loaned.", __func__);
 +  return -EBUSY;
 +  }
 +
 +  /* balloon page list reference */
 +  get_page(newpage);
 +
 +  spin_lock_irqsave(&b_dev_info->pages_lock, flags);
 +  balloon_page_insert(b_dev_info, newpage);
 +  balloon_page_delete(page);
>>>
>>> I think I am missing a b_dev_info->isolated_pages-- here.
>> 
>> I don't know this code at all, but looking at other balloon drivers they
>> do seem to do that in roughly the same spot.
>> 
>> I'll add it, how can we test that it's correct?
>
> It's certainly correct. We increment when we isolate 
> (balloon_page_isolate()) and decrement when we un-isolate.
>
> Un-isolate happens when we put back an isolated page 
> (balloon_page_putback() - migration aborted) or when we successfully 
> migrate it (via balloon_page_migrate()).
>
> The issue is that we cannot decrement in balloon_page_migrate(), as we 
> have to hold the b_dev_info->pages_lock. That's why we have to do it in 
> the registered callback under lock.

OK, I get it now.

> Please note that b_dev_info->isolated_pages is only needed for a sanity 
> check in balloon_page_dequeue(). That's why I didn't notice during 
> testing. I wonder if we should at some point rip out that sanity check ...

OK. Sanity checks can be good, though checks that call BUG() are less
nice :)  But I'm not an mm expert so I'll defer to you folks on the
sanity check.

For now I've merged this series with the decrement added to
cmm_migratepage().
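
The decrement lands in the section of cmm_migratepage() quoted above;
roughly (a sketch of the squashed result, not the verbatim commit):

	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
	balloon_page_insert(b_dev_info, newpage);
	balloon_page_delete(page);
	/* undo the increment done in balloon_page_isolate() */
	b_dev_info->isolated_pages--;
	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);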

cheers


Re: [PATCH 4/5] power: avs: smartreflex: Remove superfluous cast in debugfs_create_file() call

2019-11-13 Thread Rafael J. Wysocki
On Monday, October 21, 2019 4:51:48 PM CET Geert Uytterhoeven wrote:
> There is no need to cast a typed pointer to a void pointer when calling
> a function that accepts the latter.  Remove it, as the cast prevents
> further compiler checks.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  drivers/power/avs/smartreflex.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/avs/smartreflex.c b/drivers/power/avs/smartreflex.c
> index 4684e7df833a81e9..5376f3d22f31eade 100644
> --- a/drivers/power/avs/smartreflex.c
> +++ b/drivers/power/avs/smartreflex.c
> @@ -905,7 +905,7 @@ static int omap_sr_probe(struct platform_device *pdev)
>   sr_info->dbg_dir = debugfs_create_dir(sr_info->name, sr_dbg_dir);
>  
>   debugfs_create_file("autocomp", S_IRUGO | S_IWUSR, sr_info->dbg_dir,
> - (void *)sr_info, &pm_sr_fops);
> + sr_info, &pm_sr_fops);
>   debugfs_create_x32("errweight", S_IRUGO, sr_info->dbg_dir,
>  &sr_info->err_weight);
>   debugfs_create_x32("errmaxlimit", S_IRUGO, sr_info->dbg_dir,
> 

Applying as 5.5 material, thanks!


Re: Bug 205201 - overflow of DMA mask and bus mask

2019-11-13 Thread Christoph Hellwig
Interesting.  Give me some time to come up with a real fix, as drivers
really should not mess with GFP flags for these allocations, and even
if they did swiotlb is supposed to take care of any resulting problems.


[Bug 205201] Booting halts if Dawicontrol DC-2976 UW SCSI board installed, unless RAM size limited to 3500M

2019-11-13 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205201

--- Comment #14 from Christian Zigotzky (chzigot...@xenosoft.de) ---
Created attachment 285889
  --> https://bugzilla.kernel.org/attachment.cgi?id=285889&action=edit
Patch for renaming the GFP_DMA32 to GFP_DMA

Hi All,

The issue with the BT878 TV cards is solved. :-)

GFP_DMA32 was renamed to GFP_DMA in the PowerPC updates 4.21-1 in 
December last year.

Some PCI devices still use GFP_DMA32 (grep -r GFP_DMA32 *). I renamed 
GFP_DMA32 to GFP_DMA in the file 
"drivers/media/v4l2-core/videobuf-dma-sg.c". After compiling the RC7 of 
kernel 5.4, my BT878 TV card works again.

I created a patch for renaming GFP_DMA32 to GFP_DMA.
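
(For illustration only -- the exact call sites in videobuf-dma-sg.c may
differ, but the change described is of this shape:)

-	pg = alloc_page(GFP_KERNEL | GFP_DMA32 | __GFP_ZERO);
+	pg = alloc_page(GFP_KERNEL | GFP_DMA | __GFP_ZERO);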

This patch doesn't solve the issue with the Dawicontrol DC-2976 UW SCSI board.
Please help us to solve the issue with the SCSI board too.

Cheers,
Christian


Re: [PATCH v4 01/23] mm/gup: pass flags arg to __gup_device_* functions

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:48, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so
> pass the flags argument through to the __gup_device_*
> functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Looks good! You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..85caf76b3012 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   int refs;
> @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   if (pud_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> + return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 03/23] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:50, John Hubbard wrote:
> An upcoming patch uses try_get_compound_head() more widely,
> so move it to the top of gup.c.
> 
> Also fix a tiny spelling error and a checkpatch.pl warning.
> 
> Signed-off-by: John Hubbard 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 29 +++--
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..933524de6249 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,21 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +/*
> + * Return the compound head page with ref appropriately incremented,
> + * or NULL if that failed.
> + */
> +static inline struct page *try_get_compound_head(struct page *page, int refs)
> +{
> + struct page *head = compound_head(page);
> +
> + if (WARN_ON_ONCE(page_ref_count(head) < 0))
> + return NULL;
> + if (unlikely(!page_cache_add_speculative(head, refs)))
> + return NULL;
> + return head;
> +}
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
> pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -1793,20 +1808,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, 
> int nr_start,
>   }
>  }
>  
> -/*
> - * Return the compund head page with ref appropriately incremented,
> - * or NULL if that failed.
> - */
> -static inline struct page *try_get_compound_head(struct page *page, int refs)
> -{
> - struct page *head = compound_head(page);
> - if (WARN_ON_ONCE(page_ref_count(head) < 0))
> - return NULL;
> - if (unlikely(!page_cache_add_speculative(head, refs)))
> - return NULL;
> - return head;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>unsigned int flags, struct page **pages, int *nr)
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:51, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);
>  
>   /*
> -  * If refcount is 1 then page is freed and refcount is stable as nobody
> -  * holds a reference on the page.
> +  * When a device_private page is freed, the page->mapping field
> +  * may still contain a (stale) mapping value. For example, the
> +  * lower bits of page->mapping may still identify the page as
> +  * an anonymous page. Ultimately, this entire field is just
> +  * stale and wrong, and it will cause errors if not cleared.
> +  * One example is:
> +  *
> +  *  migrate_vma_pages()
> +  *migrate_vma_insert_page()
> +  *  page_add_new_anon_rmap()
> +  *__page_set_anon_rmap()
> +  *  ...checks page->mapping, via PageAnon(page) call,
> +  *and incorrectly concludes that the page is an
> +  *anonymous page. Therefore, it incorrectly,
> +  *silently fails to set up the new anon rmap.
> +  *
> +  * For other types of ZONE_DEVICE pages, migration is either
> +  * handled differently or not done at all, so there is no need
> +  * to clear page->mapping.
>*/
> - if (count == 1) {
>

Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:56, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
> * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded upon it.)
> 
> Reviewed-by: Mike Rapoport   # Documentation
> Reviewed-by: Jérôme Glisse 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 

Thanks for the documentation. It looks great!

> diff --git a/mm/gup.c b/mm/gup.c
> index 83702b2e86c8..4409e84dff51 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -201,6 +201,10 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   spinlock_t *ptl;
>   pte_t *ptep, pte;
>  
> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> +  (FOLL_PIN | FOLL_GET)))
> + return ERR_PTR(-EINVAL);
>  retry:
>   if (unlikely(pmd_bad(*pmd)))
>   return no_page_table(vma, flags);

How does FOLL_PIN result in grabbing a (normal, for now) page reference?
I didn't find that anywhere in this patch but it is a prerequisite to
converting any user to pin_user_pages() interface, right?

> +/**
> + * pin_user_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
> + * get_user_pages_fast() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> + unsigned int gup_flags, struct page **pages)
> +{
> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> + if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> + return -EINVAL;
> +
> + gup_flags |= FOLL_PIN;
> + return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_user_pages_fast);

I was somewhat wondering about the number of functions you add here. So we
have:

pin_user_pages()
pin_user_pages_fast()
pin_user_pages_remote()

and then longterm variants:

pin_longterm_pages()
pin_longterm_pages_fast()
pin_longterm_pages_remote()

and obviously we have gup like:
get_user_pages()
get_user_pages_fast()
get_user_pages_remote()
... and some other gup variants ...

I think we really should have pin_* vs get_* variants as they are very
different in terms of guarantees and after conversion, any use of get_*
variant in non-mm code should be closely scrutinized. OTOH pin_longterm_*
don't look *that* useful to me and just using pin_* instead with
FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of
functions which is already large enough? What do people think? I don't feel
too strongly about this but wanted to bring this up.
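
(Under that suggestion, a longterm caller would do roughly the following,
instead of calling a dedicated pin_longterm_pages_fast() -- illustration
only:)

	ret = pin_user_pages_fast(start, nr_pages,
				  gup_flags | FOLL_LONGTERM, pages);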

Honza



-- 
Jan Kara 
SUSE Labs, CR


Re: unify the dma_capable definition

2019-11-13 Thread Nicolas Saenz Julienne
On Wed, 2019-11-13 at 08:35 +0100, Christoph Hellwig wrote:
> Hi all,
> 
> there is no good reason to have different version of dma_capable.
> This series removes the arch overrides and also adds a few cleanups
> in the same area.

Reviewed-by: Nicolas Saenz Julienne 

for the whole series,
Thanks!


Re: Bug 205201 - overflow of DMA mask and bus mask

2019-11-13 Thread Christian Zigotzky

Hello Christoph,

I have found the issue. :-)

GFP_DMA32 was renamed to GFP_DMA in the PowerPC updates 4.21-1 in 
December last year.


Some PCI devices still use GFP_DMA32 (grep -r GFP_DMA32 *). I renamed 
GFP_DMA32 to GFP_DMA in the file 
"drivers/media/v4l2-core/videobuf-dma-sg.c". After compiling the RC7 of 
kernel 5.4 my TV card works again.


Cheers,
Christian


On 12 November 2019 at 3:41 pm, Christoph Hellwig wrote:
> On Mon, Nov 11, 2019 at 01:22:55PM +0100, Christian Zigotzky wrote:
> > Now, I can definitely say that this patch does not solve the issue.
> >
> > Do you have another patch for testing or shall I bisect?
>
> I'm still interested in the .config and dmesg.  Especially if the
> board is using arch/powerpc/sysdev/fsl_pci.c, which is the only
> place inside the powerpc arch code doing funny things with the
> bus_dma_mask, which Robin pointed out looks fishy.
>
> > Thanks,
> > Christian
>
> ---end quoted text---


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread Jan Kara
On Wed 13-11-19 01:02:02, John Hubbard wrote:
> On 11/13/19 12:22 AM, Daniel Vetter wrote:
> ...
> > > > Why are we doing this? I think things got confused here someplace, as
> > > 
> > > 
> > > Because:
> > > 
> > > a) These need put_page() calls,  and
> > > 
> > > b) there is no put_pages() call, but there is a release_pages() call that
> > > is, arguably, what put_pages() would be.
> > > 
> > > 
> > > > the comment still says:
> > > > 
> > > > /**
> > > >   * put_user_page() - release a gup-pinned page
> > > >   * @page:pointer to page to be released
> > > >   *
> > > >   * Pages that were pinned via get_user_pages*() must be released via
> > > >   * either put_user_page(), or one of the put_user_pages*() routines
> > > >   * below.
> > > 
> > > 
> > > Ohhh, I missed those comments. They need to all be changed over to
> > > say "pages that were pinned via pin_user_pages*() or
> > > pin_longterm_pages*() must be released via put_user_page*()."
> > > 
> > > The get_user_pages*() pages must still be released via put_page.
> > > 
> > > The churn is due to a fairly significant change in strategy, which
> > > is: instead of changing all get_user_pages*() sites to call
> > > put_user_page(), change selected sites to call pin_user_pages*() or
> > > pin_longterm_pages*(), plus put_user_page().
> > 
> > Can't we call this unpin_user_page then, for some symmetry? Or is that
> > even more churn?
> > 
> > Looking from afar the naming here seems really confusing.
> 
> 
> That look from afar is valuable, because I'm too close to the problem to see
> how the naming looks. :)
> 
> unpin_user_page() sounds symmetrical. It's true that it would cause more
> churn (which is why I started off with a proposal that avoids changing the
> names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
> to the change in direction here, and it's really only 10 or 20 lines changed,
> in the end.
> 
> So I'm open to changing to that naming. It would be nice to hear what others
> prefer, too...

FWIW I'd find unpin_user_page() also better than put_user_page() as a
counterpart to pin_user_pages().

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 31/33] powerpc/64: make buildable without CONFIG_COMPAT

2019-11-13 Thread Arnd Bergmann
On Wed, Nov 13, 2019 at 9:41 AM Michal Suchánek  wrote:
>
> On Wed, Nov 13, 2019 at 01:02:34PM +1000, Nicholas Piggin wrote:

> > >
> > > @@ -277,7 +276,7 @@ static void do_signal(struct task_struct *tsk)
> > >
> > > rseq_signal_deliver(&ksig, tsk->thread.regs);
> > >
> > > -   if (is32) {
> > > +   if (is_32bit_task()) {
> > > if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> > > ret = handle_rt_signal32(&ksig, oldset, tsk);
> > > else
> >
> > This is just a clean up I guess.
>
> It also expands directly to if(0) or if(1) for the !COMPAT cases. I am
> not sure how it would work with the intermediate variable.
>
> There was more complex change initially but after some additional
> cleanups removing the variable is the only part left.

I would be surprised if that made any difference to a modern compiler,
but the new version is definitely clearer to human readers.

   Arnd


[RESEND PATCH] selftests/powerpc: Handle Makefile for unrecognized option

2019-11-13 Thread Harish
On older distributions like SLES12 SP5, gcc does not recognize the
-no-pie option, causing the powerpc selftests build to fail.

Fixes the following:
gcc: error: unrecognized command line option ‘-no-pie’

Signed-off-by: Harish 
---
 tools/testing/selftests/powerpc/pmu/ebb/Makefile | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile 
b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
index 23f4caf48ffc..417306353e07 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../../../../../../scripts/Kbuild.include
+
 noarg:
$(MAKE) -C ../../
 
@@ -6,7 +8,10 @@ noarg:
 CFLAGS += -m64
 
 # Toolchains may build PIE by default which breaks the assembly
-LDFLAGS += -no-pie
+no-pie-option := $(call try-run, echo 'int main() { return 0; }' | \
+$(CC) -Werror $(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS) -no-pie -x c - -o 
"$$TMP", -no-pie)
+
+LDFLAGS += $(no-pie-option)
 
 TEST_GEN_PROGS := reg_access_test event_attributes_test cycles_test\
 cycles_with_freeze_test pmc56_overflow_test\
-- 
2.21.0



Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Petr Mladek
On Wed 2019-11-13 15:33:34, Sergey Senozhatsky wrote:
> On (19/11/13 02:25), Dmitry Safonov wrote:
> > I guess I've pointed that in my point of view price for one-time testing
> > code is cheaper than adding a new printk feature to swap log levels on
> > the fly.
> [..]
> > I've gone through functions used by sysrq driver and the same changes
> > introducing log level parameter would be needed for: sched_show_task(),
> > debug_show_all_locks(), show_regs(), show_state(), show_mem(). Some of
> > them don't need any platform changes, but at least show_regs() needs.
> 
> Good points and nice conclusion.
> 
> Well, here we go. There are a number of generally useful functions that
> print nice data and where people might want to have better loglevel control
> (for debugging purposes). show_stack() is just one of them.

Could you please provide some examples so that we get an idea about
the scope, usefulness, and requirements?

> Patching all
> those functions, which you have mentioned above, is hardly a fun task to do.
> Hence the printk() per-CPU per-context loglevel proposition. The code there
> is not clever or complicated and is meant for debugging purposes only, but
> with just 3 lines of code one can do some stuff:
> 
>   /* @BTW you can't do this with "%s" KERN_FOO ;P */
> + printk_emergency_enter(LOGLEVEL_SCHED);
> + debug_show_all_locks();
> + printk_emergency_exit();

But this will not solve situations where the original loglevel should
stay for some reason. It happened in this patchset, see

https://lkml.kernel.org/r/20191106091258.gs25...@shell.armlinux.org.uk
https://lkml.kernel.org/r/20191106132516.GC5808@willie-the-truck

We would need to investigate more potential users of this feature to
see eventual requirements. If there are too many exceptions and modes
then the generic API might get pretty complicated.

At the moment, I am in favor of this patchset. It is huge and
needed a lot of manual work. But the result is straightforward and
easy to understand.

Best Regards,
Petr


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-13 Thread Sergey Senozhatsky
On (19/11/13 02:25), Dmitry Safonov wrote:
> I guess I've pointed that in my point of view price for one-time testing
> code is cheaper than adding a new printk feature to swap log levels on
> the fly.
[..]
> I've gone through functions used by sysrq driver and the same changes
> introducing log level parameter would be needed for: sched_show_task(),
> debug_show_all_locks(), show_regs(), show_state(), show_mem(). Some of
> them don't need any platform changes, but at least show_regs() needs.

Good points and nice conclusion.

Well, here we go. There are a number of generally useful functions that
print nice data and where people might want to have better loglevel control
(for debugging purposes). show_stack() is just one of them. Patching all
those functions, which you have mentioned above, is hardly a fun task to do.
Hence the printk() per-CPU per-context loglevel proposition. The code there
is not clever or complicated and is meant for debugging purposes only, but
with just 3 lines of code one can do some stuff:

/* @BTW you can't do this with "%s" KERN_FOO ;P */
+   printk_emergency_enter(LOGLEVEL_SCHED);
+   debug_show_all_locks();
+   printk_emergency_exit();
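
(No implementation is posted in this thread; purely as a sketch of what
a per-CPU, per-context override could look like -- an assumption, not
the actual proposal:)

static DEFINE_PER_CPU(int, printk_context_loglevel);

void printk_emergency_enter(int loglevel)
{
	/* stay on this CPU so the override remains local to this context */
	preempt_disable();
	this_cpu_write(printk_context_loglevel, loglevel);
}

void printk_emergency_exit(void)
{
	this_cpu_write(printk_context_loglevel, 0);
	preempt_enable();
}

/*
 * vprintk_emit() would then prefer the override over the message's own
 * KERN_<n> level while it is non-zero.
 */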

Now...
I'm not sure if this whole thing is up to "printk maintainers only".
If no one is going to use "emergency printk contexts" then there is
no point in having that code in the kernel.

-ss


[PATCH] powerpc/powernv: Disable native PCIe port management

2019-11-13 Thread Oliver O'Halloran
On PowerNV the PCIe topology is (currently) managed by the powernv platform
code in cooperation with firmware. The PCIe-native service drivers bypass
both, and this can cause problems.

Historically this hasn't been a big deal since the only port service
driver that saw much use was the AER driver. The AER driver relies on
a kernel service to report when errors occur rather than acting autonomously,
so it's fairly easy to ignore. On PowerNV (and pseries) AER events are
handled through EEH, which ignores the AER service, so it's never been
an issue.

Unfortunately, the hotplug port service driver (pciehp) does act
autonomously and conflicts with the platform specific hotplug
driver (pnv_php). The main issue is that pciehp claims the interrupt
associated with the PCIe capability which in turn prevents pnv_php from
claiming it.

This results in hotplug events being handled by pciehp, which does not
notify firmware when the PCIe topology changes, and does not set up/tear down
the arch-specific PCI device structures (pci_dn) when the topology changes.
The end result is that hot-added devices cannot be enabled and hot-removed
devices may not be fully torn down on removal.

We can fix these problems by setting the "pcie_ports_disabled" flag during
platform initialisation. The flag indicates the platform owns the PCIe
ports which stops the portbus driver being registered.

Cc: Sergey Miroshnichenko 
Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver")
Signed-off-by: Oliver O'Halloran 
---
Sergey, just FYI. I'll try to sort out the rest of the hotplug
trainwreck in 5.6.

The Fixes: here is for the patch that added pnv_php in 4.8. It's been
a problem since then, but wasn't noticed until people started testing
it after the EEH fixes in commit 799abe283e51 ("powerpc/eeh: Clean up
EEH PEs after recovery finishes") went in earlier in the 5.4 cycle.
---
 arch/powerpc/platforms/powernv/pci.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 2825d00..ae62583 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -941,6 +941,23 @@ void __init pnv_pci_init(void)
 
pci_add_flags(PCI_CAN_SKIP_ISA_ALIGN);
 
+#ifdef CONFIG_PCIEPORTBUS
+   /*
+* On PowerNV PCIe devices are (currently) managed in cooperation
+* with firmware. This isn't *strictly* required, but there's enough
+* assumptions baked into both firmware and the platform code that
+* it's unwise to allow the portbus services to be used.
+*
+* We need to fix this eventually, but for now set this flag to disable
+* the portbus driver. The AER service isn't required since AER
+* events are handled via EEH. The pciehp hotplug driver can't work
+* without kernel changes (and portbus binding breaks pnv_php). The
+* other services also require some thinking about how we're going
+* to integrate them.
+*/
+   pcie_ports_disabled = true;
+#endif
+
/* If we don't have OPAL, eg. in sim, just skip PCI probe */
if (!firmware_has_feature(FW_FEATURE_OPAL))
return;
-- 
2.9.5



Re: Pull request: scottwood/linux.git next

2019-11-13 Thread Michael Ellerman
Scott Wood  writes:
> This contains KASLR support for book3e 32-bit.
>
> The following changes since commit 612ee81b9461475b5a5612c2e8d71559dd3c7920:
>
>   powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set} 
> (2019-10-10 20:15:53 +1100)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next
>
> for you to fetch changes up to 9df1ef3f1376ec5d3a1b51a4546c94279bcd88ca:
>
>   powerpc/fsl_booke/32: Document KASLR implementation (2019-10-21 16:09:16 
> -0500)
>
> 
> Jason Yan (12):
>   powerpc: unify definition of M_IF_NEEDED
>   powerpc: move memstart_addr and kernstart_addr to init-common.c
>   powerpc: introduce kernstart_virt_addr to store the kernel base
>   powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helper
>   powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
>   powerpc/fsl_booke/32: implement KASLR infrastructure

This commit breaks booting on the qemu mac99 machine, using pmac32_defconfig.

  $ qemu-system-ppc -nographic -vga none -M mac99 -m 1G -kernel vmlinux 
  >> =============================================================
  >> OpenBIOS 1.1 [Oct 5 2018 08:21]
  >> Configuration device id QEMU version 1 machine id 1
  >> CPUs: 1
  >> Memory: 1024M
  >> UUID: 00000000-0000-0000-0000-000000000000
  >> CPU type PowerPC,G4
  milliseconds isn't unique.
  Welcome to OpenBIOS v1.1 built on Oct 5 2018 08:21
  >> [ppc] Kernel already loaded (0x01000000 + 0x009d2920) (initrd 0x00000000 + 
0x00000000)
  >> [ppc] Kernel command line: 
  >> switching to new context:
  OF stdout device is: /pci@f200/mac-io@c/escc@13000/ch-a@13020
  Preparing to boot Linux version 5.4.0-rc2-gcc49-05398-g4e1bb50 
(michael@alpine1-p1) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #30 
Wed Nov 13 15:07:39 AEDT 2019
  Detected machine type: 0400
  command line: 
  memory layout at init:
    memory_limit : 00000000 (16 MB aligned)
    alloc_bottom : 019d7000
    alloc_top    : 30000000
    alloc_top_hi : 40000000
    rmo_top      : 30000000
    ram_top      : 40000000
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x019d8000 -> 0x019d70a4
  Device tree struct  0x019d9000 -> 0x3fde7eb0
  Quiescing Open Firmware ...
  Booting Linux via __start() @ 0x01000000 ...

And no output after that.

The problematic hunk is the change to early_32.c:

diff --git a/arch/powerpc/kernel/early_32.c b/arch/powerpc/kernel/early_32.c
index 3482118ffe76..6f8689d7ca7b 100644
--- a/arch/powerpc/kernel/early_32.c
+++ b/arch/powerpc/kernel/early_32.c
@@ -22,7 +22,8 @@ notrace unsigned long __init early_init(unsigned long dt_ptr)
unsigned long offset = reloc_offset();
 
/* First zero the BSS */
-   memset(PTRRELOC(&__bss_start), 0, __bss_stop - __bss_start);
+   if (kernstart_virt_addr == KERNELBASE)
+   memset(PTRRELOC(&__bss_start), 0, __bss_stop - __bss_start);
 
/*
 * Identify the CPU type and fix up code sections
@@ -32,5 +33,5 @@ notrace unsigned long __init early_init(unsigned long dt_ptr)
 
apply_feature_fixups();
 
-   return KERNELBASE + offset;
+   return kernstart_virt_addr + offset;
 }


It needs to use PTRRELOC() for the kernstart_virt_addr accesses.
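
Roughly like the following (a sketch; the squashed fix may differ in
detail):

	/* before relocation, globals must be accessed via PTRRELOC() */
	if (*PTRRELOC(&kernstart_virt_addr) == KERNELBASE)
		memset(PTRRELOC(&__bss_start), 0, __bss_stop - __bss_start);

	/* ... */

	return *PTRRELOC(&kernstart_virt_addr) + offset;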

I've made that change and squashed it into the series. I've pushed that
as a branch to here:
  https://github.com/linuxppc/linux/commits/topic/kaslr-book3e32


That boots for me on qemu mac99.

Jason, can you please test it on your setup with KASLR enabled to make
sure it still works.

cheers


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread John Hubbard

On 11/13/19 12:22 AM, Daniel Vetter wrote:
...
> > > Why are we doing this? I think things got confused here someplace, as
> >
> > Because:
> >
> > a) These need put_page() calls,  and
> >
> > b) there is no put_pages() call, but there is a release_pages() call that
> > is, arguably, what put_pages() would be.
> >
> > > the comment still says:
> > >
> > > /**
> > >   * put_user_page() - release a gup-pinned page
> > >   * @page:pointer to page to be released
> > >   *
> > >   * Pages that were pinned via get_user_pages*() must be released via
> > >   * either put_user_page(), or one of the put_user_pages*() routines
> > >   * below.
> >
> > Ohhh, I missed those comments. They need to all be changed over to
> > say "pages that were pinned via pin_user_pages*() or
> > pin_longterm_pages*() must be released via put_user_page*()."
> >
> > The get_user_pages*() pages must still be released via put_page.
> >
> > The churn is due to a fairly significant change in strategy, which
> > is: instead of changing all get_user_pages*() sites to call
> > put_user_page(), change selected sites to call pin_user_pages*() or
> > pin_longterm_pages*(), plus put_user_page().
>
> Can't we call this unpin_user_page then, for some symmetry? Or is that
> even more churn?
>
> Looking from afar the naming here seems really confusing.

That look from afar is valuable, because I'm too close to the problem to see
how the naming looks. :)

unpin_user_page() sounds symmetrical. It's true that it would cause more
churn (which is why I started off with a proposal that avoids changing the
names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
to the change in direction here, and it's really only 10 or 20 lines changed,
in the end.

So I'm open to changing to that naming. It would be nice to hear what others
prefer, too...


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 31/33] powerpc/64: make buildable without CONFIG_COMPAT

2019-11-13 Thread Michal Suchánek
On Wed, Nov 13, 2019 at 01:02:34PM +1000, Nicholas Piggin wrote:
> Michal Suchanek's on November 13, 2019 2:52 am:
> > There are numerous references to 32bit functions in generic and 64bit
> > code so ifdef them out.
> > 
> > Signed-off-by: Michal Suchanek 
> 
> For the most part these seem okay to me.
> 
> > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> > index 45f1d5e54671..35874119b398 100644
> > --- a/arch/powerpc/kernel/Makefile
> > +++ b/arch/powerpc/kernel/Makefile
> > @@ -44,16 +44,16 @@ CFLAGS_btext.o += -DDISABLE_BRANCH_PROFILING
> >  endif
> >  
> >  obj-y  := cputable.o ptrace.o syscalls.o \
> > -  irq.o align.o signal_32.o pmc.o vdso.o \
> > +  irq.o align.o signal_$(BITS).o pmc.o vdso.o \
> >process.o systbl.o idle.o \
> >signal.o sysfs.o cacheinfo.o time.o \
> >prom.o traps.o setup-common.o \
> >udbg.o misc.o io.o misc_$(BITS).o \
> >of_platform.o prom_parse.o
> > -obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \
> > -  signal_64.o ptrace32.o \
> > +obj-$(CONFIG_PPC64)+= setup_64.o \
> >paca.o nvram_64.o firmware.o note.o \
> >syscall_64.o
> > +obj-$(CONFIG_COMPAT)   += sys_ppc32.o ptrace32.o signal_32.o
> >  obj-$(CONFIG_VDSO32)   += vdso32/
> >  obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o
> >  obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 00173cc904ef..c339a984958f 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -52,8 +52,10 @@
> >  SYS_CALL_TABLE:
> > .tc sys_call_table[TC],sys_call_table
> >  
> > +#ifdef CONFIG_COMPAT
> >  COMPAT_SYS_CALL_TABLE:
> > .tc compat_sys_call_table[TC],compat_sys_call_table
> > +#endif
> >  
> >  /* This value is used to mark exception frames on the stack. */
> >  exception_marker:
> > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> > index 60436432399f..61678cb0e6a1 100644
> > --- a/arch/powerpc/kernel/signal.c
> > +++ b/arch/powerpc/kernel/signal.c
> > @@ -247,7 +247,6 @@ static void do_signal(struct task_struct *tsk)
> > sigset_t *oldset = sigmask_to_save();
> > struct ksignal ksig = { .sig = 0 };
> > int ret;
> > -   int is32 = is_32bit_task();
> >  
> > BUG_ON(tsk != current);
> >  
> > @@ -277,7 +276,7 @@ static void do_signal(struct task_struct *tsk)
> >  
> > rseq_signal_deliver(&ksig, tsk->thread.regs);
> >  
> > -   if (is32) {
> > +   if (is_32bit_task()) {
> > if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> > ret = handle_rt_signal32(&ksig, oldset, tsk);
> > else
> 
> This is just a clean up I guess.

It also expands directly to if(0) or if(1) for the !COMPAT cases. I am
not sure how it would work with the intermediate variable.

There was more complex change initially but after some additional
cleanups removing the variable is the only part left.
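
(For context: with !CONFIG_COMPAT the helper can collapse to a
compile-time constant, roughly like this sketch -- not the exact
powerpc definition:)

#ifdef CONFIG_COMPAT
#define is_32bit_task()	test_thread_flag(TIF_32BIT)
#else
#define is_32bit_task()	(0)
#endif

so "if (is_32bit_task())" becomes "if (0)" and the compiler can drop the
whole branch.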

> 
> > diff --git a/arch/powerpc/kernel/syscall_64.c 
> > b/arch/powerpc/kernel/syscall_64.c
> > index d00cfc4a39a9..319ebd4f494d 100644
> > --- a/arch/powerpc/kernel/syscall_64.c
> > +++ b/arch/powerpc/kernel/syscall_64.c
> > @@ -17,7 +17,6 @@ typedef long (*syscall_fn)(long, long, long, long, long, 
> > long);
> >  
> >  long system_call_exception(long r3, long r4, long r5, long r6, long r7, 
> > long r8, unsigned long r0, struct pt_regs *regs)
> >  {
> > -   unsigned long ti_flags;
> > syscall_fn f;
> >  
> > if (IS_ENABLED(CONFIG_PPC_BOOK3S))
> > @@ -64,8 +63,7 @@ long system_call_exception(long r3, long r4, long r5, 
> > long r6, long r7, long r8,
> >  
> > __hard_irq_enable();
> >  
> > -   ti_flags = current_thread_info()->flags;
> > -   if (unlikely(ti_flags & _TIF_SYSCALL_DOTRACE)) {
> > +   if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
> > /*
> >  * We use the return value of do_syscall_trace_enter() as the
> >  * syscall number. If the syscall was rejected for any reason
> > @@ -81,7 +79,7 @@ long system_call_exception(long r3, long r4, long r5, 
> > long r6, long r7, long r8,
> > /* May be faster to do array_index_nospec? */
> > barrier_nospec();
> >  
> > -   if (unlikely(ti_flags & _TIF_32BIT)) {
> > +   if (unlikely(is_32bit_task())) {
> > f = (void *)compat_sys_call_table[r0];
> >  
> > r3 &= 0xULL;
> 
> I guess this is okay, I did want to be careful about where ti_flags
> was loaded exactly, but I think DOTRACE and 32BIT are not volatile.
> Is it possible to define _TIF_32BIT to zero for 64-bit !compat case
> and have the origina

Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread Daniel Vetter
On Tue, Nov 12, 2019 at 10:10 PM John Hubbard  wrote:
>
> On 11/12/19 12:38 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
> >> Hi,
> >>
> >> The cover letter is long, so the more important stuff is first:
> >>
> >> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
> >>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
> >>   remote and longterm gup matches what we discussed during the review
> >>   of v2, I'd appreciate it.
> >>
> >> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
> >>   converting from put_user_pages() to release_pages().
> >
> > Why are we doing this? I think things got confused here someplace, as
>
>
> Because:
>
> a) These need put_page() calls,  and
>
> b) there is no put_pages() call, but there is a release_pages() call that
> is, arguably, what put_pages() would be.
>
>
> > the comment still says:
> >
> > /**
> >  * put_user_page() - release a gup-pinned page
> >  * @page:pointer to page to be released
> >  *
> >  * Pages that were pinned via get_user_pages*() must be released via
> >  * either put_user_page(), or one of the put_user_pages*() routines
> >  * below.
>
>
> Ohhh, I missed those comments. They need to all be changed over to
> say "pages that were pinned via pin_user_pages*() or
> pin_longterm_pages*() must be released via put_user_page*()."
>
> The get_user_pages*() pages must still be released via put_page.
>
> The churn is due to a fairly significant change in strategy, which
> is: instead of changing all get_user_pages*() sites to call
> put_user_page(), change selected sites to call pin_user_pages*() or
> pin_longterm_pages*(), plus put_user_page().

Can't we call this unpin_user_page then, for some symmetry? Or is that
even more churn?

Looking from afar the naming here seems really confusing.
-Daniel

> That allows incrementally converting the kernel over to using the
> new pin APIs, without taking on the huge risk of a big one-shot
> conversion.
>
> So, I've ended up with one place that actually needs to get reverted
> back to get_user_pages(), and that's the IB ODP code.
>
> >
> > I feel like if put_user_pages() is not the correct way to undo
> > get_user_pages() then it needs to be deleted.
> >
>
> Yes, you're right. I'll fix the put_user_page comments() as described.
>
>
> thanks,
>
> John Hubbard
> NVIDIA



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch