[PATCH V3 06/10] x86/entry: Preserve PKRS MSR across exceptions

2020-11-06 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved across a context
switch, but that support alone leaves exception handling code open to memory
accesses during exceptions.

Two possible places for preserving this state were considered:
irqentry_state_t and pt_regs.[1]  pt_regs was much more complicated and
potentially fraught with unintended consequences.[2]  irqentry_state_t is
already an object used in exception handling and is straightforward.  It also
easily tracks any number of nested states and can eventually be enhanced to
store the reference counting required to support PKS through kmap reentrancy.

Preserve the current task's PKRS value in irqentry_state_t on exception
entry and restore it on exception exit.

Each nested exception is further saved allowing for any number of levels
of exception handling.

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] 
https://lore.kernel.org/lkml/calcetrve1i5jdyzd_bcctxqjn+ze3t38efpgjxn1f577m36...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf@nanos.tec.linutronix.de/#t

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Suggested-by: Peter Zijlstra 
Suggested-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
remove redundant irq_state->pkrs
This value is only needed for the global tracking.  So
it should be included in that patch and not in this one.

Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
 arch/x86/entry/common.c | 38 +
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/mm/pkeys.c |  2 +-
 include/linux/entry-common.h| 13 ++
 kernel/entry/common.c   | 14 +--
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..1b6a419a6fac 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect this memory from access within exceptions, we save the
+ * current running value and set the PKRS value to be used for the duration
+ * of the exception, thus preventing exception handlers from having the
+ * elevated access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   irq_state->thread_pkrs = current->thread.saved_pkrs;
+   write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(irq_state->thread_pkrs);
+   current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+   /* Normally called by irqentry_exit, we must restore pkrs here */
+   irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 801a75615209..11a95e6efd2d 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -27,9 +27,10 @@
 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 76a62419c446..6892d4524868 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -2
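
For orientation, here is a hedged sketch of how the generic entry code listed
in the diffstat (include/linux/entry-common.h and kernel/entry/common.c) could
carry and use the saved PKRS value.  The field and hook placement below are
illustrative assumptions, not the literal hunks from this patch.

/*
 * Illustrative sketch only -- not the patch's exact hunks.  Assumes the
 * thread_pkrs field used by irq_save_set_pkrs()/irq_restore_pkrs() above
 * lives in irqentry_state_t and that the generic entry/exit paths call
 * the helpers around the existing RCU/lockdep work.
 */
typedef struct irqentry_state {
	union {
		bool	exit_rcu;
		bool	lockdep;
	};
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
	u32	thread_pkrs;	/* interrupted task's PKRS value */
#endif
} irqentry_state_t;

noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
	irqentry_state_t irq_state = { };

	/* ... existing enter_from_user_mode()/RCU/lockdep handling ... */

	/* Stash the task's PKRS and drop to the default (protected) value */
	irq_save_set_pkrs(&irq_state, INIT_PKRS_VALUE);

	return irq_state;
}

noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t irq_state)
{
	/* Restore the interrupted task's PKRS before returning to it */
	irq_restore_pkrs(&irq_state);

	/* ... existing exit_to_user_mode()/RCU/lockdep handling ... */
}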

[PATCH V3 08/10] x86/pks: Add PKS kernel API

2020-11-06 Thread ira . weiny
From: Fenghua Yu 

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mknoaccess(),
pks_mkread(), pks_mkrdwr(), and pks_key_free().  Add 2 new macros:
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from V2
From Greg KH
Replace all WARN_ON_ONCE() uses with pr_err()
From Dan Williams
Add __must_check to pks_key_alloc() to help ensure users
are using the API correctly

Changes from V1
Per Dave Hansen
Add flags to pks_key_alloc() to help future proof the
interface if/when the key space is exhausted.

Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 102 +---
 arch/x86/include/asm/pgtable_types.h   |  12 ++
 arch/x86/include/asm/pkeys.h   |  11 ++
 arch/x86/include/asm/pkeys_common.h|   4 +
 arch/x86/mm/pkeys.c| 128 +
 include/linux/pgtable.h|   4 +
 include/linux/pkeys.h  |  24 
 7 files changed, 267 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..c4e6c480562f 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ==
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  And It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page (User and Supervisor).  Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,58 @@ with a read()::
 The kernel will send a SIGSEG
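
As an illustration of the API described above, here is a hedged sketch of a
kernel consumer.  It uses the renamed helpers noted in the changelog
(pks_mk_noaccess()/pks_mk_readonly()/pks_mk_readwrite()); the allocation
arguments, error handling, and mapping path shown are assumptions, not the
exact interface from this patch.

#include <linux/mm.h>
#include <linux/pkeys.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static int my_pkey;
static void *my_protected_vaddr;

static int my_driver_pks_init(struct page **pages, unsigned int nr_pages)
{
	/*
	 * Allocation arguments are illustrative; the changelog only says a
	 * flags parameter was added to pks_key_alloc().
	 */
	my_pkey = pks_key_alloc("my_driver", 0);
	if (my_pkey < 0)
		return my_pkey;		/* -EOPNOTSUPP when PKS is unsupported */

	/* Map the pages with the pkey encoded in the PTE protection bits */
	my_protected_vaddr = vmap(pages, nr_pages, VM_MAP,
				  PAGE_KERNEL_PKEY(my_pkey));
	if (!my_protected_vaddr) {
		pks_key_free(my_pkey);
		return -ENOMEM;
	}

	pks_mk_noaccess(my_pkey);	/* default the domain to no access */
	return 0;
}

static void my_driver_touch_data(void)
{
	pks_mk_readwrite(my_pkey);	/* open the window for this thread */
	memset(my_protected_vaddr, 0, PAGE_SIZE);
	pks_mk_noaccess(my_pkey);	/* close it again */
}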

[PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Changes from V2 [4]
Rebased on tip-tree/core/entry
From Thomas Gleixner
Address bisectability
Drop Patch:
x86/entry: Move nmi entry/exit into common code
From Greg KH
Remove WARN_ON's
From Dan Williams
Add __must_check to pks_key_alloc()
New patch: x86/pks: Add PKS defines and config options
Split from Enable patch to build on through the series
Fix compile errors

Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series.  That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

Two use cases for PKS are being developed: trusted keys and PMEM.  Trusted
keys is a newer use case which is still being explored.  PMEM was submitted as
part of the RFC (v2) series[1].  However, since then it was found that some
callers of kmap() require a global implementation of PKS.  Specifically, some
users of kmap() expect mappings to be available to all kernel threads.  While
global use of PKS is rare, it needs to be included for correctness.
Unfortunately, the kmap() updates required a large patch series to make the
needed changes at the various kmap() call sites, so that patch set has been
split out.  Because the global PKS feature is only required for that use case,
it will be deferred to that set as well.[2]  This patch set is being submitted
as a precursor to both of those use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:


https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections, and Accesses or Writes can be
disabled via an MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
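
As an illustration (not code from this series), assuming PKRS uses the same
two-bits-per-key layout as the userspace PKRU register (Access Disable and
Write Disable per key), an update looks roughly like the following; the helper
and macro names here are mine.

#include <linux/types.h>

#define PKR_AD_BIT		0x1u
#define PKR_WD_BIT		0x2u
#define PKR_BITS_PER_PKEY	2

static inline u32 pkrs_set_key(u32 pkrs, int pkey, u32 prot)
{
	u32 shift = pkey * PKR_BITS_PER_PKEY;

	pkrs &= ~((PKR_AD_BIT | PKR_WD_BIT) << shift);	/* clear the old bits */
	return pkrs | (prot << shift);			/* install the new ones */
}

/* e.g. make key 5 write-disabled, then update the MSR (no TLB flush needed):
 *	new_pkrs = pkrs_set_key(cur_pkrs, 5, PKR_WD_BIT);
 *	wrmsrl(MSR_IA32_PKRS, new_pkrs);
 */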

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys (1-15) are allocated by an allocator, which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail, either because of key exhaustion or because PKS is not supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

   1) Fast switching of permissions
1a) Access can be prevented without page table manipulations
1b) No TLB flushes required
   2) Wo

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-04 Thread Ira Weiny
On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > > "Leases are associated with an open file description (see open(2)).  
> > > > This means
> > > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > > dup(2))
> > > > refer to the same lease, and this lease may be modified or released 
> > > > using any
> > > > of these descriptors.  Furthermore,  the lease is released by either an
> > > > explicit F_UNLCK operation on any of these duplicate file descriptors, 
> > > > or when
> > > > all such file descriptors have been closed."
> > > 
> > > Right, the lease is attached to the struct file, so it follows
> > > where-ever the struct file goes. That doesn't mean it's actually
> > > useful when the struct file is duplicated and/or passed to another
> > > process. :/
> > > 
> > > AFAICT, the problem is that when we take another reference to the
> > > struct file, or when the struct file is passed to a different
> > > process, nothing updates the lease or lease state attached to that
> > > struct file.
> > 
> > Ok, I probably should have made this more clear in the cover letter but 
> > _only_
> > the process which took the lease can actually pin memory.
> 
> Sure, no question about that.
> 
> > That pinned memory _can_ be passed to another process but those 
> > sub-process' can
> > _not_ use the original lease to pin _more_ of the file.  They would need to
> > take their own lease to do that.
> 
> Yes, they would need a new lease to extend it. But that ignores the
> fact they don't have a lease on the existing pins they are using and
> have no control over the lease those pins originated under.  e.g.
> the originating process dies (for whatever reason) and now we have
> pins without a valid lease holder.

Define "valid lease holder"?

> 
> If something else now takes an exclusive lease on the file (because
> the original exclusive lease no longer exists), it's not going to
> work correctly because of the zombied page pins caused by closing
> the exclusive lease they were gained under. IOWs, pages pinned under
> an exclusive lease are no longer "exclusive" the moment the original
> exclusive lease is dropped, and pins passed to another process are
> no longer covered by the original lease they were created under.

The page pins are not zombied; the lease is.  The lease still exists, and it
can't be dropped while the pins are in place.  I need to double check the
implementation, but that was the intent.

Yep, just did a quick check; I have a test for that.  If the page pins exist
then the lease can _not_ be released.  Closing the FD will "zombie" the lease,
but it and the struct file will still exist until the pins go away.

Furthermore, a "zombie" lease is _not_ sufficient to pin more pages.  (I have a
test for this too.)  I apologize that I don't have something to submit to
xfstests.  I'm new to that code base.

I'm happy to share the code I have which I've been using to test...  But it is
pretty rough as it has undergone a number of changes.  I think it would be
better to convert my test series to xfstests.

However, I don't know if it is ok to require RDMA within those tests.  Right
now that is the only sub-system I have allowed to create these page pins.  So
I'm not sure what to do at this time.  I'm open to suggestions.

> 
> > Sorry for not being clear on that.
> 
> I know exactly what you are saying. What I'm failing to get across
> is that file layout leases don't actually allow the behaviour you
> want to have.

Not currently, no.  But we are discussing the semantics to allow them _to_ have
the behavior needed.

> 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> 
> Regardless, we still require an active lease for long term pins so
> that other lease holders fail operations appropriately. And that
> exclusive lease must follow the process that pins the pages so that
> the life cycle is the same...

I disagree.  See below.

> 
> > > Indeed, even closing the fd the lease was taken on without
> > > F

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > > for the
> > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >uverbs_destroy_ufile_hw()
> > > > >   mutex_lock()
> > > > >[..]
> > > > > mmput()
> > > > >  exit_mmap()
> > > > >   remove_vma()
> > > > >fput();
> > > > > file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully 
> > solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is 
> > safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Process A has the RDMA context FD and data file FD (with lease) open.
> > 
> > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> 
> Passing the RDMA context dependent on a file layout lease to another
> process that doesn't have a file layout lease or a reference to the
> original lease should be considered a violation of the layout lease.
> Process B does not have an active layout lease, and so by the rules
> of layout leases, it is not allowed to pin the layout of the file.
> 

I don't disagree with the semantics of this.  I just don't know how to enforce
it.

> > Process A attempts to exit (hangs because data file FD is pinned).
> > 
> > Admin kills process A.  kill works because we have allowed for it...
> > 
> > Process B _still_ has the RDMA context FD open _and_ therefore still holds 
> > the
> > file pins.
> > 
> > Truncation still fails.
> > 
> > Admin does not know which process is holding the pin.
> > 
> > What am I missing?
> 
> Application does not hold the correct file layout lease references.
> Passing the fd via SCM_RIGHTS to a process without a layout lease
> is equivalent to not using layout leases in the first place.

Ok, so if I understand you correctly, you would support a failure of
SCM_RIGHTS in this case?  I'm ok with that, but I'm not sure how to implement
it right now.

To that end, I would like to simplify this slightly, because I'm not convinced
that SCM_RIGHTS is a problem we need to solve right now.  I.e., I don't know of
a user who wants to do this.

Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
definition leases) exist underneath the "RDMA FD" (or other direct access FD,
like XDP etc) being duplicated.  Later, if this becomes a use case we will need
to code up the proper checks, potentially within each of the subsystems.  This
is because, with RDMA at least, there are potentially large numbers of MR's and
file leases which may have to be checked.

Ira



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > > 
> > > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > > deadlock..
> > > > 
> > > > The discussion between Jan and Dave was concerning what happens when a 
> > > > user
> > > > calls
> > > > 
> > > > fd = open()
> > > > fnctl(...getlease...)
> > > > addr = mmap(fd...)
> > > > ib_reg_mr() 
> > > > munmap(addr...)
> > > > close(fd)
> > > 
> > > I don't see how blocking close(fd) could work.
> > 
> > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> > can't prove it won't..
> 
> Right, I proposed it as a possible way of making sure application
> developers don't do this. It _could_ be made to work (e.g. recording
> longterm page pins on the vma->file), but this is tangential to 
> the discussion of requiring active references to all resources
> covered by the layout lease.
> 
> I think allowing applications to behave like the above is simply
> poor system level design, regardless of the interaction with
> filesystems and layout leases.
> 
> > Maybe we are all just touching a different part of this
> > elephant[1] but the above scenario or one without munmap is very reasonably
> > something a user would do.  So we can either allow the close to complete (my
> > current patches) or try to make it block like Dave is suggesting.

My belief when writing the current series was that hanging the close would
cause deadlock.  But it seems I was wrong because of the delayed __fput().

So far, I have not been able to get RDMA to have an issue like Jason suggested
would happen (or used to happen).  So from that perspective it may be ok to
hang the close.

> > 
> > I don't disagree with Dave with the semantics being nice and clean for the
> > filesystem.
> 
> I'm not trying to make it "nice and clean for the filesystem".
> 
> The problem is not just RDMA/DAX - anything that is directly
> accessing the block device under the filesystem has the same set of
> issues. That is, the filesystem controls the life cycle of the
> blocks in the block device, so direct access to the blocks by any
> means needs to be co-ordinated with the filesystem. Pinning direct
> access to a file via page pins attached to a hardware context that
> the filesystem knows nothing about is not an access model that the
> filesystems can support.
> 
> IOWs, anyone looking at this problem just from the RDMA POV of page
> pins is not seeing all the other direct storage access mechainsms
> that we need to support in the filesystems. RDMA on DAX is just one
> of them.  pNFS is another. Remote acces via NVMeOF is another. XDP
> -> DAX (direct file data placement from the network hardware) is
> another. There are /lots/ of different direct storage access
> mechanisms that filesystems need to support and we sure as hell do
> not want to have to support special case semantics for every single
> one of them.

My use of struct file was based on the fact that FDs are a primary interface
for Linux, and my thought was that they would be more universal than having
file pin information stored in an RDMA-specific structure.

XDP is not as direct; it uses sockets.  But sockets also have a struct file
which I believe could be used in a similar manner.  I'm not 100% sure of the
xdp_umem lifetime yet but it seems that my choice of using struct file was a
good one in this respect.

> 
> Hence if we don't start with a sane model for arbitrating direct
> access to the storage at the filesystem level we'll never get this
> stuff to work reliably, let alone work together coherently.  An
> application that wants a direct data path to storage should have a
> single API that enables then to safely access the storage,
> regardless of how they are accessing the storage.
> 
> From that perspective, what we are talking about here with RDMA
> doing "mmap, page pin, unmap, close" and "pass page pins via
> SCM_RIGHTS" are fundamentally unworkable from the filesystem
> perspective. They are use-after-free situations from the filesystem
> perspective - they do not hold direct references to anything in the
> filesystem, and so the filesytem is completely unaware of them.

I see your point of view but looking at it from a different point of view 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to 
> > > > work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.
> 
> > IMHO it is wrong to try and create a model where the file lease exists
> > independently from the kernel object relying on it. In other words the
> > IB MR object itself should hold a reference to the lease it relies
> > upon to function properly.
> 
> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.
> 
> That is, a lease specifically tracks the pid and the _open fd_ it
> was obtained for, so it is essentially owned by a specific process
> context.  Hence a lease is not able to be passed to a separate
> process context and have it still work correctly for lease break
> notifications.  i.e. the layout break signal gets delivered to
> original process that created the struct file, if it still exists
> and has the original fd still open. It does not get sent to the
> process that currently holds a reference to the IB context.
>

The fcntl man page says:

"Leases are associated with an open file description (see open(2)).  This means
that duplicate file descriptors (created by, for example, fork(2) or dup(2))
refer to the same lease, and this lease may be modified or released using any
of these descriptors.  Furthermore,  the lease is released by either an
explicit F_UNLCK operation on any of these duplicate file descriptors, or when
all such file descriptors have been closed."

From this I took it that the child process FD would have the lease as well
_and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
seem to work the same way as dup() so I'm not so sure.

Ira

> 
> So while a struct file passed to another process might still have
> an active lease, and you can change the owner of the struct file
> via fcntl(F_SETOWN), you can't associate the existing lease with a
> the new fd in the new process and so layout break signals can't be
> directed at the lease fd
> 
> This really means that a lease can only be owned by a single process
> context - it can't be shared across multiple processes (so I was
> wrong about dup/pass as being a possible way of passing them)
> because there's only one process that can "own" a struct file, and
> that where signals are sent when the lease needs to be broken.
> 
> So, fundamentally, if you want to pass a resource that pins a file
> layout between processes, both processes need to hold a layout lease
> on that file range. And that means exclusive leases and passing
> layouts between processes are fundamentally incompatible because you
> can't hold two exclusive leases on the same file range
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [PATCH] IB/core: Add mitigation for Spectre V1

2019-07-30 Thread Ira Weiny
On Tue, Jul 30, 2019 at 01:24:07PM -0700, Tony Luck wrote:
> Some processors may mispredict an array bounds check and
> speculatively access memory that they should not. With
> a user supplied array index we like to play things safe
> by masking the value with the array size before it is
> used as an index.
> 
> Signed-off-by: Tony Luck 

Reviewed-by: Ira Weiny 
Tested-by: Ira Weiny 

> ---
> 
> [I don't have h/w, so just compile tested]
> 
>  drivers/infiniband/core/user_mad.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/infiniband/core/user_mad.c 
> b/drivers/infiniband/core/user_mad.c
> index 9f8a48016b41..fdce254e4f65 100644
> --- a/drivers/infiniband/core/user_mad.c
> +++ b/drivers/infiniband/core/user_mad.c
> @@ -49,6 +49,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -888,6 +889,7 @@ static int ib_umad_unreg_agent(struct ib_umad_file *file, u32 __user *arg)
>   mutex_lock(&file->port->file_mutex);
>   mutex_lock(&file->mutex);
>  
> + id = array_index_nospec(id, IB_UMAD_MAX_AGENTS);
>   if (id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) {
>   ret = -EINVAL;
>   goto out;
> -- 
> 2.20.1
> 


Re: [PATCH] IB/core: Add mitigation for Spectre V1

2019-07-30 Thread Ira Weiny
On Tue, Jul 30, 2019 at 06:52:12PM -0500, Gustavo A. R. Silva wrote:
> 
> 
> On 7/30/19 3:24 PM, Tony Luck wrote:
> > Some processors may mispredict an array bounds check and
> > speculatively access memory that they should not. With
> > a user supplied array index we like to play things safe
> > by masking the value with the array size before it is
> > used as an index.
> > 
> > Signed-off-by: Tony Luck 
> > ---
> > 
> > [I don't have h/w, so just compile tested]
> > 
> >  drivers/infiniband/core/user_mad.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/user_mad.c 
> > b/drivers/infiniband/core/user_mad.c
> > index 9f8a48016b41..fdce254e4f65 100644
> > --- a/drivers/infiniband/core/user_mad.c
> > +++ b/drivers/infiniband/core/user_mad.c
> > @@ -49,6 +49,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  
> > @@ -888,6 +889,7 @@ static int ib_umad_unreg_agent(struct ib_umad_file *file, u32 __user *arg)
> > mutex_lock(&file->port->file_mutex);
> > mutex_lock(&file->mutex);
> >  
> > +   id = array_index_nospec(id, IB_UMAD_MAX_AGENTS);
> 
> This is wrong. This prevents the below condition id >= IB_UMAD_MAX_AGENTS
> from ever being true. And I don't think this is what you want.

Ah, yeah...  FWIW this would probably never be hit.

Tony, split the check?

	if (id >= IB_UMAD_MAX_AGENTS) {
		ret = -EINVAL;
		goto out;
	}

	id = array_index_nospec(id, IB_UMAD_MAX_AGENTS);

	if (!__get_agent(file, id)) {
		ret = -EINVAL;
		goto out;
	}

Ira

> 
> > if (id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) {
> > ret = -EINVAL;
> > goto out;
> > 
> 
> --
> Gustavo


[PATCH] fs/xfs: Fix return code of xfs_break_leased_layouts()

2019-08-19 Thread ira . weiny
From: Ira Weiny 

The parentheses used in the while loop would result in error being assigned
the result of the comparison (1 while blocking, 0 on exit) rather than the
intended return code from break_layout().

This fix is required so that -ETXTBSY can be returned from follow-on
break_layout() changes.

Signed-off-by: Ira Weiny 
---
 fs/xfs/xfs_pnfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 0c954cad7449..a339bd5fa260 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -32,7 +32,7 @@ xfs_break_leased_layouts(
struct xfs_inode*ip = XFS_I(inode);
int error;
 
-   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
+   while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
xfs_iunlock(ip, *iolock);
*did_unlock = true;
error = break_layout(inode, true);
-- 
2.20.1



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Ira Weiny
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
>uverbs_destroy_ufile_hw()
>   mutex_lock()
>[..]
> mmput()
>  exit_mmap()
>   remove_vma()
>fput();
> file_operations->release()
>  ib_uverbs_close()
>   uverbs_destroy_ufile_hw()
>mutex_lock()   <-- Deadlock
> 
> But, as I said to Ira earlier, I wonder if this is now impossible on
> modern kernels and we can switch to making the whole thing
> synchronous. That would resolve RDMA's main problem with this.

I'm still looking into this...  but my bigger concern is that the RDMA FD can
be passed to other processes via SCM_RIGHTS, which means the process holding
the pin may _not_ be the one with the open file and layout lease...

Ira



Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()

2019-08-08 Thread Ira Weiny
On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote:
> On 8/8/19 12:20 PM, John Hubbard wrote:
> > On 8/8/19 4:09 AM, Vlastimil Babka wrote:
> >> On 8/8/19 8:21 AM, Michal Hocko wrote:
> >>> On Wed 07-08-19 16:32:08, John Hubbard wrote:
>  On 8/7/19 4:01 AM, Michal Hocko wrote:
> > On Mon 05-08-19 15:20:17, john.hubb...@gmail.com wrote:
> >> From: John Hubbard 
>  Actually, I think follow_page_mask() gets all the pages, right? And the
>  get_page() in __munlock_pagevec_fill() is there to allow a 
>  pagevec_release() 
>  later.
> >>>
> >>> Maybe I am misreading the code (looking at Linus tree) but 
> >>> munlock_vma_pages_range
> >>> calls follow_page for the start address and then if not THP tries to
> >>> fill up the pagevec with few more pages (up to end), do the shortcut
> >>> via manual pte walk as an optimization and use generic get_page there.
> >>
> > 
> > Yes, I see it finally, thanks. :)  
> > 
> >> That's true. However, I'm not sure munlocking is where the
> >> put_user_page() machinery is intended to be used anyway? These are
> >> short-term pins for struct page manipulation, not e.g. dirtying of page
> >> contents. Reading commit fc1d8e7cca2d I don't think this case falls
> >> within the reasoning there. Perhaps not all GUP users should be
> >> converted to the planned separate GUP tracking, and instead we should
> >> have a GUP/follow_page_mask() variant that keeps using get_page/put_page?
> >>  
> > 
> > Interesting. So far, the approach has been to get all the gup callers to
> > release via put_user_page(), but if we add in Jan's and Ira's 
> > vaddr_pin_pages()
> > wrapper, then maybe we could leave some sites unconverted.
> > 
> > However, in order to do so, we would have to change things so that we have
> > one set of APIs (gup) that do *not* increment a pin count, and another set
> > (vaddr_pin_pages) that do. 
> > 
> > Is that where we want to go...?
> > 
> 
> Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead
> of get_page(), and also fix the releasing code. So this incremental patch, on
> top of the existing one, should do it:
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b980e6270e8a..2ea272c6fee3 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
> /*
>  * We won't be munlocking this page in the next phase
>  * but we still need to release the follow_page_mask()
> -* pin. We cannot do it under lru_lock however. If it's
> -* the last pin, __page_cache_release() would deadlock.
> +* pin.
>  */
> -   pagevec_add(&pvec_putback, pvec->pages[i]);
> +   put_user_page(pages[i]);
> pvec->pages[i] = NULL;
> }
> __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>  
> -   /* Now we can release pins of pages that we are not munlocking */
> -   pagevec_release(&pvec_putback);
> -

I'm not an expert but this skips a call to lru_add_drain().  Is that ok?

> /* Phase 2: page munlock */
> for (i = 0; i < nr; i++) {
> struct page *page = pvec->pages[i];
> @@ -394,6 +390,8 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> start += PAGE_SIZE;
> while (start < end) {
> struct page *page = NULL;
> +   int ret;
> +
> pte++;
> if (pte_present(*pte))
> page = vm_normal_page(vma, start, *pte);
> @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> if (PageTransCompound(page))
> break;
>  
> -   get_page(page);
> +   /*
> +* Use get_user_pages_fast(), instead of get_page() so that the
> +* releasing code can unconditionally call put_user_page().
> +*/
> +   ret = get_user_pages_fast(start, 1, 0, &page);
> +   if (ret != 1)
> +   break;

I like the idea of making this a get/put pair but I'm feeling uneasy about how
this is really supposed to work.

For sure the GUP/PUP was supposed to be separate from [get|put]_page.

Ira
> /*
>  * Increase the address that will be returned *before* the
>  * eventual break due to pvec becoming full by adding the page
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


Re: [RFC PATCH 2/2] mm/gup: introduce vaddr_pin_pages_remote()

2019-08-15 Thread Ira Weiny
On Thu, Aug 15, 2019 at 03:35:10PM +0200, Jan Kara wrote:
> On Thu 15-08-19 15:26:22, Jan Kara wrote:
> > On Wed 14-08-19 20:01:07, John Hubbard wrote:
> > > On 8/14/19 5:02 PM, John Hubbard wrote:
> > > 
> > > Hold on, I *was* forgetting something: this was a two part thing, and
> > > you're conflating the two points, but they need to remain separate and
> > > distinct. There were:
> > > 
> > > 1. FOLL_PIN is necessary because the caller is clearly in the use case 
> > > that
> > > requires it--however briefly they might be there. As Jan described it,
> > > 
> > > "Anything that gets page reference and then touches page data (e.g.
> > > direct IO) needs the new kind of tracking so that filesystem knows
> > > someone is messing with the page data." [1]
> > 
> > So when the GUP user uses MMU notifiers to stop writing to pages whenever
> > they are writeprotected with page_mkclean(), they don't really need page
> > pin - their access is then fully equivalent to any other mmap userspace
> > access and filesystem knows how to deal with those. I forgot out this case
> > when I wrote the above sentence.
> > 
> > So to sum up there are three cases:
> > 1) DIO case - GUP references to pages serving as DIO buffers are needed for
> >relatively short time, no special synchronization with page_mkclean() or
> >munmap() => needs FOLL_PIN
> > 2) RDMA case - GUP references to pages serving as DMA buffers needed for a
> >long time, no special synchronization with page_mkclean() or munmap()
> >=> needs FOLL_PIN | FOLL_LONGTERM
> >This case has also a special case when the pages are actually DAX. Then
> >the caller additionally needs file lease and additional file_pin
> >structure is used for tracking this usage.
> > 3) ODP case - GUP references to pages serving as DMA buffers, MMU notifiers
> >used to synchronize with page_mkclean() and munmap() => normal page
> >references are fine.
> 
> I want to add that I'd like to convert users in cases 1) and 2) from using
> GUP to using differently named function. Users in case 3) can stay as they
> are for now although ultimately I'd like to denote such use cases in a
> special way as well...
> 

Ok just to make this clear I threw up my current tree with your patches here:

https://github.com/weiny2/linux-kernel/commits/mmotm-rdmafsdax-b0-v4

I'm talking about dropping the final patch:
05fd2d3afa6b rdma/umem_odp: Use vaddr_pin_pages_remote() in ODP

The other 2 can stay.  I split out the *_remote() call.  We don't have a user
but I'll keep it around for a bit.

This tree is still WIP as I work through all the comments.  So I've not changed
names or variable types etc...  Just wanted to settle this.

Ira



Re: [RFC PATCH 2/2] mm/gup: introduce vaddr_pin_pages_remote()

2019-08-16 Thread Ira Weiny
On Fri, Aug 16, 2019 at 05:41:08PM +0200, Jan Kara wrote:
> On Thu 15-08-19 19:14:08, John Hubbard wrote:
> > On 8/15/19 10:41 AM, John Hubbard wrote:
> > > On 8/15/19 10:32 AM, Ira Weiny wrote:
> > >> On Thu, Aug 15, 2019 at 03:35:10PM +0200, Jan Kara wrote:
> > >>> On Thu 15-08-19 15:26:22, Jan Kara wrote:
> > >>>> On Wed 14-08-19 20:01:07, John Hubbard wrote:
> > >>>>> On 8/14/19 5:02 PM, John Hubbard wrote:
> > ...
> > >> Ok just to make this clear I threw up my current tree with your patches 
> > >> here:
> > >>
> > >> https://github.com/weiny2/linux-kernel/commits/mmotm-rdmafsdax-b0-v4
> > >>
> > >> I'm talking about dropping the final patch:
> > >> 05fd2d3afa6b rdma/umem_odp: Use vaddr_pin_pages_remote() in ODP
> > >>
> > >> The other 2 can stay.  I split out the *_remote() call.  We don't have a 
> > >> user
> > >> but I'll keep it around for a bit.
> > >>
> > >> This tree is still WIP as I work through all the comments.  So I've not 
> > >> changed
> > >> names or variable types etc...  Just wanted to settle this.
> > >>
> > > 
> > > Right. And now that ODP is not a user, I'll take a quick look through my 
> > > other
> > > call site conversions and see if I can find an easy one, to include here 
> > > as
> > > the first user of vaddr_pin_pages_remote(). I'll send it your way if that
> > > works out.
> > > 
> > 
> > OK, there was only process_vm_access.c, plus (sort of) Bharath's sgi-gru
> > patch, maybe eventually [1].  But looking at process_vm_access.c, I think 
> > it is one of the patches that is no longer applicable, and I can just
> > drop it entirely...I'd welcome a second opinion on that...
> 
> I don't think you can drop the patch. process_vm_rw_pages() clearly touches
> page contents and does not synchronize with page_mkclean(). So it is case
> 1) and needs FOLL_PIN semantics.

John, could you send a formal patch using vaddr_pin* so I can add it to the
tree?

Ira

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
> 


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-16 Thread Ira Weiny
On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > Pre-requisites
> > > > ==
> > > > Based on mmotm tree.
> > > > 
> > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > 
> > > > Solution summary
> > > > 
> > > > 
> > > > The real issue is that there is no use case for a user to have RDMA 
> > > > pinn'ed
> > > > memory which is then truncated.  So really any solution we present 
> > > > which:
> > > > 
> > > > A) Prevents file system corruption or data leaks
> > > > ...and...
> > > > B) Informs the user that they did something wrong
> > > > 
> > > > Should be an acceptable solution.
> > > > 
> > > > Because this is slightly new behavior.  And because this is going to be
> > > > specific to DAX (because of the lack of a page cache) we have made the 
> > > > user
> > > > "opt in" to this behavior.
> > > > 
> > > > The following patches implement the following solution.
> > > > 
> > > > 0) Registrations to Device DAX char devs are not affected
> > > > 
> > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > exclusive
> > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > visible now.
> > > > 
> > > > 2) page pins will fail if the lease is not active when the file back 
> > > > page is
> > > >encountered.
> > > > 
> > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > 
> > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > page which has pincount increased? If the first then I'd rephrase this to
> > > be less ambiguous, if the second then I think it is wrong. 
> > 
> > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > pincount processing will hang the truncate.  Given the discussion with John 
> > H
> > we can make this a bit better if we use something like FOLL_PIN and the page
> > count bias to indicate this type of pin.  Then I could fail the truncate
> > outright.  but that is not done yet.
> > 
> > so... I used the word "fail" to be a bit more vague as the final 
> > implementation
> > may return ETXTBUSY or hang as noted.
> 
> Ah, OK. Hanging is fine in principle but with longterm pins, your work
> makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> e.g. DIO will use page pins as well for its buffers and we must wait there
> until the pin is released. So please just clarify your 'fail' here a bit
> :).

It will fail with ETXTBSY.  I've fixed a bug...  See below.

> 
> > > > 4) The user has the option of holding the lease or releasing it.  If 
> > > > they
> > > >release it no other pin calls will work on the file.
> > > 
> > > Last time we spoke the plan was that the lease is kept while the pages are
> > > pinned (and an attempt to release the lease would block until the pages 
> > > are
> > > unpinned). That also makes it clear that the *lease* is what is making
> > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > just an implementation detail how the existence is efficiently tracked 
> > > (and
> > > what keeps the backing file for the pages open so that the lease does not
> > > get auto-destroyed). Why did you change this?
> > 
> > closing the file _and_ unmaping it will cause the lease to be released
> > regardless of if we allow this or not.
> > 
> > As we discussed preventing the close seemed intractable.
> 
> Yes, preventing the application from closing the file is difficult. But
> from a quick look at your patches it seemed to me that you actually hold a
> backing file reference from the file_pin structure thus even though the
> application closes its file descriptor, the struct file (and thus the
> lease) l

Re: [RFC PATCH 2/2] mm/gup: introduce vaddr_pin_pages_remote()

2019-08-16 Thread Ira Weiny
On Fri, Aug 16, 2019 at 11:50:09AM -0700, John Hubbard wrote:
> On 8/16/19 11:33 AM, Ira Weiny wrote:
> > On Fri, Aug 16, 2019 at 05:41:08PM +0200, Jan Kara wrote:
> > > On Thu 15-08-19 19:14:08, John Hubbard wrote:
> > > > On 8/15/19 10:41 AM, John Hubbard wrote:
> > > > > On 8/15/19 10:32 AM, Ira Weiny wrote:
> > > > > > On Thu, Aug 15, 2019 at 03:35:10PM +0200, Jan Kara wrote:
> > > > > > > On Thu 15-08-19 15:26:22, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 20:01:07, John Hubbard wrote:
> > > > > > > > > On 8/14/19 5:02 PM, John Hubbard wrote:
> > > > ...
> > > > 
> > > > OK, there was only process_vm_access.c, plus (sort of) Bharath's sgi-gru
> > > > patch, maybe eventually [1].  But looking at process_vm_access.c, I 
> > > > think
> > > > it is one of the patches that is no longer applicable, and I can just
> > > > drop it entirely...I'd welcome a second opinion on that...
> > > 
> > > I don't think you can drop the patch. process_vm_rw_pages() clearly 
> > > touches
> > > page contents and does not synchronize with page_mkclean(). So it is case
> > > 1) and needs FOLL_PIN semantics.
> > 
> > John could you send a formal patch using vaddr_pin* and I'll add it to the
> > tree?
> > 
> 
> Yes...hints about which struct file to use here are very welcome, btw. This 
> part
> of mm is fairly new to me.

I'm still working out the final semantics of vaddr_pin*.  But right now you
don't need a vaddr_pin if you don't specify FOLL_LONGTERM.

Since case 1 (this case) does not need FOLL_LONGTERM, I think it is safe to
simply pass NULL here.

OTOH we could just track this against the mm_struct.  But I don't think we need
to because this pin should be transient.

And this is why I keep leaning toward _not_ having the vaddr_pin*() calls set
these flags themselves.  I know this is what I did, but I think I'm wrong.  The
caller should specify what they want, and the vaddr_pin*() calls should check
that what they are asking for is correct.
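
Purely as an illustration (not what is in the WIP tree), the shape I have in
mind is something like this; the names and checks are illustrative only:

/*
 * The caller passes the gup_flags it wants; vaddr_pin_pages() only
 * validates them rather than adding flags of its own.
 */
long vaddr_pin_pages(struct task_struct *tsk, struct mm_struct *mm,
		     unsigned long start, unsigned long nr_pages,
		     unsigned int gup_flags, struct page **pages,
		     struct vaddr_pin *vaddr_pin)
{
	/* Long-term pins must be tracked against a vaddr_pin/file_pin */
	if ((gup_flags & FOLL_LONGTERM) && !vaddr_pin)
		return -EINVAL;

	/* Transient pins (case 1 above) may legitimately pass a NULL pin */
	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
				     pages, NULL, NULL);
}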

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-16 Thread Ira Weiny
On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > Hello!
> > > > 
> > > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > > Pre-requisites
> > > > > ==
> > > > >   Based on mmotm tree.
> > > > > 
> > > > > Based on the feedback from LSFmm, the LWN article, the RFC series 
> > > > > since
> > > > > then, and a ton of scenarios I've worked in my mind and/or 
> > > > > tested...[1]
> > > > > 
> > > > > Solution summary
> > > > > 
> > > > > 
> > > > > The real issue is that there is no use case for a user to have RDMA 
> > > > > pinn'ed
> > > > > memory which is then truncated.  So really any solution we present 
> > > > > which:
> > > > > 
> > > > > A) Prevents file system corruption or data leaks
> > > > > ...and...
> > > > > B) Informs the user that they did something wrong
> > > > > 
> > > > > Should be an acceptable solution.
> > > > > 
> > > > > Because this is slightly new behavior.  And because this is going to 
> > > > > be
> > > > > specific to DAX (because of the lack of a page cache) we have made 
> > > > > the user
> > > > > "opt in" to this behavior.
> > > > > 
> > > > > The following patches implement the following solution.
> > > > > 
> > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > 
> > > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > > exclusive
> > > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > > visible now.
> > > > > 
> > > > > 2) page pins will fail if the lease is not active when the file back 
> > > > > page is
> > > > >encountered.
> > > > > 
> > > > > 3) Any truncate or hole punch operation on a pinned DAX page will 
> > > > > fail.
> > > > 
> > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > mean a page which has corresponding file_pin covering it? Or do you 
> > > > mean a
> > > > page which has pincount increased? If the first then I'd rephrase this 
> > > > to
> > > > be less ambiguous, if the second then I think it is wrong. 
> > > 
> > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" 
> > > page
> > > pincount processing will hang the truncate.  Given the discussion with 
> > > John H
> > > we can make this a bit better if we use something like FOLL_PIN and the 
> > > page
> > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > outright.  but that is not done yet.
> > > 
> > > so... I used the word "fail" to be a bit more vague as the final 
> > > implementation
> > > may return ETXTBUSY or hang as noted.
> > 
> > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > e.g. DIO will use page pins as well for its buffers and we must wait there
> > until the pin is released. So please just clarify your 'fail' here a bit
> > :).
> 
> It will fail with ETXTBSY.  I've fixed a bug...  See below.
> 
> > 
> > > > > 4) The user has the option of holding the lease or releasing it.  If 
> > > > > they
> > > > >release it no other pin calls will work on the file.
> > > > 
> > > > Last time we spoke the plan was that the lease is kept while the pages 
> > > > are
> > > > pinned (and an attempt to release the lease would block until the pages 
> > > > are
> > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > > just an implementation detail how the existence is efficiently tracked 
> > > > (and
> > > > w

Re: add a not device managed memremap_pages v2

2019-08-16 Thread Ira Weiny
On Fri, Aug 16, 2019 at 08:54:30AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
> 
> Bharata has been working on secure page management for kvmppc guests,
> and one I thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> instrastructure for device private memory.
> 
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.
> 
> Changes since v1:
>  - don't overload devm_request_free_mem_region
>  - export the memremap_pages and munmap_pages as kvmppc can be a module

Except for the questions from Andrew this does not look to change anything so:

Reviewed-by: Ira Weiny 

> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-30 Thread Ira Weiny
On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> As for FOLL_LONGTERM, it is checked in the slow path
> __gup_longterm_unlocked(). But it is not checked in the fast path, which
> means a possible leak of CMA page to longterm pinned requirement through
> this crack.
> 
> Place a check in the fast path.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Ira Weiny 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Dan Williams 
> Cc: Matthew Wilcox 
> Cc: John Hubbard 
> Cc: "Aneesh Kumar K.V" 
> Cc: Keith Busch 
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/gup.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f173fcb..00feab3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2235,6 +2235,18 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   local_irq_enable();
>   ret = nr;
>   }
> +#if defined(CONFIG_CMA)
> + if (unlikely(gup_flags & FOLL_LONGTERM)) {
> + int i, j;
> +
> + for (i = 0; i < nr; i++)
> + if (is_migrate_cma_page(pages[i])) {
> + for (j = i; j < nr; j++)
> + put_page(pages[j]);

Should be put_user_page() now.  For now that just calls put_page() but it is
slated to change soon.

I also wonder if this would be more efficient as a check while we are walking
the page tables so we could bail early.

Perhaps the code complexity is not worth it?

> + nr = i;

Why not just break from the loop here?

Or better yet just use 'i' in the inner loop...
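
Something like this, perhaps (a rough, untested sketch just to illustrate
reusing 'i', and using put_user_page() per the comment above):

	if (unlikely(gup_flags & FOLL_LONGTERM)) {
		int i;

		/* find the first CMA page, if any */
		for (i = 0; i < nr; i++)
			if (is_migrate_cma_page(pages[i]))
				break;

		if (i != nr) {
			int first_cma = i;

			/* release the first CMA page and everything after it */
			for (; i < nr; i++)
				put_user_page(pages[i]);
			nr = first_cma;
		}
	}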

Ira

> + }
> + }
> +#endif
>  
>   if (nr < nr_pages) {
>   /* Try to get the remaining pages with get_user_pages */
> -- 
> 2.7.5
> 


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-30 Thread Ira Weiny
On Thu, May 30, 2019 at 04:21:19PM -0700, John Hubbard wrote:
> On 5/30/19 2:47 PM, Ira Weiny wrote:
> > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> [...]
> >> +  for (j = i; j < nr; j++)
> >> +  put_page(pages[j]);
> > 
> > Should be put_user_page() now.  For now that just calls put_page() but it is
> > slated to change soon.
> > 
> > I also wonder if this would be more efficient as a check while we are walking
> > the page tables so we could bail early.
> > 
> > Perhaps the code complexity is not worth it?
> 
> Good point, it might be worth it. Because now we've got two loops that
> we run, after the interrupts-off page walk, and it's starting to look like
> a potential performance concern. 

FWIW I don't see this being a huge issue at the moment.  Perhaps those more
familiar with CMA can weigh in here.  How was this issue found?  If it was
found by running some test perhaps that indicates a performance preference?

> 
> > 
> >> +  nr = i;
> > 
> > Why not just break from the loop here?
> > 
> > Or better yet just use 'i' in the inner loop...
> > 
> 
> ...but if you do end up putting in the after-the-fact check, then we can
> go one or two steps further in cleaning it up, by:
> 
> * hiding the visible #ifdef that was slicing up gup_fast,
> 
> * using put_user_pages() instead of either put_page or put_user_page,
>   thus getting rid of j entirely, and
> 
> * renaming an ancient minor confusion: nr --> nr_pinned), 
> 
> we could have this, which looks cleaner and still does the same thing:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f173fcbaf1b2..0c1f36be1863 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1486,6 +1486,33 @@ static __always_inline long 
> __gup_longterm_locked(struct task_struct *tsk,
>  }
>  #endif /* CONFIG_FS_DAX || CONFIG_CMA */
>  
> +#ifdef CONFIG_CMA
> +/*
> + * Returns the number of pages that were *not* rejected. This makes it
> + * exactly compatible with its callers.
> + */
> +static int reject_cma_pages(int nr_pinned, unsigned gup_flags,
> + struct page **pages)
> +{
> + int i = 0;
> + if (unlikely(gup_flags & FOLL_LONGTERM)) {
> +
> + for (i = 0; i < nr_pinned; i++)
> + if (is_migrate_cma_page(pages[i])) {
> + put_user_pages(&pages[i], nr_pinned - i);

Yes this is cleaner.

> + break;
> + }
> + }
> + return i;
> +}
> +#else
> +static int reject_cma_pages(int nr_pinned, unsigned gup_flags,
> + struct page **pages)
> +{
> + return nr_pinned;
> +}
> +#endif
> +
>  /*
>   * This is the same as get_user_pages_remote(), just with a
>   * less-flexible calling convention where we assume that the task
> @@ -2216,7 +2243,7 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   unsigned int gup_flags, struct page **pages)
>  {
>   unsigned long addr, len, end;
> - int nr = 0, ret = 0;
> + int nr_pinned = 0, ret = 0;

To be absolutely pedantic I would have split the nr_pinned change to a separate
patch.

Ira

>  
>   start &= PAGE_MASK;
>   addr = start;
> @@ -2231,25 +2258,27 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>  
>   if (gup_fast_permitted(start, nr_pages)) {
>   local_irq_disable();
> - gup_pgd_range(addr, end, gup_flags, pages, &nr);
> + gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
>   local_irq_enable();
> - ret = nr;
> + ret = nr_pinned;
>   }
>  
> - if (nr < nr_pages) {
> + nr_pinned = reject_cma_pages(nr_pinned, gup_flags, pages);
> +
> + if (nr_pinned < nr_pages) {
>   /* Try to get the remaining pages with get_user_pages */
> - start += nr << PAGE_SHIFT;
> - pages += nr;
> + start += nr_pinned << PAGE_SHIFT;
> + pages += nr_pinned;
>  
> - ret = __gup_longterm_unlocked(start, nr_pages - nr,
> + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
> gup_flags, pages);
>  
>   /* Have to be a bit careful with return values */
> - if (nr > 0) {
> + if (nr_pinned > 0) {
>   if (ret < 0)
> - ret = nr;
> + ret = nr_pinned;
>   else
> - ret += nr;
> + ret += nr_pinned;
>   }
>   }
>  
> 
> Rather lightly tested...I've compile-tested with CONFIG_CMA and !CONFIG_CMA, 
> and boot tested with CONFIG_CMA, but could use a second set of eyes on whether
> I've added any off-by-one errors, or worse. :)
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-31 Thread Ira Weiny
On Fri, May 31, 2019 at 07:05:27PM +0800, Pingfan Liu wrote:
> On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
> >
> >
> > Rather lightly tested...I've compile-tested with CONFIG_CMA and !CONFIG_CMA,
> > and boot tested with CONFIG_CMA, but could use a second set of eyes on 
> > whether
> > I've added any off-by-one errors, or worse. :)
> >
> Do you mind I send V2 based on your above patch? Anyway, it is a simple bug 
> fix.

FWIW please split out the nr_pinned change to a separate patch.

Thanks,
Ira


Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-13 Thread Ira Weiny
On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> On Tue, Jun 11, 2019 at 04:29:11PM +, Weiny, Ira wrote:
> > > Pingfan Liu  writes:
> > > 
> > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > through this crack.
> > > 
> > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > dax check we need vma to ensure whether a long term pin is allowed or not.
> > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > 
> > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it 
> > does this while walking the page tables.  I missed the CMA case and 
> > Pingfan's patch fixes this.  We could check for CMA pages while walking the 
> > page tables but most agreed that it was not worth it.  For DAX we already 
> > had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks 
> > there.
> > 
> Then for CMA pages, are you suggesting something like:

I'm not suggesting this.

Sorry I wrote this prior to seeing the numbers in your other email.  Given
the numbers it looks like performing the check whilst walking the tables is
worth the extra complexity.  I was just trying to summarize the thread.  I
don't think we should disallow FOLL_LONGTERM because it only affects CMA and
DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
we don't have to?  Also in the case of DAX the use of vma will be going
away...[1]  Eventually...  ;-)

Ira

[1] https://lkml.org/lkml/2019/6/5/1049

> diff --git a/mm/gup.c b/mm/gup.c
> index 42a47c0..8bf3cc3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
> if (unlikely(!access_ok((void __user *)start, len)))
> return -EFAULT;
> 
> +   if (unlikely(gup_flags & FOLL_LONGTERM))
> +   goto slow;
> if (gup_fast_permitted(start, nr_pages)) {
> local_irq_disable();
> gup_pgd_range(addr, end, gup_flags, pages, &nr);
> @@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
> ret = nr;
> }
> 
> +slow:
> if (nr < nr_pages) {
> /* Try to get the remaining pages with get_user_pages */
> start += nr << PAGE_SHIFT;
> 
> Thanks,
>   Pingfan


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > > 
> > >   fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > 
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> > 
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to users is "exclusive access"...
> 
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time".  As I understand it, this is rather more like a 'shared' or
> 'read' lock.  The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.
> 
> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

This is an interesting view...

And after some more thought, exclusive does not seem like a good name for this
because technically F_WRLCK _is_ an exclusive lease...

In addition, the user does not need to take the "exclusive" write lease to be
notified of (broken by) an unexpected truncate.  A "read" lease is broken by
truncate.  (And "write" leases really don't do anything different WRT the
interaction of the FS and the user app.  Write leases control "exclusive"
access between other file descriptors.)

Another thing to consider is that this patch set _allows_ a truncate/hole punch
to proceed _if_ the pages being affected are not actually pinned.  So the
unbreakable/exclusive nature of the lease is not absolute.

Personally I like this functionality.  I'm not quite sure I can make it work
with what Jan is suggesting.  But I like it.

Given the muddied water of "exclusive" and "write" lease I'm now feeling like
Jeff has a point WRT the conflation of F_RDLCK/F_WRLCK/F_UNLCK and this new
functionality.

Should we use his suggested F_SETLAYOUT/F_GETLAYOUT cmd type?[1]
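
If so, from user space I would expect the usage to look roughly like this (cmd
names taken from Jeff's suggestion; the third argument and exact semantics are my
assumption here, just to illustrate the flow):

	fcntl(fd, F_SETLAYOUT, F_RDLCK);	/* take the layout lease */
	/* ... pin pages / register MRs against fd ... */
	fcntl(fd, F_SETLAYOUT, F_UNLCK);	/* drop it after unpinning */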

Ira

[1] https://lkml.org/lkml/2019/6/9/117



Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny  wrote:
> >
> > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > >
> > > > > > > > The main objection to the current ODP & DAX solution is that 
> > > > > > > > very
> > > > > > > > little HW can actually implement it, having the alternative 
> > > > > > > > still
> > > > > > > > require HW support doesn't seem like progress.
> > > > > > > >
> > > > > > > > I think we will eventually start seeing some HW be able to do 
> > > > > > > > this
> > > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > > optional, for recovery from truly catastrophic errors (ie my 
> > > > > > > > DAX is
> > > > > > > > on fire, I need to unplug it).
> > > > > > >
> > > > > > > Agreed.  I think software wise there is not much some of the 
> > > > > > > devices can do
> > > > > > > with such an "invalidate".
> > > > > >
> > > > > > So out of curiosity: What does RDMA driver do when userspace just 
> > > > > > closes
> > > > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > > > aborting
> > > > > > everything that's going on... And I wanted similar behavior here.
> > > > >
> > > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > > everything avoids creating inconsistencies that destroying a subset
> > > > > would create.
> > > > >
> > > > > What has been talked about for lease break is not destroying anything
> > > > > but very selectively saying that one memory region linked to the GUP
> > > > > is no longer functional.
> > > >
> > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > > and closes the file with existing pins (and thus layout lease) we would
> > > > force it to abort everything. Yes, it is disruptive but then the app 
> > > > didn't
> > > > obey the rule that it has to maintain file lease while holding pins. 
> > > > Thus
> > > > such situation should never happen unless the app is malicious / buggy.
> > >
> > > We do have the infrastructure to completely revoke the entire
> > > *content* of a FD (this is called device disassociate). It is
> > > basically close without the app doing close. But again it only works
> > > with some drivers. However, this is more likely something a driver
> > > could support without a HW change though.
> > >
> > > It is quite destructive as it forcibly kills everything RDMA related
> > > the process(es) are doing, but it is less violent than SIGKILL, and
> > > there is perhaps a way for the app to recover from this, if it is
> > > coded for it.
> >
> > I don't think many are...  I think most would effectively be "killed" if 
> > this
> > happened to them.
> >
> > >
> > > My preference would be to avoid this scenario, but if it is really
> > > necessary, we could probably build it with some work.
> > >
> > > The only case we use it today is forced HW hot unplug, so it is rarely
> > > used and only for an 'emergency' like use case.
> >
> > I'd really like to avoid this as well.  I think it will be very confusing 
> > for
> > RDMA apps to have their context suddenly be invalid.  I think if we have a 
> > way
> > for admins to ID who is pinning a file the admin can take more appropriate
> > action on those processes.   Up to and including killing the process.
> 
> Can RDMA context invalidation, "device disassociate", be inflicted on
> a process from the outside? Identifying the pid of a pin holder only
> leaves SIGKILL of the entire process as the remediation for revoking a
> pin, and I assume admins would use the finer grained invalidation
> where it was available.

No not in the way you are describing it.  As Jason said you can hotplug the
device which is "from the outside" but this would affect all users of that
device.

Effectively, we would need a way for an admin to close a specific file
descriptor (or set of fds) which point to that file.  AFAIK there is no way to
do that at all, is there?

Ira



Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > 
> > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > little HW can actually implement it, having the alternative still
> > > > > > require HW support doesn't seem like progress.
> > > > > > 
> > > > > > I think we will eventually start seeing some HW be able to do this
> > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > > on fire, I need to unplug it).
> > > > > 
> > > > > Agreed.  I think software wise there is not much some of the devices 
> > > > > can do
> > > > > with such an "invalidate".
> > > > 
> > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > aborting
> > > > everything that's going on... And I wanted similar behavior here.
> > > 
> > > It aborts *everything* connected to that file descriptor. Destroying
> > > everything avoids creating inconsistencies that destroying a subset
> > > would create.
> > > 
> > > What has been talked about for lease break is not destroying anything
> > > but very selectively saying that one memory region linked to the GUP
> > > is no longer functional.
> > 
> > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > and closes the file with existing pins (and thus layout lease) we would
> > force it to abort everything. Yes, it is disruptive but then the app didn't
> > obey the rule that it has to maintain file lease while holding pins. Thus
> > such situation should never happen unless the app is malicious / buggy.
> 
> We do have the infrastructure to completely revoke the entire
> *content* of a FD (this is called device disassociate). It is
> basically close without the app doing close. But again it only works
> with some drivers. However, this is more likely something a driver
> could support without a HW change though.
> 
> It is quite destructive as it forcibly kills everything RDMA related
> the process(es) are doing, but it is less violent than SIGKILL, and
> there is perhaps a way for the app to recover from this, if it is
> coded for it.

I don't think many are...  I think most would effectively be "killed" if this
happened to them.

> 
> My preference would be to avoid this scenario, but if it is really
> necessary, we could probably build it with some work.
> 
> The only case we use it today is forced HW hot unplug, so it is rarely
> used and only for an 'emergency' like use case.

I'd really like to avoid this as well.  I think it will be very confusing for
RDMA apps to have their context suddenly be invalid.  I think if we have a way
for admins to ID who is pinning a file the admin can take more appropriate
action on those processes.   Up to and including killing the process.

Ira



Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > > 
> > > > >   fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > 
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > > 
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to users is "exclusive access"...
> > > 
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".  As I understand it, this is rather more like a 'shared' or
> > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> > > 
> > > The complication being that by default the filesystem has an exclusive
> > > lock on the mapping, and what we're trying to add is the ability for
> > > readers to ask the filesystem to give up its exclusive lock.
> > 
> > This is an interesting view...
> > 
> > And after some more thought, exclusive does not seem like a good name for 
> > this
> > because technically F_WRLCK _is_ an exclusive lease...
> > 
> > In addition, the user does not need to take the "exclusive" write lease to 
> > be
> > notified of (broken by) an unexpected truncate.  A "read" lease is broken by
> > truncate.  (And "write" leases really don't do anything different WRT the
> > interaction of the FS and the user app.  Write leases control "exclusive"
> > access between other file descriptors.)
> 
> I've been assuming that there is only one type of layout lease -
> there is no use case I've heard of for read/write layout leases, and
> like you say there is zero difference in behaviour at the filesystem
> level - they all have to be broken to allow a non-lease truncate to
> proceed.
> 
> IMO, taking a "read lease" to be able to modify and write to the
> underlying mapping of a file makes absolutely no sense at all.
> IOWs, we're talking exaclty about a revokable layout lease vs an
> exclusive layout lease here, and so read/write really doesn't match
> the policy or semantics we are trying to provide.

I humbly disagree, at least depending on how you look at it...  :-D

The patches as they stand expect the user to take a "read" layout lease, which
indicates they are currently using ("reading") the layout as is.  They are not
changing ("writing" to) the layout.  They then pin pages, which locks parts of
the layout, and therefore they expect no "writers" to change the layout.

The "write" layout lease breaks the "read" layout lease, indicating that the
layout is being written to.  Should the layout be pinned in such a way that it
can't be changed, the "layout writer" (truncate) fails.

In fact, this is what NFS does right now.  The lease it puts on the file is of
"read" type.

nfs4layouts.c:
static int
nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
{
...
fl->fl_flags = FL_LAYOUT;
fl->fl_type = F_RDLCK;
...
}

I was not changing that much from the NFS pattern, which meant the break lease
code worked.

Jan's proposal is solid, but it means that there is no breaking of the lease.  I
tried to add an "exclusive" flag to the "write" lease, but the __break_lease()
code gets weird.  I'm not saying it is not possible, just that I have not seen a
good way to do it.

> 
> > Another thing to consider is that this patch set _allows_ a truncate/hole 
> > punch
> > to proceed _if_ the pages being affected are not actually pinned.  So the
> > unbreakable/exclusive nature of the lease is not absolute.
> 
> If you're talking about the process that owns the layout lease
> running the truncate, then that is fine.
> 
> However, if you are talking about a process that does not own 

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > Are you suggesting that we have something like this from user space?
> > > > 
> > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > 
> > > Rather than "unbreakable", perhaps a clearer description of the
> > > policy it entails is "exclusive"?
> > > 
> > > i.e. what we are talking about here is an exclusive lease that
> > > prevents other processes from changing the layout. i.e. the
> > > mechanism used to guarantee a lease is exclusive is that the layout
> > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > actually presenting to users is "exclusive access"...
> > 
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".
> 
> 
> Layout leases are not locks, they are a user access policy object.
> It is the process/fd which holds the lease and it's the process/fd
> that is granted exclusive access.  This is exactly the same semantic
> as O_EXCL provides for granting exclusive access to a block device
> via open(), yes?
> 
> > As I understand it, this is rather more like a 'shared' or
> > 'read' lock.  The filesystem would be the one which wants an exclusive
> > lock, so it can modify the mapping of logical to physical blocks.
> 
> ISTM that you're conflating internal filesystem implementation with
> application visible semantics. Yes, the filesystem uses internal
> locks to serialise the modification of the things the lease manages
> access too, but that has nothing to do with the access policy the
> lease provides to users.
> 
> e.g. Process A has an exclusive layout lease on file F. It does an
> IO to file F. The filesystem IO path checks that Process A owns the
> lease on the file and so skips straight through layout breaking
> because it owns the lease and is allowed to modify the layout. It
> then takes the inode metadata locks to allocate new space and write
> new data.
> 
> Process B now tries to write to file F. The FS checks whether
> Process B owns a layout lease on file F. It doesn't, so then it
> tries to break the layout lease so the IO can proceed. The layout
> breaking code sees that process A has an exclusive layout lease
> granted, and so returns -ETXTBSY to process B - it is not allowed to
> break the lease and so the IO fails with -ETXTBSY.
> 
> i.e. the exclusive layout lease prevents other processes from
> performing operations that may need to modify the layout from
> performing those operations. It does not "lock" the file/inode in
> any way, it just changes how the layout lease breaking behaves.

Question: Do we expect Process A to get notified that Process B was attempting
to change the layout?

This changes the exclusivity semantics.  While Process A has an exclusive lease
it could release it if notified to allow process B temporary exclusivity.

Question 2: Do we expect other processes (say Process C) to also be able to map
and pin the file?  I believe users will need this, and for layout purposes it is
OK to do so.  But this means that Process A does not have "exclusive" access to
the lease.

So suppose Process C has also placed a layout lease on the file, indicating
that it does not want the layout to change.  Both A and C need to be "broken"
by Process B to change the layout.  If there is no Process B, A and C can run
just fine with a "locked" layout.

Ira

> 
> Further, the "exclusiveness" of a layout lease is completely
> irrelevant to the filesystem that is indicating that an operation
> that may need to modify the layout is about to be performed. All the
> filesystem has to do is handle failures to break the lease
> appropriately.  Yes, XFS serialises the layout lease validation
> against other IO to the same file via it's IO locks, but that's an
> internal data IO coherency requirement, not anything to do with
> layout lease management.
> 
> Note that I talk about /writes/ here. This is interchangable with
> any other operation that may need to modify the extent layout of the
> file, be it truncate, fallocate, etc: the attempt to break the
> layout lease by a non-owner should fail if the lease is "exclusive"
> to the owner.
> 
> > The complication being that by default the filesystem has an exclusive
> > lock on the mapping, and what we're trying to add is the ability for
> > readers to ask the filesystem to give up its exclusive lock.
> 
> The filesystem doesn't even lock the "mapping" until after the
> layout lease has been validated or broken.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> 
> This description doesn't match the behaviour that RDMA wants either.
> Even if Process A has a lease on the file, an IO from Process A which
> results in blocks being freed from the file is going to result in the
> RDMA device being able to write to blocks which are now freed (and
> potentially reallocated to another file).

I don't understand why this would not work for RDMA?  As long as the layout
does not change the page pins can remain in place.

Ira



Re: [PATCHv4 1/3] mm/gup: rename nr as nr_pinned in get_user_pages_fast()

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 06:45:00PM +0800, Pingfan Liu wrote:
> To better reflect the held state of pages and make code self-explaining,
> rename nr as nr_pinned.
> 
> Signed-off-by: Pingfan Liu 

Reviewed-by: Ira Weiny 

> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Dan Williams 
> Cc: Matthew Wilcox 
> Cc: John Hubbard 
> Cc: "Aneesh Kumar K.V" 
> Cc: Keith Busch 
> Cc: Christoph Hellwig 
> Cc: Shuah Khan 
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/gup.c | 20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f173fcb..766ae54 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2216,7 +2216,7 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   unsigned int gup_flags, struct page **pages)
>  {
>   unsigned long addr, len, end;
> - int nr = 0, ret = 0;
> + int nr_pinned = 0, ret = 0;
>  
>   start &= PAGE_MASK;
>   addr = start;
> @@ -2231,25 +2231,25 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>  
>   if (gup_fast_permitted(start, nr_pages)) {
>   local_irq_disable();
> - gup_pgd_range(addr, end, gup_flags, pages, &nr);
> + gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
>   local_irq_enable();
> - ret = nr;
> + ret = nr_pinned;
>   }
>  
> - if (nr < nr_pages) {
> + if (nr_pinned < nr_pages) {
>   /* Try to get the remaining pages with get_user_pages */
> - start += nr << PAGE_SHIFT;
> - pages += nr;
> + start += nr_pinned << PAGE_SHIFT;
> + pages += nr_pinned;
>  
> - ret = __gup_longterm_unlocked(start, nr_pages - nr,
> + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
> gup_flags, pages);
>  
>   /* Have to be a bit careful with return values */
> - if (nr > 0) {
> + if (nr_pinned > 0) {
>   if (ret < 0)
> - ret = nr;
> + ret = nr_pinned;
>   else
> - ret += nr;
> + ret += nr_pinned;
>   }
>   }
>  
> -- 
> 2.7.5
> 


Re: [PATCHv4 2/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 06:45:01PM +0800, Pingfan Liu wrote:
> FOLL_LONGTERM suggests a pin which is going to be given to hardware and
> can't move. It would truncate CMA permanently and should be excluded.
> 
> FOLL_LONGTERM has already been checked in the slow path, but not checked in
> the fast path, which means a possible leak of CMA page to longterm pinned
> requirement through this crack.
> 
> Place a check in gup_pte_range() in the fast path.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Ira Weiny 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Dan Williams 
> Cc: Matthew Wilcox 
> Cc: John Hubbard 
> Cc: "Aneesh Kumar K.V" 
> Cc: Keith Busch 
> Cc: Christoph Hellwig 
> Cc: Shuah Khan 
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/gup.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 766ae54..de1b03f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1757,6 +1757,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
> addr, unsigned long end,
>   VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>   page = pte_page(pte);
>  
> + /*
> +  * FOLL_LONGTERM suggests a pin given to hardware. Prevent it
> +  * from truncating CMA area
> +  */
> + if (unlikely(flags & FOLL_LONGTERM) &&
> + is_migrate_cma_page(page))
> + goto pte_unmap;
> +
>   head = try_get_compound_head(page, 1);
>   if (!head)
>   goto pte_unmap;
> @@ -1900,6 +1908,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   refs++;
>   } while (addr += PAGE_SIZE, addr != end);
>  
> + if (unlikely(flags & FOLL_LONGTERM) &&
> + is_migrate_cma_page(page)) {
> + *nr -= refs;
> + return 0;
> + }
> +

Why can't we place this check before the while loop and skip subtracting the
page count?

Can is_migrate_cma_page() operate on any "subpage" of a compound page?

Here this calls is_migrate_cma_page() on the tail page of the compound page.

I'm not an expert on compound pages or CMA handling, so is this OK?

It seems like you need to call is_migrate_cma_page() on each page within the
while loop?
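
Something like this is what I have in mind (an untested sketch, only to show the
per-subpage check; the bookkeeping mirrors the existing error paths in this
function):

	do {
		if (unlikely(flags & FOLL_LONGTERM) &&
		    is_migrate_cma_page(page)) {
			*nr -= refs;
			return 0;
		}
		pages[*nr] = page;
		(*nr)++;
		page++;
		refs++;
	} while (addr += PAGE_SIZE, addr != end);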

>   head = try_get_compound_head(pmd_page(orig), refs);
>   if (!head) {
>   *nr -= refs;
> @@ -1941,6 +1955,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   refs++;
>   } while (addr += PAGE_SIZE, addr != end);
>  
> + if (unlikely(flags & FOLL_LONGTERM) &&
> + is_migrate_cma_page(page)) {
> + *nr -= refs;
> + return 0;
> + }
> +

Same comment here.

>   head = try_get_compound_head(pud_page(orig), refs);
>   if (!head) {
>   *nr -= refs;
> @@ -1978,6 +1998,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, 
> unsigned long addr,
>   refs++;
>   } while (addr += PAGE_SIZE, addr != end);
>  
> + if (unlikely(flags & FOLL_LONGTERM) &&
> + is_migrate_cma_page(page)) {
> + *nr -= refs;
> + return 0;
> + }
> +

And here.

Ira

>   head = try_get_compound_head(pgd_page(orig), refs);
>   if (!head) {
>   *nr -= refs;
> -- 
> 2.7.5
> 


Re: [PATCHv4 3/3] mm/gup_benchemark: add LONGTERM_BENCHMARK test in gup fast path

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 06:45:02PM +0800, Pingfan Liu wrote:
> Introduce a GUP_LONGTERM_BENCHMARK ioctl to test longterm pin in gup fast
> path.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Ira Weiny 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Dan Williams 
> Cc: Matthew Wilcox 
> Cc: John Hubbard 
> Cc: "Aneesh Kumar K.V" 
> Cc: Keith Busch 
> Cc: Christoph Hellwig 
> Cc: Shuah Khan 
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/gup_benchmark.c | 11 +--
>  tools/testing/selftests/vm/gup_benchmark.c | 10 +++---
>  2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d..83f3378 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -6,8 +6,9 @@
>  #include 
>  
>  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK_IOWR('g', 4, struct gup_benchmark)

But I really like this addition!  Thanks!

But why not just add GUP_FAST_LONGTERM_BENCHMARK to the end of this list (value
4)?  I know the user space test program is probably expected to be in lock step
with this code, but it seems odd to redefine GUP_LONGTERM_BENCHMARK and
GUP_BENCHMARK with this change.
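
i.e. keep the existing values stable and just append the new command (in both
mm/gup_benchmark.c and the selftest), something like:

	#define GUP_FAST_BENCHMARK		_IOWR('g', 1, struct gup_benchmark)
	#define GUP_LONGTERM_BENCHMARK		_IOWR('g', 2, struct gup_benchmark)
	#define GUP_BENCHMARK			_IOWR('g', 3, struct gup_benchmark)
	#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)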

Ira

>  
>  struct gup_benchmark {
>   __u64 get_delta_usec;
> @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
>pages + i);
>   break;
> + case GUP_FAST_LONGTERM_BENCHMARK:
> + nr = get_user_pages_fast(addr, nr,
> + (gup->flags & 1) | FOLL_LONGTERM,
> +  pages + i);
> + break;
>   case GUP_LONGTERM_BENCHMARK:
>   nr = get_user_pages(addr, nr,
>   (gup->flags & 1) | FOLL_LONGTERM,
> @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, 
> unsigned int cmd,
>  
>   switch (cmd) {
>   case GUP_FAST_BENCHMARK:
> + case GUP_FAST_LONGTERM_BENCHMARK:
>   case GUP_LONGTERM_BENCHMARK:
>   case GUP_BENCHMARK:
>   break;
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
> b/tools/testing/selftests/vm/gup_benchmark.c
> index c0534e2..ade8acb 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -15,8 +15,9 @@
>  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
>  
>  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK   _IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK_IOWR('g', 4, struct gup_benchmark)
>  
>  struct gup_benchmark {
>   __u64 get_delta_usec;
> @@ -37,7 +38,7 @@ int main(int argc, char **argv)
>   char *file = "/dev/zero";
>   char *p;
>  
> - while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
>   switch (opt) {
>   case 'm':
>   size = atoi(optarg) * MB;
> @@ -54,6 +55,9 @@ int main(int argc, char **argv)
>   case 'T':
>   thp = 0;
>   break;
> + case 'l':
> + cmd = GUP_FAST_LONGTERM_BENCHMARK;
> + break;
>   case 'L':
>   cmd = GUP_LONGTERM_BENCHMARK;
>   break;
> -- 
> 2.7.5
> 


Re: [PATCHv4 3/3] mm/gup_benchemark: add LONGTERM_BENCHMARK test in gup fast path

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 02:42:47PM -0700, 'Ira Weiny' wrote:
> On Thu, Jun 13, 2019 at 06:45:02PM +0800, Pingfan Liu wrote:
> > Introduce a GUP_LONGTERM_BENCHMARK ioctl to test longterm pin in gup fast
> > path.
> > 
> > Signed-off-by: Pingfan Liu 
> > Cc: Ira Weiny 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: Dan Williams 
> > Cc: Matthew Wilcox 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Keith Busch 
> > Cc: Christoph Hellwig 
> > Cc: Shuah Khan 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/gup_benchmark.c | 11 +--
> >  tools/testing/selftests/vm/gup_benchmark.c | 10 +++---
> >  2 files changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > index 7dd602d..83f3378 100644
> > --- a/mm/gup_benchmark.c
> > +++ b/mm/gup_benchmark.c
> > @@ -6,8 +6,9 @@
> >  #include 
> >  
> >  #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK_IOWR('g', 2, struct 
> > gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK  _IOWR('g', 4, struct gup_benchmark)
> 
> But I really like this addition!  Thanks!
> 
> But why not just add GUP_FAST_LONGTERM_BENCHMARK to the end of this list 
> (value
> 4)?  I know the user space test program is probably expected to be lock step
> with this code but it seems odd to redefine GUP_LONGTERM_BENCHMARK and
> GUP_BENCHMARK with this change.

I see that Andrew pulled this change.  So if others don't think this renumbering
is an issue, feel free to add my:

Reviewed-by: Ira Weiny 

> 
> Ira
> 
> >  
> >  struct gup_benchmark {
> > __u64 get_delta_usec;
> > @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> > nr = get_user_pages_fast(addr, nr, gup->flags & 1,
> >  pages + i);
> > break;
> > +   case GUP_FAST_LONGTERM_BENCHMARK:
> > +   nr = get_user_pages_fast(addr, nr,
> > +   (gup->flags & 1) | FOLL_LONGTERM,
> > +pages + i);
> > +   break;
> > case GUP_LONGTERM_BENCHMARK:
> > nr = get_user_pages(addr, nr,
> > (gup->flags & 1) | FOLL_LONGTERM,
> > @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, 
> > unsigned int cmd,
> >  
> > switch (cmd) {
> > case GUP_FAST_BENCHMARK:
> > +   case GUP_FAST_LONGTERM_BENCHMARK:
> > case GUP_LONGTERM_BENCHMARK:
> > case GUP_BENCHMARK:
> > break;
> > diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
> > b/tools/testing/selftests/vm/gup_benchmark.c
> > index c0534e2..ade8acb 100644
> > --- a/tools/testing/selftests/vm/gup_benchmark.c
> > +++ b/tools/testing/selftests/vm/gup_benchmark.c
> > @@ -15,8 +15,9 @@
> >  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
> >  
> >  #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK_IOWR('g', 2, struct 
> > gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK  _IOWR('g', 4, struct gup_benchmark)
> >  
> >  struct gup_benchmark {
> > __u64 get_delta_usec;
> > @@ -37,7 +38,7 @@ int main(int argc, char **argv)
> > char *file = "/dev/zero";
> > char *p;
> >  
> > -   while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> > +   while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
> > switch (opt) {
> > case 'm':
> > size = atoi(optarg) * MB;
> > @@ -54,6 +55,9 @@ int main(int argc, char **argv)
> > case 'T':
> > thp = 0;
> > break;
> > +   case 'l':
> > +   cmd = GUP_FAST_LONGTERM_BENCHMARK;
> > +   break;
> > case 'L':
> > cmd = GUP_LONGTERM_BENCHMARK;
> > break;
> > -- 
> > 2.7.5
> > 
> 


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Ira Weiny
On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > lease on the file and so skips straight through layout breaking
> > > > because it owns the lease and is allowed to modify the layout. It
> > > > then takes the inode metadata locks to allocate new space and write
> > > > new data.
> > > > 
> > > > Process B now tries to write to file F. The FS checks whether
> > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > tries to break the layout lease so the IO can proceed. The layout
> > > > breaking code sees that process A has an exclusive layout lease
> > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > break the lease and so the IO fails with -ETXTBSY.
> > > 
> > > This description doesn't match the behaviour that RDMA wants either.
> > > Even if Process A has a lease on the file, an IO from Process A which
> > > results in blocks being freed from the file is going to result in the
> > > RDMA device being able to write to blocks which are now freed (and
> > > potentially reallocated to another file).
> > 
> > I don't understand why this would not work for RDMA?  As long as the layout
> > does not change the page pins can remain in place.
> 
> Because process A had a layout lease (and presumably a MR) and the
> layout was still modified in way that invalidates the RDMA MR.

Oh sorry, I misread the above...  (got Process A and B mixed up...)

Right, but Process A still can't free those blocks because the gup pin exists
on them...  So yea it can't _just_ be a layout lease which controls this on the
"file fd".

Ira



Re: [PATCH 3/3] net/xdp: convert put_page() to put_user_page*()

2019-07-22 Thread Ira Weiny
On Mon, Jul 22, 2019 at 03:34:15PM -0700, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> Cc: Björn Töpel 
> Cc: Magnus Karlsson 
> Cc: David S. Miller 
> Cc: net...@vger.kernel.org
> Signed-off-by: John Hubbard 
> ---
>  net/xdp/xdp_umem.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 83de74ca729a..0325a17915de 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -166,14 +166,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
>  
>  static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>  {
> - unsigned int i;
> -
> - for (i = 0; i < umem->npgs; i++) {
> - struct page *page = umem->pgs[i];
> -
> - set_page_dirty_lock(page);
> - put_page(page);
> - }
> + put_user_pages_dirty_lock(umem->pgs, umem->npgs);

What is the difference between this and

__put_user_pages(umem->pgs, umem->npgs, PUP_FLAGS_DIRTY_LOCK);

?

I'm a bit concerned with adding another form of the same interface.  We should
either have one call with flags (an enum in this case) or multiple calls.  Given
the previous discussion let's move in the direction of having the enum, but don't
introduce another caller of the "old" interface.
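
To be clear, by "the enum" I mean roughly this shape (a sketch based on the
earlier discussion; the exact names and values are not settled):

	enum pup_flags {
		PUP_FLAGS_CLEAN		= 0,
		PUP_FLAGS_DIRTY		= 1,
		PUP_FLAGS_LOCK		= 2,
		PUP_FLAGS_DIRTY_LOCK	= 3,
	};

	void __put_user_pages(struct page **pages, unsigned long npages,
			      enum pup_flags flags);

so callers pass the flag rather than us growing a family of put_user_pages_*()
variants.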

So I think on this patch NAK from me.

I also don't like having a __* call in the exported interface but there is a
__get_user_pages_fast() call so I guess there is precedent.  :-/

Ira

>  
>   kfree(umem->pgs);
>   umem->pgs = NULL;
> -- 
> 2.22.0
> 


Re: [PATCH] mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM

2020-08-19 Thread Ira Weiny
On Wed, Aug 19, 2020 at 11:01:00PM +1200, Barry Song wrote:
> gup prohibits users from calling get_user_pages() with FOLL_PIN. But it
> allows users to call get_user_pages() with FOLL_LONGTERM only. It seems
> insensible.
> 
> since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
> users from calling get_user_pages() with FOLL_LONGTERM while not with
> FOLL_PIN.
> 
> mm/gup_benchmark.c used to be the only user who did this improperly.
> But it has been fixed by moving to use pin_user_pages().
> 
> Cc: John Hubbard 
> Cc: Jan Kara 
> Cc: Jérôme Glisse 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Al Viro 
> Cc: Christoph Hellwig 
> Cc: Dan Williams 
> Cc: Dave Chinner 
> Cc: Jason Gunthorpe 
> Cc: Jonathan Corbet 
> Cc: Michal Hocko 
> Cc: Mike Kravetz 
> Cc: Shuah Khan 
> Cc: Vlastimil Babka 
> Signed-off-by: Barry Song 

Seems reasonable to me.

Reviewed-by: Ira Weiny 

> ---
>  mm/gup.c | 37 ++---
>  1 file changed, 22 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ae096ea7583f..4da669f79566 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1789,6 +1789,25 @@ static long __get_user_pages_remote(struct mm_struct 
> *mm,
>  gup_flags | FOLL_TOUCH | FOLL_REMOTE);
>  }
>  
> +static bool is_valid_gup_flags(unsigned int gup_flags)
> +{
> + /*
> +  * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
> +  * never directly by the caller, so enforce that with an assertion:
> +  */
> + if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> + return false;
> + /*
> +  * FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying
> +  * that is, FOLL_LONGTERM is a specific case, more restrictive case of
> +  * FOLL_PIN.
> +  */
> + if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> + return false;
> +
> + return true;
> +}
> +
>  /**
>   * get_user_pages_remote() - pin user pages in memory
>   * @mm:  mm_struct of target mm
> @@ -1854,11 +1873,7 @@ long get_user_pages_remote(struct mm_struct *mm,
>   unsigned int gup_flags, struct page **pages,
>   struct vm_area_struct **vmas, int *locked)
>  {
> - /*
> -  * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
> -  * never directly by the caller, so enforce that with an assertion:
> -  */
> - if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> + if (!is_valid_gup_flags(gup_flags))
>   return -EINVAL;
>  
>   return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
> @@ -1904,11 +1919,7 @@ long get_user_pages(unsigned long start, unsigned long 
> nr_pages,
>   unsigned int gup_flags, struct page **pages,
>   struct vm_area_struct **vmas)
>  {
> - /*
> -  * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
> -  * never directly by the caller, so enforce that with an assertion:
> -  */
> - if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> + if (!is_valid_gup_flags(gup_flags))
>   return -EINVAL;
>  
>   return __gup_longterm_locked(current->mm, start, nr_pages,
> @@ -2810,11 +2821,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast_only);
>  int get_user_pages_fast(unsigned long start, int nr_pages,
>   unsigned int gup_flags, struct page **pages)
>  {
> - /*
> -  * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
> -  * never directly by the caller, so enforce that:
> -  */
> - if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> + if (!is_valid_gup_flags(gup_flags))
>   return -EINVAL;
>  
>   /*
> -- 
> 2.27.0
> 
> 
> 


Re: [PATCH 0/2] Cyrpto: Clean up kmap() use

2020-08-19 Thread Ira Weiny
On Mon, Aug 10, 2020 at 05:40:13PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny 
> 
> While going through kmap() users the following 2 issues were found via code
> inspection.

Any feedback on these patches?  Perhaps I've not included the correct people?
Adding some people to the CC list.

Specifically, Linus Walleij for the ux500 work.  Linus can you comment on the
first patch?

patch1: 
https://lore.kernel.org/lkml/20200811004015.2800392-2-ira.we...@intel.com/
patch2: 
https://lore.kernel.org/lkml/20200811004015.2800392-3-ira.we...@intel.com/

Thanks,
Ira

> 
> Ira Weiny (2):
>   crypto/ux500: Fix kmap() bug
>   crypto: Remove unused async iterators
> 
>  crypto/ahash.c| 41 +++
>  drivers/crypto/ux500/hash/hash_core.c | 30 
>  include/crypto/internal/hash.h| 13 -
>  3 files changed, 22 insertions(+), 62 deletions(-)
> 
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 


Re: [PATCH] drivers/dax: Use kobj_to_dev() instead

2020-08-19 Thread Ira Weiny
On Thu, Aug 13, 2020 at 11:27:02AM +0800, Wang Qing wrote:
> Use kobj_to_dev() instead of container_of()
> 
> Signed-off-by: Wang Qing 

LGTM

Reviewed-by: Ira Weiny 

> ---
>  drivers/dax/bus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index df238c8..24625d2
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -331,7 +331,7 @@ static DEVICE_ATTR_RO(numa_node);
>  
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
> - struct device *dev = container_of(kobj, struct device, kobj);
> + struct device *dev = kobj_to_dev(kobj);
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>  
>   if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0)
> -- 
> 2.7.4


[PATCH] mm/highmem: Clean up endif comments

2020-08-19 Thread ira . weiny
From: Ira Weiny 

The #endif at the end of the file matches up with the '#if
defined(HASHED_PAGE_VIRTUAL)' on line 374, not the earlier CONFIG_HIGHMEM
#if.

Fix comments on both of the #endif's to indicate the correct end of
blocks for each.

Signed-off-by: Ira Weiny 
---
 mm/highmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/highmem.c b/mm/highmem.c
index 64d8dea47dd1..1352a27951e3 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -369,7 +369,7 @@ void kunmap_high(struct page *page)
 }
 
 EXPORT_SYMBOL(kunmap_high);
-#endif
+#endif /* CONFIG_HIGHMEM */
 
 #if defined(HASHED_PAGE_VIRTUAL)
 
@@ -481,4 +481,4 @@ void __init page_address_init(void)
}
 }
 
-#endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
+#endif /* defined(HASHED_PAGE_VIRTUAL) */
-- 
2.25.1



Re: [PATCH v2 2/3] libnvdimm/security: the 'security' attr never show 'overwrite' state

2020-08-06 Thread Ira Weiny
On Mon, Aug 03, 2020 at 04:41:38PM -0600, Jane Chu wrote:
> 'security' attribute displays the security state of an nvdimm.
> During normal operation, the nvdimm state may be one of 'disabled',
> 'unlocked' or 'locked'.  When an admin issues
>   # ndctl sanitize-dimm nmem0 --overwrite
> the attribute is expected to change to 'overwrite' until the overwrite
> operation completes.
> 
> But tests on our systems show that 'overwrite' is never shown during
> the overwrite operation. i.e.
>   # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
>   unlocked
> the attribute remain 'unlocked' through out the operation, consequently
> "ndctl wait-overwrite nmem0" command doesn't wait at all.
> 
> The driver tracks the state in 'nvdimm->sec.flags': when the operation
> starts, it adds an overwrite bit to the flags; and when the operation
> completes, it removes the bit. Hence security_show() should check the
> 'overwrite' bit first, in order to indicate the actual state when multiple
> bits are set in the flags.
> 
> Signed-off-by: Jane Chu 
> Reviewed-by: Dave Jiang 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/dimm_devs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index b7b77e8..5d72026 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
>  {
>   struct nvdimm *nvdimm = to_nvdimm(dev);
>  
> + if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
> + return sprintf(buf, "overwrite\n");
>   if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
>   return sprintf(buf, "disabled\n");
>   if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
>   return sprintf(buf, "unlocked\n");
>   if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
>   return sprintf(buf, "locked\n");
> - if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
> - return sprintf(buf, "overwrite\n");
>   return -ENOTTY;
>  }
>  
> -- 
> 1.8.3.1
> 


Re: [PATCH v2 1/3] libnvdimm/security: fix a typo

2020-08-06 Thread Ira Weiny
On Mon, Aug 03, 2020 at 04:41:37PM -0600, Jane Chu wrote:
> commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
> introduced a typo, causing a 'nvdimm->sec.flags' update being overwritten
> by the subsequent update meant for 'nvdimm->sec.ext_flags'.
> 
> Cc: Dan Williams 
> Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
> Signed-off-by: Jane Chu 
> Reviewed-by: Dave Jiang 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/security.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 4cef69b..8f3971c 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm 
> *nvdimm)
>   clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
>   put_device(&nvdimm->dev);
>   nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
> - nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
> + nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
>  }
>  
>  void nvdimm_security_overwrite_query(struct work_struct *work)
> -- 
> 1.8.3.1
> 


Re: [PATCH v2 3/3] libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr

2020-08-06 Thread Ira Weiny
On Mon, Aug 03, 2020 at 04:41:39PM -0600, Jane Chu wrote:
> commit 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM 
> overwrite support")
> adds a sysfs_notify_dirent() to wake up userspace poll thread when the 
> "overwrite"
> operation has completed. But the notification is issued before the internal
> dimm security state and flags have been updated, so the userspace poll thread
> wakes up and fetches the not-yet-updated attr and falls back to sleep, 
> forever.
> But if user from another terminal issue "ndctl wait-overwrite nmemX" again,
> the command returns instantly.
> 
> Cc: Dave Jiang 
> Cc: Dan Williams 
> Fixes: 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM 
> overwrite support")
> Signed-off-by: Jane Chu 
> Reviewed-by: Dave Jiang 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/security.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 8f3971c..4b80150 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -450,14 +450,19 @@ void __nvdimm_security_overwrite_query(struct nvdimm 
> *nvdimm)
>   else
>   dev_dbg(&nvdimm->dev, "overwrite completed\n");
>  
> - if (nvdimm->sec.overwrite_state)
> - sysfs_notify_dirent(nvdimm->sec.overwrite_state);
> + /*
> +  * Mark the overwrite work done and update dimm security flags,
> +  * then send a sysfs event notification to wake up userspace
> +  * poll threads to pick up the changed state.
> +  */
>   nvdimm->sec.overwrite_tmo = 0;
>   clear_bit(NDD_SECURITY_OVERWRITE, &nvdimm->flags);
>   clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
> - put_device(&nvdimm->dev);
>   nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
>   nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
> + if (nvdimm->sec.overwrite_state)
> + sysfs_notify_dirent(nvdimm->sec.overwrite_state);
> + put_device(&nvdimm->dev);
>  }
>  
>  void nvdimm_security_overwrite_query(struct work_struct *work)
> -- 
> 1.8.3.1
> 
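
To make the race described above concrete, the userspace side of this is
roughly the following wait pattern (a simplified sketch, not ndctl's actual
implementation; the sysfs path and function name are illustrative):

	#include <fcntl.h>
	#include <poll.h>
	#include <unistd.h>

	/* Wait for the "overwrite" state change via the sysfs attribute. */
	static int wait_overwrite(const char *attr_path)
	{
		char buf[64];
		int fd = open(attr_path, O_RDONLY);
		struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

		read(fd, buf, sizeof(buf));      /* arm the sysfs notification */
		poll(&pfd, 1, -1);               /* returns after sysfs_notify_dirent() */
		pread(fd, buf, sizeof(buf), 0);  /* must now observe the updated flags */
		close(fd);
		return 0;
	}

The reordering above ensures the flags are updated before the notification
fires, so the final pread() no longer races with the state update.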


Re: [PATCH v3 33/38] virtio_pmem: convert to LE accessors

2020-08-07 Thread Ira Weiny
On Wed, Aug 05, 2020 at 09:44:45AM -0400, Michael S. Tsirkin wrote:
> Virtio pmem is modern-only. Use LE accessors for config space.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/nvdimm/virtio_pmem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index 5e3d07b47e0c..726c7354d465 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -58,9 +58,9 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>   goto out_err;
>   }
>  
> - virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
>   start, &vpmem->start);
> - virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
>   size, &vpmem->size);

FWIW I think squashing patch 15/38 and this patch would have made more sense.

Acked-by: Ira Weiny 

>  
>   res.start = vpmem->start;
> -- 
> MST
> 


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-17 Thread Ira Weiny
On Thu, Sep 17, 2020 at 07:09:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 17, 2020 at 05:40:59PM -0400, Peter Xu wrote:
> > On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> > > For that to happen, we'd need to have the vma flag so that we wouldn't
> > > have any worry about non-pinners, but as you suggested, I think even
> > > just a mm-wide counter - or flag - to deal with the fast-bup case is
> > > likely perfectly sufficient.
> > 
> > Would mm_struct.pinned_vm suffice?
> 
> I think that could be a good long term goal
> 
> IIRC last time we dug into the locked_vm vs pinned_vm mess it didn't
> get fixed. There is a mix of both kinds, as you saw, and some
> resistance I don't clearly remember to changing it.
> 
> My advice for this -rc fix is to go with a single bit in the mm_struct
> set on any call to pin_user_pages*
> 
> Then only users using pin_user_pages and forking are the only ones who
> would ever do extra COW on fork. I think that is OK for -rc, this
> workload should be rare due to the various historical issues. Anyhow,
> a slow down regression is better than a it is broken regression.
> 
> This can be improved into a counter later. Due to the pinned_vm
> accounting all call sites should have the mm_struct at unpin, but I
> have a feeling it will take a lot of driver patches to sort it all
> out.

Agreed.  The HFI1 driver, for example, increments/decrements pinned_vm on its
own.  I've kind of always felt dirty for that...

I think that long term it would be better to move this accounting into
pin_user_pages(), but Jason is correct that doing so is going to be too
complex for an -rc fix.

Could we move pinned_vm out of the drivers/rdma subsystem?

Ira

> 
> Jason
> 
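
As a rough illustration of the single-bit approach Jason describes above
(purely a sketch: the mm_struct field and helper names below are hypothetical
and not from any posted patch):

	/* Record that this mm has ever pinned pages. */
	static inline void mm_note_pinned(struct mm_struct *mm)
	{
		if (!atomic_read(&mm->has_pinned))	/* hypothetical field */
			atomic_set(&mm->has_pinned, 1);
	}

	/*
	 * fork/COW side: only pay the eager-copy cost when the source mm has
	 * ever pinned and this page looks DMA-pinned.
	 */
	static inline bool fork_needs_copy(struct mm_struct *src_mm,
					   struct page *page)
	{
		return atomic_read(&src_mm->has_pinned) &&
		       page_maybe_dma_pinned(page);
	}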


Re: [PATCH v2] fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set

2020-09-24 Thread Ira Weiny
On Thu, Sep 24, 2020 at 01:59:58PM +0800, Hao Li wrote:
> If DCACHE_REFERENCED is set, fast_dput() will return true, and then
> retain_dentry() have no chance to check DCACHE_DONTCACHE. As a result,
> the dentry won't be killed and the corresponding inode can't be evicted.
> In the following example, the DAX policy can't take effects unless we
> do a drop_caches manually.
> 
>   # DCACHE_LRU_LIST will be set
>   echo abcdefg > test.txt
> 
>   # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything
>   xfs_io -c 'chattr +x' test.txt
> 
>   # Drop caches to make DAX changing take effects
>   echo 2 > /proc/sys/vm/drop_caches
> 
> What this patch does is preventing fast_dput() from returning true if
> DCACHE_DONTCACHE is set. Then retain_dentry() will detect the
> DCACHE_DONTCACHE and will return false. As a result, the dentry will be
> killed and the inode will be evicted. In this way, if we change per-file
> DAX policy, it will take effects automatically after this file is closed
> by all processes.
> 
> I also add some comments to make the code more clear.
> 
> Signed-off-by: Hao Li 

Reviewed-by: Ira Weiny 

> ---
> v1 is split into two standalone patch as discussed in [1], and the first
> patch has been reviewed in [2]. This is the second patch.
> 
> [1]: 
> https://lore.kernel.org/linux-fsdevel/20200831003407.ge12...@dread.disaster.area/
> [2]: 
> https://lore.kernel.org/linux-fsdevel/20200906214002.gi12...@dread.disaster.area/
> 
>  fs/dcache.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index ea0485861d93..97e81a844a96 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -793,10 +793,17 @@ static inline bool fast_dput(struct dentry *dentry)
>* a reference to the dentry and change that, but
>* our work is done - we can leave the dentry
>* around with a zero refcount.
> +  *
> +  * Nevertheless, there are two cases that we should kill
> +  * the dentry anyway.
> +  * 1. free disconnected dentries as soon as their refcount
> +  *reached zero.
> +  * 2. free dentries if they should not be cached.
>*/
>   smp_rmb();
>   d_flags = READ_ONCE(dentry->d_flags);
> - d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
> + d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
> + DCACHE_DISCONNECTED | DCACHE_DONTCACHE;
>  
>   /* Nothing to do? Dropping the reference was all we needed? */
>   if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && 
> !d_unhashed(dentry))
> -- 
> 2.28.0
> 
> 
> 


Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> On Mon, Nov 02 2020 at 12:53, ira weiny wrote:
> > Fenghua Yu (2):
> >   x86/pks: Enable Protection Keys Supervisor (PKS)
> >   x86/pks: Add PKS kernel API
> >
> > Ira Weiny (7):
> >   x86/pkeys: Create pkeys_common.h
> >   x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
> >   x86/pks: Preserve the PKRS MSR on context switch
> >   x86/entry: Pass irqentry_state_t by reference
> >   x86/entry: Preserve PKRS MSR across exceptions
> >   x86/fault: Report the PKRS state on fault
> >   x86/pks: Add PKS test code
> >
> > Thomas Gleixner (1):
> >   x86/entry: Move nmi entry/exit into common code
> 
> So the actual patch ordering is:
> 
>x86/pkeys: Create pkeys_common.h
>x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
>x86/pks: Enable Protection Keys Supervisor (PKS)
>x86/pks: Preserve the PKRS MSR on context switch
>x86/pks: Add PKS kernel API
> 
>x86/entry: Move nmi entry/exit into common code
>x86/entry: Pass irqentry_state_t by reference
> 
>x86/entry: Preserve PKRS MSR across exceptions
>x86/fault: Report the PKRS state on fault
>x86/pks: Add PKS test code
> 
> This is the wrong ordering, really.
> 
>  x86/entry: Move nmi entry/exit into common code
> 
> is a general cleanup and has absolutely nothing to do with PKRS. So this
> wants to go first.
> 

Sorry, yes this should be a pre-patch.

> Also:
> 
> x86/entry: Move nmi entry/exit into common code
> [from other email]
>>  x86/entry: Pass irqentry_state_t by reference
>>  > 
>>  >
> 
> is a prerequisite for the rest. So why is it in the middle of the
> series?

It is in the middle because passing by reference is not needed until additional
information is added to irqentry_state_t which is done immediately after this
patch by:

x86/entry: Preserve PKRS MSR across exceptions

I debated squashing the 2 but it made review harder IMO.  But I thought keeping
them in order together made a lot of sense.

> 
> And then you enable all that muck _before_ it is usable:
> 

Strictly speaking you are correct, sorry.  I will reorder the series.

> 
> Bisectability is overrated, right?

Agreed, bisectability is important.  I thought I had it covered but I was
wrong.

> 
> Once again: Read an understand Documentation/process/*
> 
> Aside of that using a spell checker is not optional.

Agreed.

In looking closer at the entry code I've found a couple of other instances, so
I'll add another precursor patch.

I've also found other errors with the series which I should have caught.  My
apologies; I made some last-minute changes which I should have checked more
thoroughly.

Thanks,
Ira


Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> >> This is the wrong ordering, really.
> >> 
> >>  x86/entry: Move nmi entry/exit into common code
> >> 
> >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> >> wants to go first.
> >
> > Sorry, yes this should be a pre-patch.
> 
> I picked it out of the series and applied it to tip core/entry as I have
> other stuff coming up in that area. 

Thanks!  I'll rebase to that tree.

I assume you fixed the spelling error?  Sorry about that.

Ira


Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Wed, Nov 04, 2020 at 02:45:54PM -0800, 'Ira Weiny' wrote:
> On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> > >> This is the wrong ordering, really.
> > >> 
> > >>  x86/entry: Move nmi entry/exit into common code
> > >> 
> > >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> > >> wants to go first.
> > >
> > > Sorry, yes this should be a pre-patch.
> > 
> > I picked it out of the series and applied it to tip core/entry as I have
> > other stuff coming up in that area. 
> 
> Thanks!  I'll rebase to that tree.
> 
> I assume you fixed the spelling error?  Sorry about that.

I'll fix it and send with the other spelling errors I found.

Ira


[PATCH] entry: Fix spelling/typo errors in irq entry code

2020-11-04 Thread ira . weiny
From: Ira Weiny 

s/reguired/required/
s/Interupts/Interrupts/
s/quiescient/quiescent/
s/assmenbly/assembly/

Signed-off-by: Ira Weiny 
---
 include/linux/entry-common.h | 4 ++--
 kernel/entry/common.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..66938121c4b1 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -415,7 +415,7 @@ void irqentry_exit_cond_resched(void);
  * @state: Return value from matching call to irqentry_enter()
  *
  * Depending on the return target (kernel/user) this runs the necessary
- * preemption and work checks if possible and reguired and returns to
+ * preemption and work checks if possible and required and returns to
  * the caller with interrupts disabled and no further work pending.
  *
  * This is the last action before returning to the low level ASM code which
@@ -438,7 +438,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs 
*regs);
  * @regs:  Pointer to pt_regs (NMI entry regs)
  * @irq_state: Return value from matching call to irqentry_nmi_enter()
  *
- * Last action before returning to the low level assmenbly code.
+ * Last action before returning to the low level assembly code.
  *
  * Counterpart to irqentry_nmi_enter().
  */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bc75c114c1b3..fa17baadf63e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -304,7 +304,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs 
*regs)
 * If this entry hit the idle task invoke rcu_irq_enter() whether
 * RCU is watching or not.
 *
-* Interupts can nest when the first interrupt invokes softirq
+* Interrupts can nest when the first interrupt invokes softirq
 * processing on return which enables interrupts.
 *
 * Scheduler ticks in the idle task can mark quiescent state and
@@ -315,7 +315,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs 
*regs)
 * interrupt to invoke rcu_irq_enter(). If that nested interrupt is
 * the tick then rcu_flavor_sched_clock_irq() would wrongfully
 * assume that it is the first interupt and eventually claim
-* quiescient state and end grace periods prematurely.
+* quiescent state and end grace periods prematurely.
 *
 * Unconditionally invoke rcu_irq_enter() so RCU state stays
 * consistent.
-- 
2.28.0.rc0.12.gb6a658bd00c9



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-22 Thread Ira Weiny
On Mon, Sep 21, 2020 at 12:19:07PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 15 Sep 2020, Dan Williams wrote:
> 
> > > TODO:
> > >
> > > - programs run approximately 4% slower when running from Optane-based
> > > persistent memory. Therefore, programs and libraries should use page cache
> > > and not DAX mapping.
> > 
> > This needs to be based on platform firmware data f(ACPI HMAT) for the
> > relative performance of a PMEM range vs DRAM. For example, this
> > tradeoff should not exist with battery backed DRAM, or virtio-pmem.
> 
> Hi
> 
> I have implemented this functionality - if we mmap a file with 
> (vma->vm_flags & VM_DENYWRITE), then it is assumed that this is executable 
> file mapping - the flag S_DAX on the inode is cleared on and the inode 
> will use normal page cache.
> 
> Is there some way how to test if we are using Optane-based module (where 
> this optimization should be applied) or battery backed DRAM (where it 
> should not)?
> 
> I've added mount options dax=never, dax=auto, dax=always, so that the user 
  
  dax=inode?

'inode' is the option used by ext4/xfs.

Ira

> can override the automatic behavior.
> 
> Mikulas
> 


Re: [PATCH] man/statx: Add STATX_ATTR_DAX

2020-09-28 Thread Ira Weiny
On Mon, May 04, 2020 at 05:20:16PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny 
> 
> Linux 5.8 is slated to have STATX_ATTR_DAX support.
> 
> https://lore.kernel.org/lkml/20200428002142.404144-4-ira.we...@intel.com/
> https://lore.kernel.org/lkml/20200504161352.GA13783@magnolia/
> 
> Add the text to the statx man page.
> 
> Signed-off-by: Ira Weiny 

Have I sent this to the wrong list?  Or perhaps I have missed a reply.

I don't see this applied to the man-pages project.[1]  But perhaps I am looking
in the wrong place?

Thank you,
Ira

[1] git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git

> ---
>  man2/statx.2 | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/man2/statx.2 b/man2/statx.2
> index 2e90f07dbdbc..14c4ab78e7bd 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -468,6 +468,30 @@ The file has fs-verity enabled.
>  It cannot be written to, and all reads from it will be verified
>  against a cryptographic hash that covers the
>  entire file (e.g., via a Merkle tree).
> +.TP
> +.BR STATX_ATTR_DAX (since Linux 5.8)
> +The file is in the DAX (cpu direct access) state.  DAX state attempts to
> +minimize software cache effects for both I/O and memory mappings of this 
> file.
> +It requires a file system which has been configured to support DAX.
> +.PP
> +DAX generally assumes all accesses are via cpu load / store instructions 
> which
> +can minimize overhead for small accesses, but may adversely affect cpu
> +utilization for large transfers.
> +.PP
> +File I/O is done directly to/from user-space buffers and memory mapped I/O 
> may
> +be performed with direct memory mappings that bypass kernel page cache.
> +.PP
> +While the DAX property tends to result in data being transferred 
> synchronously,
> +it does not give the same guarantees of O_SYNC where data and the necessary
> +metadata are transferred together.
> +.PP
> +A DAX file may support being mapped with the MAP_SYNC flag, which enables a
> +program to use CPU cache flush instructions to persist CPU store operations
> +without an explicit
> +.BR fsync(2).
> +See
> +.BR mmap(2)
> +for more information.
>  .SH RETURN VALUE
>  On success, zero is returned.
>  On error, \-1 is returned, and
> -- 
> 2.25.1
> 
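
For reference, checking the attribute described above from userspace looks
roughly like this (assumes kernel and libc headers new enough to define
statx(2) and STATX_ATTR_DAX; the file name is illustrative):

	#define _GNU_SOURCE
	#include <fcntl.h>      /* AT_FDCWD */
	#include <sys/stat.h>   /* statx() */
	#include <stdio.h>

	int main(void)
	{
		struct statx stx;

		if (statx(AT_FDCWD, "test.txt", 0, STATX_BASIC_STATS, &stx) == 0 &&
		    (stx.stx_attributes & STATX_ATTR_DAX))
			printf("test.txt is in the DAX state\n");
		return 0;
	}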


Re: [PATCH] nvdimm: Use kobj_to_dev() API

2020-09-28 Thread Ira Weiny
On Sat, Sep 26, 2020 at 02:54:17PM +0800, Wang Qing wrote:
> Use kobj_to_dev() instead of container_of().
> 
> Signed-off-by: Wang Qing 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/namespace_devs.c | 2 +-
>  drivers/nvdimm/region_devs.c| 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index 6da67f4..1d11ca7
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1623,7 +1623,7 @@ static struct attribute *nd_namespace_attributes[] = {
>  static umode_t namespace_visible(struct kobject *kobj,
>   struct attribute *a, int n)
>  {
> - struct device *dev = container_of(kobj, struct device, kobj);
> + struct device *dev = kobj_to_dev(kobj);
>  
>   if (a == &dev_attr_resource.attr && is_namespace_blk(dev))
>   return 0;
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ef23119..92adfaf
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -644,7 +644,7 @@ static struct attribute *nd_region_attributes[] = {
>  
>  static umode_t region_visible(struct kobject *kobj, struct attribute *a, int 
> n)
>  {
> - struct device *dev = container_of(kobj, typeof(*dev), kobj);
> + struct device *dev = kobj_to_dev(kobj);
>   struct nd_region *nd_region = to_nd_region(dev);
>   struct nd_interleave_set *nd_set = nd_region->nd_set;
>   int type = nd_region_to_nstype(nd_region);
> @@ -759,7 +759,7 @@ REGION_MAPPING(31);
>  
>  static umode_t mapping_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
> - struct device *dev = container_of(kobj, struct device, kobj);
> + struct device *dev = kobj_to_dev(kobj);
>   struct nd_region *nd_region = to_nd_region(dev);
>  
>   if (n < nd_region->ndr_mappings)
> -- 
> 2.7.4
> 


Re: [PATCH] mm/vmalloc.c: check the addr first

2020-09-28 Thread Ira Weiny
On Mon, Sep 28, 2020 at 12:33:37AM +0800, Hui Su wrote:
> As the comments said, if @addr is NULL, no operation
> is performed, check the addr first in vfree() and
> vfree_atomic() maybe a better choice.

I don't see how this change helps anything.  kmemleak_free() checks addr so no
danger there.  Also kmemleak_free() contains a pr_debug() which some might find
useful.

Ira

> 
> Signed-off-by: Hui Su 
> ---
>  mm/vmalloc.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index be4724b916b3..1cf50749a209 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2305,10 +2305,11 @@ void vfree_atomic(const void *addr)
>  {
>   BUG_ON(in_nmi());
>  
> - kmemleak_free(addr);
> -
>   if (!addr)
>   return;
> +
> + kmemleak_free(addr);
> +
>   __vfree_deferred(addr);
>  }
>  
> @@ -2340,13 +2341,13 @@ void vfree(const void *addr)
>  {
>   BUG_ON(in_nmi());
>  
> + if (!addr)
> + return;
> +
>   kmemleak_free(addr);
>  
>   might_sleep_if(!in_interrupt());
>  
> - if (!addr)
> - return;
> -
>   __vfree(addr);
>  }
>  EXPORT_SYMBOL(vfree);
> -- 
> 2.25.1
> 
> 
> 


Re: [PATCH 6/8] selftests/vm: gup_test: introduce the dump_pages() sub-test

2020-09-28 Thread Ira Weiny
On Sun, Sep 27, 2020 at 11:21:57PM -0700, John Hubbard wrote:
> For quite a while, I was doing a quick hack to gup_test.c (previously,
> gup_benchmark.c) whenever I wanted to try out my changes to dump_page().
> This makes that hack unnecessary, and instead allows anyone to easily
> get the same coverage from a user space program. That saves a lot of
> time because you don't have to change the kernel, in order to test
> different pages and options.
> 
> The new sub-test takes advantage of the existing gup_test
> infrastructure, which already provides a simple user space program, some
> allocated user space pages, an ioctl call, pinning of those pages (via
> either get_user_pages or pin_user_pages) and a corresponding kernel-side
> test invocation. There's not much more required, mainly just a couple of
> inputs from the user.
> 
> In fact, the new test re-uses the existing command line options in order
> to get various helpful combinations (THP or normal, _fast or slow gup,
> gup vs. pup, and more).
> 
> New command line options are: which pages to dump, and what type of
> "get/pin" to use.
> 
> In order to figure out which pages to dump, the logic is:
> 
> * If the user doesn't specify anything, the page 0 (the first page in
> the address range that the program sets up for testing) is dumped.
> 
> * Or, the user can type up to 8 page indices anywhere on the command
> line. If you type more than 8, then it uses the first 8 and ignores the
> remaining items.
> 
> For example:
> 
> ./gup_test -ct -F 1 0 19 0x1000
> 
> Meaning:
> -c:  dump pages sub-test
> -t:  use THP pages
> -F 1:use pin_user_pages() instead of get_user_pages()
> 0 19 0x1000: dump pages 0, 19, and 4096
> 
> Also, invoke the new test from run_vmtests.sh. This keeps it in use, and

I don't see a change to run_vmtests.sh?

Ira

> also provides a good example of how to invoke it.
> 
> Signed-off-by: John Hubbard 
> ---
>  mm/Kconfig|  6 +++
>  mm/gup_test.c | 54 ++-
>  mm/gup_test.h | 10 +
>  tools/testing/selftests/vm/gup_test.c | 47 +--
>  4 files changed, 112 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 588984ee5fb4..f7c4c21e5cb1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -845,6 +845,12 @@ config GUP_TEST
> get_user_pages*() and pin_user_pages*(), as well as smoke tests of
> the non-_fast variants.
>  
> +   There is also a sub-test that allows running dump_page() on any
> +   of up to eight pages (selected by command line args) within the
> +   range of user-space addresses. These pages are either pinned via
> +   pin_user_pages*(), or pinned via get_user_pages*(), as specified
> +   by other command line arguments.
> +
> See tools/testing/selftests/vm/gup_test.c
>  
>  config GUP_GET_PTE_LOW_HIGH
> diff --git a/mm/gup_test.c b/mm/gup_test.c
> index a980c4a194f0..e79dc364eafb 100644
> --- a/mm/gup_test.c
> +++ b/mm/gup_test.c
> @@ -7,7 +7,7 @@
>  #include "gup_test.h"
>  
>  static void put_back_pages(unsigned int cmd, struct page **pages,
> -unsigned long nr_pages)
> +unsigned long nr_pages, unsigned int gup_test_flags)
>  {
>   unsigned long i;
>  
> @@ -23,6 +23,15 @@ static void put_back_pages(unsigned int cmd, struct page 
> **pages,
>   case PIN_LONGTERM_BENCHMARK:
>   unpin_user_pages(pages, nr_pages);
>   break;
> + case DUMP_USER_PAGES_TEST:
> + if (gup_test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) {
> + unpin_user_pages(pages, nr_pages);
> + } else {
> + for (i = 0; i < nr_pages; i++)
> + put_page(pages[i]);
> +
> + }
> + break;
>   }
>  }
>  
> @@ -49,6 +58,37 @@ static void verify_dma_pinned(unsigned int cmd, struct 
> page **pages,
>   }
>  }
>  
> +static void dump_pages_test(struct gup_test *gup, struct page **pages,
> + unsigned long nr_pages)
> +{
> + unsigned int index_to_dump;
> + unsigned int i;
> +
> + /*
> +  * Zero out any user-supplied page index that is out of range. Remember:
> +  * .which_pages[] contains a 1-based set of page indices.
> +  */
> + for (i = 0; i < GUP_TEST_MAX_PAGES_TO_DUMP; i++) {
> + if (gup->which_pages[i] > nr_pages) {
> + pr_warn("ZEROING due to out of range: .which_pages[%u]: 
> %u\n",
> + i, gup->which_pages[i]);
> + gup->which_pages[i] = 0;
> + }
> + }
> +
> + for (i = 0; i < GUP_TEST_MAX_PAGES_TO_DUMP; i++) {
> + index_to_dump = gup->which_pages[i];
> +
> + if (index_to_dump) {
> + index_to_dump--; // Decode from 1-based, to 0-based
> +   

Re: [PATCH 7/8] selftests/vm: run_vmtest.sh: update and clean up gup_test invocation

2020-09-28 Thread Ira Weiny
On Sun, Sep 27, 2020 at 11:21:58PM -0700, John Hubbard wrote:
> Run benchmarks on the _fast variants of gup and pup, as originally
> intended.
> 
> Run the new gup_test sub-test: dump pages. In addition to exercising the
> dump_page() call, it also demonstrates the various options you can use
> to specify which pages to dump, and how.
> 
> Signed-off-by: John Hubbard 
> ---
>  tools/testing/selftests/vm/run_vmtest.sh | 24 ++--
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/run_vmtest.sh 
> b/tools/testing/selftests/vm/run_vmtest.sh
> index d1843d5f3c30..e3a8b14d9df6 100755
> --- a/tools/testing/selftests/vm/run_vmtest.sh
> +++ b/tools/testing/selftests/vm/run_vmtest.sh
> @@ -124,9 +124,9 @@ else
>  fi
>  
>  echo ""
> -echo "running 'gup_test -U' (normal/slow gup)"
> +echo "running 'gup_test -u' (fast gup benchmark)"
>  echo ""
> -./gup_test -U
> +./gup_test -u
>  if [ $? -ne 0 ]; then
>   echo "[FAIL]"
>   exitcode=1
> @@ -134,10 +134,22 @@ else
>   echo "[PASS]"
>  fi
>  
> -echo "--"
> -echo "running gup_test -b (pin_user_pages)"
> -echo "--"
> -./gup_test -b
> +echo "---"
> +echo "running gup_test -a (pin_user_pages_fast benchmark)"
> +echo "---"
> +./gup_test -a
> +if [ $? -ne 0 ]; then
> + echo "[FAIL]"
> + exitcode=1
> +else
> + echo "[PASS]"
> +fi
> +
> +echo "--"
> +echo "running gup_test -ct -F 0x1 0 19 0x1000"
> +echo "   Dumps pages 0, 19, and 4096, using pin_user_pages (-F 0x1)"
> +echo "--"
> +./gup_test -ct -F 0x1 0 19 0x1000

Ah here it is...  Maybe just remove that from the previous commit message.

Ira

>  if [ $? -ne 0 ]; then
>   echo "[FAIL]"
>   exitcode=1
> -- 
> 2.28.0
> 
> 


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Ira Weiny
On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > always CPU-local and never broadcast.
> > 
> > So, basically, unless you *must* sleep while the mapping is in place,
> > kmap_atomic() is preferred.
> 
> But kmap_atomic() disables preemption, so the _ideal_ interface would map
> it only locally, then on preemption make it global.  I don't even know
> if that _can_ be done.  But this email makes it seem like kmap_atomic()
> has no downsides.

And that is IIUC what Thomas was trying to solve.

Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]

From what I can see all of these discussions support the need to have something
between kmap() and kmap_atomic().

However, the reasons behind converting call sites to kmap_thread() are different
between Thomas' patch set and mine.  Both require more kmap granularity.
However, they do so with different reasons and underlying implementations but
with the _same_ resulting semantics; a thread local mapping which is
preemptable.[2]  Therefore they each focus on changing different call sites.

While this patch set is huge I think it serves a valuable purpose to identify a
large number of call sites which are candidates for this new semantic.

Ira

[1] 
https://lore.kernel.org/lkml/CAHk-=wgbmwsTOKs23Z=71ebtruloeah2u3tnqt2athewvkb...@mail.gmail.com/
[2] It is important to note these implementations are not incompatible with
each other.  So I don't see yet another 'kmap_something()' being required.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Ira Weiny
On Mon, Oct 12, 2020 at 09:02:54PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 12, 2020 at 12:53:54PM -0700, Ira Weiny wrote:
> > On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> > > On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > > > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > > > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > > > always CPU-local and never broadcast.
> > > > 
> > > > So, basically, unless you *must* sleep while the mapping is in place,
> > > > kmap_atomic() is preferred.
> > > 
> > > But kmap_atomic() disables preemption, so the _ideal_ interface would map
> > > it only locally, then on preemption make it global.  I don't even know
> > > if that _can_ be done.  But this email makes it seem like kmap_atomic()
> > > has no downsides.
> > 
> > And that is IIUC what Thomas was trying to solve.
> > 
> > Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]
> > 
> > From what I can see all of these discussions support the need to have
> > something
> > between kmap() and kmap_atomic().
> > 
> > However, the reasons behind converting call sites to kmap_thread() are
> > different
> > between Thomas' patch set and mine.  Both require more kmap granularity.
> > However, they do so with different reasons and underlying implementations 
> > but
> > with the _same_ resulting semantics; a thread local mapping which is
> > preemptable.[2]  Therefore they each focus on changing different call sites.
> > 
> > While this patch set is huge I think it serves a valuable purpose to 
> > identify a
> > large number of call sites which are candidates for this new semantic.
> 
> Yes, I agree.  My problem with this patch-set is that it ties it to
> some Intel feature that almost nobody cares about.

I humbly disagree.  At this level the only thing this is tied to is the idea
that there are additional memory protections available which can be enabled
quickly on a per-thread basis.  PKS on Intel is but 1 implementation of that.

Even the kmap code only knows that there is something which needs special
handling on a devm page.

>
> Maybe we should
> care about it, but you didn't try very hard to make anyone care about
> it in the cover letter.

Ok my bad.  We have customers who care very much about restricting access to
the PMEM pages to prevent bugs in the kernel from causing permanent damage to
their data/file systems.  I'll reword the cover letter to make that clearer.

> 
> For a future patch-set, I'd like to see you just introduce the new
> API.  Then you can optimise the Intel implementation of it afterwards.
> Those patch-sets have entirely different reviewers.

I considered doing this.  But this seemed more logical because the feature is
being driven by PMEM, which is behind the kmap interface, not by the users of
the API.

I can introduce a patch set with a kmap_thread() call which does nothing if
that is more palatable but it seems wrong to me to do so.

Ira


Re: [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 10:46:16AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> > in similar fashions and can share common defines.
> 
> Could we be a bit less abstract?  PKS and PKU each have:
> 1. A single control register
> 2. The same number of keys
> 3. The same number of bits in the register per key
> 4. Access and Write disable in the same bit locations
> 
> That means that we can share all the macros that synthesize and
> manipulate register values between the two features.

Sure.  Done.

> 
> > +++ b/arch/x86/include/asm/pkeys_common.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> > +#define _ASM_X86_PKEYS_INTERNAL_H
> > +
> > +#define PKR_AD_BIT 0x1
> > +#define PKR_WD_BIT 0x2
> > +#define PKR_BITS_PER_PKEY 2
> > +
> > +#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
> 
> Now that this has moved away from its use-site, it's a bit less
> self-documenting.  Let's add a comment:
> 
> /*
>  * Generate an Access-Disable mask for the given pkey.  Several of these
>  * can be OR'd together to generate pkey register values.
>  */

Fair enough. done.

> 
> Once that's in place, along with the updated changelog:
> 
> Reviewed-by: Dave Hansen 

Thanks,
Ira
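
As a worked example of the masks being discussed (a sketch; the values follow
directly from the PKR_* defines quoted in this series):

	/*
	 * PKR_BITS_PER_PKEY == 2 and PKR_AD_BIT == 0x1, so pkey 3 occupies
	 * bits 6-7 of the register:
	 *
	 *   PKR_AD_KEY(3) == 0x1 << (3 * 2) == 0x40
	 *
	 * OR'ing several masks composes a register value:
	 */
	u32 pkr_val = PKR_AD_KEY(3) | PKR_AD_KEY(4);	/* keys 3 and 4: access-disable */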



Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:
> On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> > On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > Cc: Nicolas Pitre 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  fs/cramfs/inode.c | 10 +-
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > > index 912308600d39..003c014a42ed 100644
> > > --- a/fs/cramfs/inode.c
> > > +++ b/fs/cramfs/inode.c
> > > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block 
> > > *sb, unsigned int offset,
> > > struct page *page = pages[i];
> > >
> > > if (page) {
> > > -   memcpy(data, kmap(page), PAGE_SIZE);
> > > -   kunmap(page);
> > > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > > +   kunmap_thread(page);
> > 
> > Why does this need a sleepable kmap? This looks like a textbook
> > kmap_atomic() use case.
> 
> There's a lot of code of this form.  Could we perhaps have:
> 
> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> int size)
> {
>   char *vto = kmap_atomic(to);
> 
>   memcpy(vto, vfrom, size);
>   kunmap_atomic(vto);
> }
> 
> in linux/highmem.h ?

Christoph had the same idea.  I'll work on it.

Ira



Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 09:01:49PM +0100, Al Viro wrote:
> On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:
> 
> > static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> > int size)
> > {
> > char *vto = kmap_atomic(to);
> > 
> > memcpy(vto, vfrom, size);
> > kunmap_atomic(vto);
> > }
> > 
> > in linux/highmem.h ?
> 
> You mean, like
> static void memcpy_from_page(char *to, struct page *page, size_t offset, 
> size_t len)
> {
> char *from = kmap_atomic(page);
> memcpy(to, from + offset, len);
> kunmap_atomic(from);
> }
> 
> static void memcpy_to_page(struct page *page, size_t offset, const char 
> *from, size_t len)
> {
> char *to = kmap_atomic(page);
> memcpy(to + offset, from, len);
> kunmap_atomic(to);
> }
> 
> static void memzero_page(struct page *page, size_t offset, size_t len)
> {
> char *addr = kmap_atomic(page);
> memset(addr + offset, 0, len);
> kunmap_atomic(addr);
> }
> 
> in lib/iov_iter.c?  FWIW, I don't like that "highpage" in the name and
> highmem.h as location - these make perfect sense regardless of highmem;
> they are normal memory operations with page + offset used instead of
> a pointer...

I was thinking along those lines as well especially because of the direction
this patch set takes kmap().

Thanks for pointing these out to me.  How about I lift them to a common header?
But if not highmem.h where?

Ira
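
For illustration, with a helper along the lines of the memcpy_from_page()
quoted above, the cramfs hunk from the earlier mail would reduce to something
like this (sketch only; the final helper name and location were still being
discussed):

	if (page) {
		/* was: memcpy(data, kmap(page), PAGE_SIZE); kunmap(page); */
		memcpy_from_page(data, page, 0, PAGE_SIZE);
	}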


Re: [PATCH RFC PKS/PMEM 24/58] fs/freevxfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 12:25:44PM +0100, Christoph Hellwig wrote:
> > -   kaddr = kmap(pp);
> > +   kaddr = kmap_thread(pp);
> > memcpy(kaddr, vip->vii_immed.vi_immed + offset, PAGE_SIZE);
> > -   kunmap(pp);
> > +   kunmap_thread(pp);
> 
> You only Cced me on this particular patch, which means I have absolutely
> no idea what kmap_thread and kunmap_thread actually do, and thus can't
> provide an informed review.

Sorry the list was so big I struggled with who to CC and on which patches.

> 
> That being said I think your life would be a lot easier if you add
> helpers for the above code sequence and its counterpart that copies
> to a potential hughmem page first, as that hides the implementation
> details from most users.

Matthew Wilcox and Al Viro have suggested similar ideas.

https://lore.kernel.org/lkml/20201013205012.gi2046...@iweiny-desk2.sc.intel.com/

Ira


Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 10:50:05AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > +/*
> > + * Update the pk_reg value and return it.
> 
> How about:
> 
>   Replace disable bits for @pkey with values from @flags.

Done.

> 
> > + * Kernel users use the same flags as user space:
> > + * PKEY_DISABLE_ACCESS
> > + * PKEY_DISABLE_WRITE
> > + */
> > +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> > +{
> > +   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > +
> > +   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +
> > +   if (flags & PKEY_DISABLE_ACCESS)
> > +   pk_reg |= PKR_AD_BIT << pkey_shift;
> > +   if (flags & PKEY_DISABLE_WRITE)
> > +   pk_reg |= PKR_WD_BIT << pkey_shift;
> 
> I still think this deserves two lines of comments:
> 
>   /* Mask out old bit values */
> 
>   /* Or in new values */

Sure, done.
Ira
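
To show how the helper above composes register values (a sketch; the local
variable is illustrative, not from the patch):

	u32 pkrs = 0;	/* illustrative starting value */

	pkrs = update_pkey_val(pkrs, 2, PKEY_DISABLE_ACCESS);	/* key 2: no access */
	pkrs = update_pkey_val(pkrs, 3, PKEY_DISABLE_WRITE);	/* key 3: read-only */
	pkrs = update_pkey_val(pkrs, 3, 0);			/* key 3: read/write */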



Re: [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:23:08AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > +/*
> > + * PKS is independent of PKU and either or both may be supported on a CPU.
> > + * Configure PKS if the cpu supports the feature.
> > + */
> 
> Let's at least be consistent about CPU vs. cpu in a single comment. :)

Sorry, done.

> 
> > +static void setup_pks(void)
> > +{
> > +   if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> > +   return;
> > +   if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +   return;
> 
> If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
> the explicit CONFIG_ check.

Done.

> 
> > +   cr4_set_bits(X86_CR4_PKS);
> > +}
> > +
> >  /*
> >   * This does the hard work of actually picking apart the CPU stuff...
> >   */
> > @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  
> > x86_init_rdrand(c);
> > setup_pku(c);
> > +   setup_pks();
> >  
> > /*
> >  * Clear/Set all flags overridden by options, need do it
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6c974888f86f..1b9bc004d9bc 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
> > bool
> >  config ARCH_HAS_PKEYS
> > bool
> > +config ARCH_HAS_SUPERVISOR_PKEYS
> > +   bool
> >  
> >  config PERCPU_STATS
> > bool "Collect percpu memory statistics"
> > 
> 


Re: [PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-03 Thread Ira Weiny
On Tue, Nov 03, 2020 at 07:50:24AM +0100, Greg KH wrote:
> On Mon, Nov 02, 2020 at 12:53:15PM -0800, ira.we...@intel.com wrote:
> > From: Fenghua Yu 
> > 

[snip]

> > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > index 2955ba976048..0959a4c0ca64 100644
> > --- a/include/linux/pkeys.h
> > +++ b/include/linux/pkeys.h
> > @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void)
> >  
> >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> >  
> > +#define PKS_FLAG_EXCLUSIVE 0x00
> > +
> > +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +static inline int pks_key_alloc(const char * const pkey_user, int flags)
> > +{
> > +   return -EOPNOTSUPP;
> > +}
> > +static inline void pks_key_free(int pkey)
> > +{
> > +}
> > +static inline void pks_mk_noaccess(int pkey)
> > +{
> > +   WARN_ON_ONCE(1);
> 
> So for panic-on-warn systems, this is ok to reboot the box?

I would not expect this to reboot the box, no.  But it is a violation of the API
contract.  If pks_key_alloc() returns an error, calling any of the other
functions is an error.

> 
> Are you sure, that feels odd...

It does feel odd and downright wrong...  But there are a lot of WARN_ON_ONCE's
out there to catch this type of internal programming error.  Is panic-on-warn
commonly used?

Ira


Re: [PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-03 Thread Ira Weiny
On Tue, Nov 03, 2020 at 07:14:07PM +0100, Greg KH wrote:
> On Tue, Nov 03, 2020 at 09:53:36AM -0800, Ira Weiny wrote:
> > On Tue, Nov 03, 2020 at 07:50:24AM +0100, Greg KH wrote:
> > > On Mon, Nov 02, 2020 at 12:53:15PM -0800, ira.we...@intel.com wrote:
> > > > From: Fenghua Yu 
> > > > 
> > 
> > [snip]
> > 
> > > > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > > > index 2955ba976048..0959a4c0ca64 100644
> > > > --- a/include/linux/pkeys.h
> > > > +++ b/include/linux/pkeys.h
> > > > @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void)
> > > >  
> > > >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> > > >  
> > > > +#define PKS_FLAG_EXCLUSIVE 0x00
> > > > +
> > > > +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > > > +static inline int pks_key_alloc(const char * const pkey_user, int 
> > > > flags)
> > > > +{
> > > > +   return -EOPNOTSUPP;
> > > > +}
> > > > +static inline void pks_key_free(int pkey)
> > > > +{
> > > > +}
> > > > +static inline void pks_mk_noaccess(int pkey)
> > > > +{
> > > > +   WARN_ON_ONCE(1);
> > > 
> > > So for panic-on-warn systems, this is ok to reboot the box?
> > 
> > I would not expect this to reboot the box no.  But it is a violation of the 
> > API
> > contract.  If pky_key_alloc() returns an error calling any of the other
> > functions is an error.
> > 
> > > 
> > > Are you sure, that feels odd...
> > 
> > It does feel odd and downright wrong...  But there are a lot of 
> > WARN_ON_ONCE's
> > out there to catch this type of internal programming error.  Is 
> > panic-on-warn
> > commonly used?
> 
> Yes it is, and we are trying to recover from that as it is something
> that you should recover from.  Properly handle the error and move on.

Sorry, I did not know that...  Ok I'll look at the series because I probably
have others I need to change.

Thanks,
Ira
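
For what it's worth, one possible shape for the unsupported-case stub that
avoids tripping panic-on-warn would be (a sketch only, not what was posted):

	#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
	static inline void pks_mk_noaccess(int pkey)
	{
		/* API misuse, but recoverable: log once instead of WARN. */
		pr_err_once("PKS not supported: %s called with pkey %d\n",
			    __func__, pkey);
	}
	#endif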


[PATCH V2 01/10] x86/pkeys: Create pkeys_common.h

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.  Normally, these
macros would be put in asm/pkeys.h to be used internally and externally
to the arch code.  However, the defines are required in pgtable.h and
inclusion of pkeys.h in that header creates complex dependencies which
are best resolved in a separate header.

Share these defines by moving them into a new header, change their names
to reflect the common use, and include the header where needed.

Reviewed-by: Dave Hansen 
Signed-off-by: Ira Weiny 

---
NOTE: The initialization of init_pkru_value cause checkpatch errors
because of the space after the '(' in the macros.  We leave this as is
because it is more readable in this format.  And it was existing code.

---
Changes from RFC V3
Per Dave Hansen
Update commit message
Add comment to PKR_AD_KEY macro
---
 arch/x86/include/asm/pgtable.h  | 13 ++---
 arch/x86/include/asm/pkeys.h|  2 ++
 arch/x86/include/asm/pkeys_common.h | 15 +++
 arch/x86/kernel/fpu/xstate.c|  8 
 arch/x86/mm/pkeys.c | 14 ++
 5 files changed, 33 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..bfbfb951fe65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-   return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+   return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
/*
 * Access-disable disables writes too so we need to check
 * both bits here.
 */
-   return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+   return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include <asm/pkeys_common.h>
+
 #define ARCH_DEFAULT_PKEY  0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index ..737d916f476c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+/*
+ * Generate an Access-Disable mask for the given pkey.  Several of these can be
+ * OR'd together to generate pkey register values.
+ */
+#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..a99afc70cc0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
unsigned long init_val)
 {
u32 old_pkru;
-   int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkru_bits = 0;
 
/*
@@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
 
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKRU_AD_BIT;
+   new_pkru_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKRU_WD_BIT;
+   new_pkru_bits |= PKR_WD_BIT;
 
/* Shift the bits in to the correct place in 

[PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-02 Thread ira . weiny
From: Fenghua Yu 

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mknoaccess(),
pks_mkread(), pks_mkrdwr(), and pks_key_free().  Add 2 new macros;
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from V1
Per Dave Hansen
Add flags to pks_key_alloc() to help future proof the
interface if/when the key space is exhausted.

Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 102 ++---
 arch/x86/include/asm/pgtable_types.h   |  12 ++
 arch/x86/include/asm/pkeys.h   |  11 ++
 arch/x86/include/asm/pkeys_common.h|   4 +
 arch/x86/mm/pkeys.c| 126 +
 include/linux/pgtable.h|   4 +
 include/linux/pkeys.h  |  24 
 7 files changed, 265 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..c4e6c480562f 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  And It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page (User and Supervisor).  Each of these 32-bit registers stores two separate
+bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  
These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,58 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+The following interface i
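
A minimal usage sketch of the interface named in this changelog (illustrative
only; allocation-failure handling and the actual page-table setup are elided):

	int pkey = pks_key_alloc("pmem", PKS_FLAG_EXCLUSIVE);

	if (pkey >= 0) {
		/* pages mapped with PAGE_KERNEL_PKEY(pkey) default to no access */
		pks_mk_readwrite(pkey);		/* this thread may read/write the domain */
		/* ... access the protected mapping ... */
		pks_mk_noaccess(pkey);		/* drop access again */
		pks_key_free(pkey);
	}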

[PATCH V2 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule in, check the saved task MSR vs the per-cpu value.  If
different proceed to write the MSR.  If not avoid the overhead of the
MSR write and continue.

Follow on patches will update the saved PKRS as well as the MSR if
needed.

Finally it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is
not serializing but still maintains ordering properties similar to
WRPKRU.  The current SDM section on PKRS needs updating but should be
the same as that of WRPKRU.  So to quote from the WRPKRU text:

WRPKRU will never execute transiently. Memory accesses affected
by PKRU register will not execute (even transiently) until all
prior executions of WRPKRU have completed execution and updated
the PKRU register.

Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from V1
Rebase to latest tip/master
Resolve conflicts with INIT_THREAD changes

Changes since RFC V3
Per Dave Hansen
Update commit message
move saved_pkrs to be in a nicer place
Per Peter Zijlstra
Add Comment from Peter
Clean up white space
Update authorship
---
 arch/x86/include/asm/msr-index.h|  1 +
 arch/x86/include/asm/pkeys_common.h | 20 +++
 arch/x86/include/asm/processor.h| 18 -
 arch/x86/kernel/cpu/common.c|  2 ++
 arch/x86/kernel/process.c   | 26 
 arch/x86/mm/pkeys.c | 31 +
 6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 972a34d93505..ddb125e44408 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -754,6 +754,7 @@
 
 #define MSR_IA32_TSC_DEADLINE  0x06E0
 
+#define MSR_IA32_PKRS  0x06E1
 
 #define MSR_TSX_FORCE_ABORT0x010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
index 737d916f476c..801a75615209 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -12,4 +12,24 @@
  */
 #define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 60dbcdcb833f..78eb5f483410 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
 #include 
 #include 
 #include 
+#include <asm/pkeys_common.h>
 #include 
 #include 
 #include 
@@ -522,6 +523,12 @@ struct thread_struct {
unsigned long   cr2;
unsigned long   trap_nr;
unsigned long   error_code;
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   /* Saved Protection key register for supervisor mappings */
+   u32 saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
/* Virtual 86 mode info */
struct vm86 *vm86;
@@ -787,7 +794,16 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task) (task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS   .saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS   0
+#endif
+
+#define INIT_THREAD  { \
+   INIT_THREAD_PKRS,   \
+}
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/commo

[PATCH V2 07/10] x86/entry: Pass irqentry_state_t by reference

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Currently struct irqentry_state_t only contains a single bool value,
which makes passing it by value reasonable.  However, future patches
propose to add information to this struct, for example the PKRS
register/thread state.

Adding information to irqentry_state_t makes passing by value less
efficient.  Therefore, change the entry/exit calls to pass irq_state by
reference.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.

Signed-off-by: Ira Weiny 

---
Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
 arch/x86/entry/common.c |  8 
 arch/x86/include/asm/idtentry.h | 25 ++--
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/kvm.c   |  6 +++---
 arch/x86/kernel/nmi.c   |  4 ++--
 arch/x86/kernel/traps.c | 21 
 arch/x86/mm/fault.c |  6 +++---
 include/linux/entry-common.h| 18 +
 kernel/entry/common.c   | 34 +
 9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   irqentry_state_t irq_state;
 
-   state = irqentry_enter(regs);
+   irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
set_irq_regs(old_regs);
 
inhcall = get_and_clear_inhcall();
-   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+   if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   irqentry_exit(regs, &irq_state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); 
\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; 
\
\
+   irqentry_enter(regs, &irq_state);
\
instrumentation_begin();\
__##func (regs);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_exit(regs, &irq_state);
\
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs,  
\
 __visible noinstr void func(struct pt_regs *regs,  \
unsigned long error_code)   \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; 
\
\
+   irqentry_enter(regs, &irq_state);
\
instrumentation_begin();\
__##func (regs, error_code);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_e

[PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series.  That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

Two use cases for PKS are being developed, trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically, some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare, it needs to be included for correctness.  Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites, so that patch set has been split out.  Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:


https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections, and access or writes can be
disabled via an MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
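
As a rough illustration (this is not the series' write_pkrs(), which also
maintains a per-cpu cache of the MSR value), changing the permissions for
one supervisor pkey is a read-modify-write of the 32-bit MSR, two bits per
key:

    /*
     * Sketch only: make supervisor pkey 'pkey' read-only on this CPU.
     * MSR_IA32_PKRS, PKR_WD_BIT and PKR_BITS_PER_PKEY come from this
     * series; caching and error handling are omitted.
     */
    static void example_pkey_readonly(int pkey)
    {
            int shift = pkey * PKR_BITS_PER_PKEY;
            u64 pkrs;

            rdmsrl(MSR_IA32_PKRS, pkrs);
            pkrs &= ~(((1ULL << PKR_BITS_PER_PKEY) - 1) << shift); /* clear AD/WD */
            pkrs |= (u64)PKR_WD_BIT << shift;                      /* set WD only */
            wrmsrl(MSR_IA32_PKRS, pkrs);
    }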

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys (1-15) are allocated by an allocator which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.
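
A hypothetical consumer of the API would look roughly like the sketch
below.  The helper names follow this series; the argument list of
pks_key_alloc() and the PTE/pgprot plumbing are illustrative only:

    /* Sketch: guard a driver's mappings with a supervisor pkey. */
    static int example_use_pks(void)
    {
            int pkey = pks_key_alloc("example", 0);

            if (pkey < 0)
                    return pkey;    /* -EOPNOTSUPP or key exhaustion */

            /* ... create the mapping with this pkey encoded in its PTEs ... */

            pks_mk_noaccess(pkey);  /* closed by default for this thread */

            pks_mk_readwrite(pkey); /* open an access window...          */
            /* ... touch the protected memory ...                        */
            pks_mk_noaccess(pkey);  /* ... and close it again            */

            return 0;
    }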

The following are key attributes of PKS.

   1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
   2) Works on a per thread basis

PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.


[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.we...@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.we...@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.we...@intel.com/


Fenghua Yu (2):
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API

Ira Weiny (7):
  x86/pkeys: Cre

[PATCH V2 10/10] x86/pks: Add PKS test code

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The core PKS functionality provides an interface for kernel users to
reserve keys to their domains, set up the page tables with those keys, and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
   this fd
'2' -- Check that the pkey register updated in '1' is still the same.
   (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
   DEFAULT when run at boot.

Closing the fd will clean up and release the pkey; therefore, to exercise
context switch testing, a user space program is provided in:

.../tools/testing/selftests/x86/test_pks.c
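
A minimal user-space sequence driving that interface might look like the
sketch below; the debugfs path and the context-switch trigger are
assumptions based on the description above, and the real program is the
selftest referenced above:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            /* Path is an assumption; use whatever the test module registers. */
            int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);

            if (fd < 0)
                    return 1;
            write(fd, "1", 1);  /* remove access for this fd's pkey */
            sleep(1);           /* allow a context switch to occur  */
            write(fd, "2", 1);  /* check the pkey register survived */
            close(fd);          /* cleans up and releases the pkey  */
            return 0;
    }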

Reviewed-by: Dave Hansen 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Ira Weiny 

---
Changes for V1
Update for new pks_key_alloc()

Changes from RFC V3
Comments from Dave Hansen
Clean up whitespace damage
Clean up Kconfig help
Clean up user test error output
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Comments from Jing Han
Remove duplicate stdio.h
---
 Documentation/core-api/protection-keys.rst |   1 +
 arch/x86/mm/fault.c|  23 +
 lib/Kconfig.debug  |  12 +
 lib/Makefile   |   3 +
 lib/pks/Makefile   |   3 +
 lib/pks/pks_test.c | 691 +
 tools/testing/selftests/x86/Makefile   |   3 +-
 tools/testing/selftests/x86/test_pks.c |  66 ++
 8 files changed, 801 insertions(+), 1 deletion(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index c4e6c480562f..8ffdfbff013c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -164,3 +164,4 @@ of WRPKRU.  So to quote from the WRPKRU text:
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.
 
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 931603102010..0a51b168e8ee 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* efi_recover_from_page_fault()*/
 #include 
+#include 
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t 
*irq_state)
+{
+   /*
+* If we get a protection key exception it could be because we
+* are running the PKS test.  If so, pks_test_callback() will
+* clear the protection mechanism and return true to indicate
+* the fault was handled.
+*/
+   return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t 
*irq_state)
+{
+   return false;
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long 
hw_error_code,
if (!cpu_feature_enabled(X86_FEATURE_PKS))
WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+   if (handle_pks_testing(hw_error_code, irq_state))
+   return;
+
 #ifdef CONFIG_X86_32
/*
 * We can fault-in kernel-space virtual memory on-demand. The
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d7a7bc3b6098..028beedd86f5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2444,6 +2444,18 @@ config HYPERV_TESTING
help
  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+   bool "PKey (S)upervisor testing"
+   default n
+   depends on ARCH_HAS_SUPERVISOR_PKEYS
+   help
+ Select this option to enable testing of PKS core software and
+ hardware.  The PKS core provides a mechanism to allocate keys 

[PATCH V2 09/10] x86/fault: Report the PKRS state on fault

2020-11-02 Thread ira . weiny
From: Ira Weiny 

When only user space pkeys are enabled, faulting within the kernel was an
unexpected condition which should never happen.  Therefore a WARN_ON in
the kernel fault handler would detect if it ever did.  Now this is no
longer the case if PKS is enabled and supported.

Report a Pkey fault with a normal splat and add the PKRS state to the
fault splat text.  Note the PKRS register is reset during an exception,
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state, the PKRS information will be
correctly reported even if a fault 'faults'.

Suggested-by: Andy Lutomirski 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Update commit message
Per Dave Hansen
Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS)
Fix comment
Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of
disabled-features.h
---
 arch/x86/mm/fault.c | 58 ++---
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8d20c4c13abf..931603102010 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const 
char *name, u16 index)
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long 
address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long 
address,
+   irqentry_state_t *irq_state)
 {
if (!oops_may_print())
return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
error_code, unsigned long ad
 (error_code & X86_PF_PK)? "protection keys violation" :
   "permissions violation");
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & 
X86_PF_PK))
+   pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
+#endif
+
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
struct desc_ptr idt, gdt;
u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int signal, int si_code)
+  unsigned long address, int signal, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 */
flags = oops_begin();
 
-   show_fault_oops(regs, error_code, address);
+   show_fault_oops(regs, error_code, address, irq_state);
 
if (task_stack_end_corrupted(tsk))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, u32 pkey, int si_code)
+  unsigned long address, u32 pkey, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
 
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned 
long error_code,
if (is_f00f_bug(regs, address))
return;
 
-   no_context(regs, error_code, address, SIGSEGV, si_code);
+   no_context(regs, error_code, address, SIGSEGV, si_code, irq_state);
 }
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
+unsigned long address, irqentry_state_t *irq_state)
 {
-   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, 
irq_state);
 }
 
 static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 */
mmap_read_unlock(mm);
 
-   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
 }
 
 static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
return;
}
 
@@ -957,7 +965,7 @@ mm_fault_

[PATCH V2 08/10] x86/entry: Preserve PKRS MSR across exceptions

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward.  It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry.

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restore them on exception exit.

Each nested exception is further saved allowing for any number of levels
of exception handling.

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] 
https://lore.kernel.org/lkml/calcetrve1i5jdyzd_bcctxqjn+ze3t38efpgjxn1f577m36...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf@nanos.tec.linutronix.de/#t

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Suggested-by: Peter Zijlstra 
Suggested-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
remove redundant irq_state->pkrs
This value is only needed for the global tracking.  So
it should be included in that patch and not in this one.

Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
 arch/x86/entry/common.c | 38 +
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/mm/pkeys.c |  2 +-
 include/linux/entry-common.h| 13 ++
 kernel/entry/common.c   | 14 +--
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..1b6a419a6fac 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   irq_state->thread_pkrs = current->thread.saved_pkrs;
+   write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(irq_state->thread_pkrs);
+   current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
 
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+   /* Normally called by irqentry_exit, we must restore pkrs here 
*/
+   irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
index cd492c23b28c..f921c58793f9 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -31,9 +31,10 @@
 #define PKS_NUM_KEYS   16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 0dc77409957a..39ec034cf379 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -252,7 +252,7 @@ DEFINE_PER_CPU(u3

[PATCH V2 06/10] x86/entry: Move nmi entry/exit into common code

2020-11-02 Thread ira . weiny
From: Thomas Gleixner 

Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t
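
The resulting structure is roughly the following (field names are taken
from the usages in this series; the actual kernel doc lives in
include/linux/entry-common.h):

    typedef struct irqentry_state {
            union {
                    bool    exit_rcu;   /* irqentry_enter()/irqentry_exit() */
                    bool    lockdep;    /* irqentry_nmi_enter()/_nmi_exit() */
            };
    } irqentry_state_t;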

Signed-off-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
Update commit message
Add kernel doc for struct irqentry_state_t
With Thomas' help
---
 arch/x86/entry/common.c | 34 
 arch/x86/include/asm/idtentry.h |  3 ---
 arch/x86/kernel/cpu/mce/core.c  |  6 ++---
 arch/x86/kernel/nmi.c   |  6 ++---
 arch/x86/kernel/traps.c | 13 ++-
 include/linux/entry-common.h| 39 -
 kernel/entry/common.c   | 36 ++
 7 files changed, 87 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..18d8f17f755c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
-   bool irq_state = lockdep_hardirqs_enabled();
-
-   __nmi_enter();
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   lockdep_hardirq_enter();
-   rcu_nmi_enter();
-
-   instrumentation_begin();
-   trace_hardirqs_off_finish();
-   ftrace_nmi_enter();
-   instrumentation_end();
-
-   return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
-   instrumentation_begin();
-   ftrace_nmi_exit();
-   if (restore) {
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   }
-   instrumentation_end();
-
-   rcu_nmi_exit();
-   lockdep_hardirq_exit();
-   if (restore)
-   lockdep_hardirqs_on(CALLER_ADDR0);
-   __nmi_exit();
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb0ac2f..247a60a47331 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@
 
 #include 
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *   No error code pushed by hardware
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 51bf910b1e9d..403561a89c28 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1981,7 +1981,7 @@ void (*machine_check_vector)(struct pt_regs *) = 
unexpected_machine_check;
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
WARN_ON_ONCE(user_mode(regs));
 
@@ -1993,7 +1993,7 @@ static __always_inline void 
exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
/*
 * The call targets are marked noinstr, but objtool can't figure
 * that out because it's an indirect call. Annotate it.
@@ -2004,7 +2004,7 @@ static __always_inline void 
exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..bf250a339655 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
/*
 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
this_cpu_write(nmi_dr7, local_db_save());
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
 
inc_irq_stat(__nmi_count);
 
if (!ignore_nmis)
default_do_nmi(regs);
 
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 
local_db_restore(this_cpu_read(nmi_dr7));
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e19df6cde35d..e1b78829d909 100644
--- a/arch/x86/kernel/

[PATCH V2 02/10] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3:
Per Dave Hansen
Update and add comments per Dave's review
Per Peter
Correct attribution
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 --
 arch/x86/mm/pkeys.c  | 23 +++
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a99afc70cc0a..a3bca3211eba 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
-   u32 old_pkru;
-   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-   u32 new_pkru_bits = 0;
+   u32 pkru;
 
/*
 * This check implies XSAVE support.  OSPKE only gets
@@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
 */
WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-   /* Set the bits we need in PKRU:  */
-   if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKR_AD_BIT;
-   if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKR_WD_BIT;
-
-   /* Shift the bits in to the correct place in PKRU for pkey: */
-   new_pkru_bits <<= pkey_shift;
-
-   /* Get old PKRU and mask off any old bits in place: */
-   old_pkru = read_pkru();
-   old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-   /* Write old part along with new part: */
-   write_pkru(old_pkru | new_pkru_bits);
+   pkru = read_pkru();
+   pkru = update_pkey_val(pkru, pkey, init_val);
+   write_pkru(pkru);
 
return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..d1dfe743e79f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt)
return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+   /*  Mask out old bit values */
+   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+   /*  Or in new values */
+   if (flags & PKEY_DISABLE_ACCESS)
+   pk_reg |= PKR_AD_BIT << pkey_shift;
+   if (flags & PKEY_DISABLE_WRITE)
+   pk_reg |= PKR_WD_BIT << pkey_shift;
+
+   return pk_reg;
+}
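
As a worked example of the helper above (assuming PKR_AD_BIT == 0x1,
PKR_WD_BIT == 0x2 and two bits per key, as used elsewhere in this series):

    /*
     * Disable writes for pkey 3 only: pkey_shift = 3 * 2 = 6, so just
     * bits 6-7 are rewritten and every other key is left untouched.
     */
    u32 reg = update_pkey_val(0, 3, PKEY_DISABLE_WRITE);
    /* reg == PKR_WD_BIT << 6 == 0x80 */
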
-- 
2.28.0.rc0.12.gb6a658bd00c9



[PATCH V2 03/10] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-11-02 Thread ira . weiny
From: Fenghua Yu 

Protection Keys for Supervisor pages (PKS) enables fast, hardware-thread-specific
manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS and then enable PKS when configured and
indicated by the CPU instance.  While not strictly necessary in this
patch, ARCH_HAS_SUPERVISOR_PKEYS separates this functionality through
the patch series so it is introduced here.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes since RFC V3
Per Dave Hansen
Update comment
Add X86_FEATURE_PKS to disabled-features.h
Rebase based on latest TIP tree
---
 arch/x86/Kconfig|  1 +
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/disabled-features.h|  8 +++-
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c| 13 +
 mm/Kconfig  |  2 ++
 6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..78c4c749c6a9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+   select ARCH_HAS_SUPERVISOR_PKEYS
help
  Memory Protection Keys provides a mechanism for enforcing
  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..4deb580324e8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -356,6 +356,7 @@
 #define X86_FEATURE_MOVDIRI    (16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B  (16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD     (16*32+29) /* ENQCMD and ENQCMDS instructions */
+#define X86_FEATURE_PKS        (16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery 
support */
diff --git a/arch/x86/include/asm/disabled-features.h 
b/arch/x86/include/asm/disabled-features.h
index 5861d34f9771..82540f0c5b6c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+# define DISABLE_PKS   0
+#else
+# define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57  0
 #else
@@ -82,7 +88,7 @@
 #define DISABLED_MASK14        0
 #define DISABLED_MASK15        0
 #define DISABLED_MASK16        (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-                                DISABLE_ENQCMD)
+                                DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17        0
 #define DISABLED_MASK18        0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP   _BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT        22 /* enable Protection Keys support */
 #define X86_CR4_PKE            _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT        24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS            _BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..6a9ca938d9a9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1494,6 +1494,18 @@ static void validate_apic_and_package_id(struct 
cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+static void setup_pks(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually

Re: [PATCH v2 1/1] mm: Optimizing hugepage zeroing in arm64

2021-02-03 Thread Ira Weiny
On Wed, Feb 03, 2021 at 04:08:08PM +0530, Prathu Baronia wrote:
>Hey Ira,
>I looked at your below-mentioned patch and I agree that the
>above-mentioned functions also need modification similar to
>clear_user_highpage().
>Would it be okay with you if I send your patch again with a modified
>commit message by adding my data and maintaining your authorship?
>
> [1]https://lore.kernel.org/lkml/20201210171834.2472353-2-ira.we...@intel.com/

Sure.  I have not changed the patch at all from that version.

Andrew, will this be going through your tree?  If not who?

If you take the above patch I can drop it from the series I'm about to submit
to convert btrfs kmaps.

Ira

>Regards,
>Prathu Baronia
> 
>    On Wed, Feb 3, 2021 at 1:33 AM Ira Weiny <[2]ira.we...@intel.com> wrote:
> 
>  On Tue, Feb 02, 2021 at 01:12:24PM +0530, Prathu Baronia wrote:
>  > In !HIGHMEM cases, specially in 64-bit architectures, we don't need
>  temp
>  > mapping of pages. Hence, k(map|unmap)_atomic() acts as nothing more
>  than
>  > multiple barrier() calls, for example for a 2MB hugepage in
>  > clear_huge_page() these are called 512 times i.e. to map and unmap
>  each
>  > subpage that means in total 2048 barrier calls. This called for
>  > optimization. Simply getting VADDR from page in the form of
>  kmap_local_*
>  > APIs does the job for us.  We profiled clear_huge_page() using ftrace
>  > and observed an improvement of 62%.
> 
>  Nice!
> 
>  >
>  > Setup:-
>  > Below data has been collected on Qualcomm's SM7250 SoC THP enabled
>  > (kernel
>  > v4.19.113) with only CPU-0(Cortex-A55) and CPU-7(Cortex-A76) switched
>  on
>  > and set to max frequency, also DDR set to perf governor.
>  >
>  > FTRACE Data:-
>  >
>  > Base data:-
>  > Number of iterations: 48
>  > Mean of allocation time: 349.5 us
>  > std deviation: 74.5 us
>  >
>  > v1 data:-
>  > Number of iterations: 48
>  > Mean of allocation time: 131 us
>  > std deviation: 32.7 us
>  >
>  > The following simple userspace experiment to allocate
>  > 100MB(BUF_SZ) of pages and writing to it gave us a good insight,
>  > we observed an improvement of 42% in allocation and writing timings.
>  > -
>  > Test code snippet
>  > -
>  >       clock_start();
>  >       buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
>  >
>  >         for(i=0; i < BUF_SZ_PAGES; i++)
>  >         {
>  >                 *((int *)(buf + (i*PAGE_SIZE))) = 1;
>  >         }
>  >       clock_end();
>  > -
>  >
>  > Malloc test timings for 100MB anon allocation:-
>  >
>  > Base data:-
>  > Number of iterations: 100
>  > Mean of allocation time: 31831 us
>  > std deviation: 4286 us
>  >
>  > v1 data:-
>  > Number of iterations: 100
>  > Mean of allocation time: 18193 us
>  > std deviation: 4915 us
>  >
>  > Reported-by: Chintan Pandya <[3]chintan.pan...@oneplus.com>
>  > Signed-off-by: Prathu Baronia <[4]prathu.baro...@oneplus.com>
> 
>  Reviewed-by: Ira Weiny <[5]ira.we...@intel.com>
> 
>  FWIW, I have the same change in a patch in my kmap() changes branch. 
>  However,
>  my patch also changes clear_highpage(), zero_user_segments(),
>  copy_user_highpage(), and copy_highpage().
> 
>  Would changing those help you as well?
> 
>  Ira
> 
>  > ---
>  >  include/linux/highmem.h | 4 ++--
>  >  1 file changed, 2 insertions(+), 2 deletions(-)
>  >
>  > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>  > index d2c70d3772a3..444df139b489 100644
>  > --- a/include/linux/highmem.h
>  > +++ b/include/linux/highmem.h
>  > @@ -146,9 +146,9 @@ static inline void
>  invalidate_kernel_vmap_range(void *vaddr, int size)
>  >  #ifndef clear_user_highpage
>  >  static inline void clear_user_highpage(struct page *page, unsigned
>  long vaddr)
>  >  {
>  > -     void *addr = kmap_atomic(page);
>  > +     void *addr = kmap_local_page(page);
>  >       clear_user_page(addr, vaddr, page);
>  > -     kunmap_atomic(addr);
>  > +     kunmap_local(addr);
>  >  }
>  >  #endif
>  > 
>  > --
>  > 2.17.1
>  >
> 
> References
> 
>Visible links
>1. 
> https://lore.kernel.org/lkml/20201210171834.2472353-2-ira.we...@intel.com/
>2. mailto:ira.we...@intel.com
>3. mailto:chintan.pan...@oneplus.com
>4. mailto:prathu.baro...@oneplus.com
>5. mailto:ira.we...@intel.com


[PATCH] x86: Remove unnecessary kmap() from sgx_ioc_enclave_init()

2021-02-01 Thread ira . weiny
From: Ira Weiny 

kmap is inefficient and we are trying to reduce the usage in the kernel.
There is no readily apparent reason why the initp_page page needs to be
allocated and kmap'ed(), but sigstruct needs to be page aligned and the
token 512 byte aligned.

In this case page_address() can be used instead of kmap_local_page() as
a much more efficient way to use the address because the page is
allocated GFP_KERNEL.

Remove the kmap and replace with page_address() to get a kernel address
for the alloc'ed page.

In addition add a comment regarding the alignment requirements as well
as 2 BUILD_BUG_ON's to ensure future changes to sigstruct and token do
not go unnoticed and cause a bug.

Cc: Sean Christopherson ,
Cc: Jethro Beekman ,
Signed-off-by: Ira Weiny 

---
Changes from v1[1]:
Use page_address() instead of kcmalloc() to ensure sigstruct is
page aligned
Use BUILD_BUG_ON to ensure token and sigstruct don't collide.

[1] https://lore.kernel.org/lkml/20210129001459.1538805-1-ira.we...@intel.com/
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 90a5caf76939..678b02d67c3c 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -615,11 +615,18 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
return -EFAULT;
 
+   /*
+* sigstruct must be on a page boundary and token on a 512 byte boundary
+* so use alloc_page/page_address instead of a kmalloc().
+*/
initp_page = alloc_page(GFP_KERNEL);
if (!initp_page)
return -ENOMEM;
 
-   sigstruct = kmap(initp_page);
+   sigstruct = page_address(initp_page);
+
+   BUILD_BUG_ON(sizeof(*sigstruct) > (PAGE_SIZE/2));
+   BUILD_BUG_ON(SGX_LAUNCH_TOKEN_SIZE > (PAGE_SIZE/2));
token = (void *)((unsigned long)sigstruct + PAGE_SIZE / 2);
memset(token, 0, SGX_LAUNCH_TOKEN_SIZE);
 
@@ -645,7 +652,6 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
ret = sgx_encl_init(encl, sigstruct, token);
 
 out:
-   kunmap(initp_page);
__free_page(initp_page);
return ret;
 }
-- 
2.28.0.rc0.12.gb6a658bd00c9



[PATCH] fs/coredump: Use kmap_local_page()

2021-02-03 Thread ira . weiny
From: Ira Weiny 

In dump_user_range() there is no reason for the mapping to be global.
Use kmap_local_page() rather than kmap.

Cc: Andrew Morton 
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: Ira Weiny 
---
 fs/coredump.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a2f6ecc8e345..53f63e176a2a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -894,10 +894,10 @@ int dump_user_range(struct coredump_params *cprm, 
unsigned long start,
 */
page = get_dump_page(addr);
if (page) {
-   void *kaddr = kmap(page);
+   void *kaddr = kmap_local_page(page);
 
stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
-   kunmap(page);
+   kunmap_local(kaddr);
put_page(page);
} else {
stop = !dump_skip(cprm, PAGE_SIZE);
-- 
2.28.0.rc0.12.gb6a658bd00c9



[PATCH] fs/btrfs: Fix raid6 qstripe kmap'ing

2021-01-27 Thread ira . weiny
From: Ira Weiny 

When a qstripe is required an extra page is allocated and mapped.  There
were 3 problems.

1) There is no reason to map the qstripe page more than 1 time if the
   number of bits set in rbio->dbitmap is greater than one.
2) There is no reason to map the parity page and unmap it each time
   through the loop.
3) There is no corresponding call of kunmap() for the qstripe page.

The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping.  So map the
page for the duration of the loop.

Similarly, improve the algorithm by mapping the parity page just 1 time.

Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
To: Chris Mason 
To: Josef Bacik 
To: David Sterba 
Cc: Miao Xie 
Signed-off-by: Ira Weiny 

---
This was found while replacing kmap() with kmap_local_page().  After
this patch unwinding all the mappings becomes pretty straight forward.

I'm not exactly sure I've worded this commit message intelligently.
Please forgive me if there is a better way to word it.
---
 fs/btrfs/raid56.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 93fbf87bdc8d..b8a39dad0f00 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2363,16 +2363,21 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
SetPageUptodate(p_page);
 
if (has_qstripe) {
+   /* raid6, allocate and map temp space for the qstripe */
q_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
if (!q_page) {
__free_page(p_page);
goto cleanup;
}
SetPageUptodate(q_page);
+   pointers[rbio->real_stripes] = kmap(q_page);
}
 
atomic_set(&rbio->error, 0);
 
+   /* map the parity stripe just once */
+   pointers[nr_data] = kmap(p_page);
+
for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
struct page *p;
void *parity;
@@ -2382,16 +2387,8 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
pointers[stripe] = kmap(p);
}
 
-   /* then add the parity stripe */
-   pointers[stripe++] = kmap(p_page);
-
if (has_qstripe) {
-   /*
-* raid6, add the qstripe and call the
-* library function to fill in our p/q
-*/
-   pointers[stripe++] = kmap(q_page);
-
+   /* raid6, call the library function to fill in our p/q 
*/
raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
pointers);
} else {
@@ -2412,12 +2409,14 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
 
for (stripe = 0; stripe < nr_data; stripe++)
kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
-   kunmap(p_page);
}
 
+   kunmap(p_page);
__free_page(p_page);
-   if (q_page)
+   if (q_page) {
+   kunmap(q_page);
__free_page(q_page);
+   }
 
 writeback:
/*
-- 
2.28.0.rc0.12.gb6a658bd00c9



[PATCH] x86: Remove unnecessary kmap() from sgx_ioc_enclave_init()

2021-01-28 Thread ira . weiny
From: Ira Weiny 

There is no reason to alloc a page and kmap it to store this temporary
data from the user.  This is especially true when we are trying to
remove kmap usages.  Also placing the token pointer 1/2 way into the
page is fragile.

Replace this allocation with two kzalloc()'s which also removes the need
for the memset().

Signed-off-by: Ira Weiny 
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 90a5caf76939..9c9019760585 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -604,7 +604,6 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
 {
struct sgx_sigstruct *sigstruct;
struct sgx_enclave_init init_arg;
-   struct page *initp_page;
void *token;
int ret;
 
@@ -615,13 +614,15 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
return -EFAULT;
 
-   initp_page = alloc_page(GFP_KERNEL);
-   if (!initp_page)
+   sigstruct = kzalloc(sizeof(*sigstruct), GFP_KERNEL);
+   if (!sigstruct)
return -ENOMEM;
 
-   sigstruct = kmap(initp_page);
-   token = (void *)((unsigned long)sigstruct + PAGE_SIZE / 2);
-   memset(token, 0, SGX_LAUNCH_TOKEN_SIZE);
+   token = kzalloc(SGX_LAUNCH_TOKEN_SIZE, GFP_KERNEL);
+   if (!token) {
+   ret = -ENOMEM;
+   goto free_sigstruct;
+   }
 
if (copy_from_user(sigstruct, (void __user *)init_arg.sigstruct,
   sizeof(*sigstruct))) {
@@ -645,8 +646,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
ret = sgx_encl_init(encl, sigstruct, token);
 
 out:
-   kunmap(initp_page);
-   __free_page(initp_page);
+   kfree(token);
+free_sigstruct:
+   kfree(sigstruct);
return ret;
 }
 
-- 
2.28.0.rc0.12.gb6a658bd00c9



[PATCH V3] x86: Remove unnecessary kmap() from sgx_ioc_enclave_init()

2021-02-02 Thread ira . weiny
From: Ira Weiny 

kmap is inefficient and we are trying to reduce the usage in the kernel.
There is no readily apparent reason why initp_page needs to be allocated
and kmap'ed(), but sigstruct needs to be page aligned and the token
512 byte aligned.

kmalloc() can give us this alignment but we need to allocate PAGE_SIZE
bytes to do so.  Rather than change this kmap() to kmap_local_page() use
kmalloc() instead.

Remove the alloc_page()/kmap() and replace with kmalloc() to get a
kernel address to use.

In addition add a comment to document the alignment requirements so that
others like myself don't attempt to 'fix' this again.  Finally, add 2
BUILD_BUG_ON's to ensure future changes to sigstruct and token do not go
unnoticed and cause a bug.
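
For reference, the intended layout of the single PAGE_SIZE allocation is
sketched below (offsets follow the code in the diff):

    /*
     * One kmalloc(PAGE_SIZE) buffer:
     *
     *   offset 0           : struct sgx_sigstruct (page aligned thanks to
     *                        the power-of-2 kmalloc() guarantee)
     *   offset PAGE_SIZE/2 : launch token (2048 is 512 byte aligned)
     *
     * The two BUILD_BUG_ON()s keep either object from growing into the
     * other half of the buffer.
     */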

Cc: Dave Hansen 
Cc: Sean Christopherson 
Cc: Jethro Beekman 
Signed-off-by: Ira Weiny 

---
Changes from v2[1]:
When allocating a power of 2 size kmalloc() now guarantees the
alignment of the respective size.  So go back to using kmalloc() but
with a PAGE_SIZE allocation to get the alignment.  This also follows
the pattern in sgx_ioc_enclave_create()

Changes from v1[1]:
Use page_address() instead of kcmalloc() to ensure sigstruct is
page aligned
Use BUILD_BUG_ON to ensure token and sigstruct don't collide.

[1] https://lore.kernel.org/lkml/20210129001459.1538805-1-ira.we...@intel.com/
[2] https://lore.kernel.org/lkml/20210202013725.3514671-1-ira.we...@intel.com/
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 90a5caf76939..e0c3301ccd67 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -604,7 +604,6 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
 {
struct sgx_sigstruct *sigstruct;
struct sgx_enclave_init init_arg;
-   struct page *initp_page;
void *token;
int ret;
 
@@ -615,11 +614,16 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
return -EFAULT;
 
-   initp_page = alloc_page(GFP_KERNEL);
-   if (!initp_page)
+   /*
+* sigstruct must be on a page boundary and token on a 512 byte boundary
+* kmalloc() gives us this alignment when allocating PAGE_SIZE bytes
+*/
+   sigstruct = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   if (!sigstruct)
return -ENOMEM;
 
-   sigstruct = kmap(initp_page);
+   BUILD_BUG_ON(sizeof(*sigstruct) > (PAGE_SIZE/2));
+   BUILD_BUG_ON(SGX_LAUNCH_TOKEN_SIZE > (PAGE_SIZE/2));
token = (void *)((unsigned long)sigstruct + PAGE_SIZE / 2);
memset(token, 0, SGX_LAUNCH_TOKEN_SIZE);
 
@@ -645,8 +649,7 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, 
void __user *arg)
ret = sgx_encl_init(encl, sigstruct, token);
 
 out:
-   kunmap(initp_page);
-   __free_page(initp_page);
+   kfree(sigstruct);
return ret;
 }
 
-- 
2.28.0.rc0.12.gb6a658bd00c9



Re: [PATCH v2 1/1] mm: Optimizing hugepage zeroing in arm64

2021-02-02 Thread Ira Weiny
On Tue, Feb 02, 2021 at 01:12:24PM +0530, Prathu Baronia wrote:
> In !HIGHMEM cases, specially in 64-bit architectures, we don't need temp
> mapping of pages. Hence, k(map|unmap)_atomic() acts as nothing more than
> multiple barrier() calls, for example for a 2MB hugepage in
> clear_huge_page() these are called 512 times i.e. to map and unmap each
> subpage that means in total 2048 barrier calls. This called for
> optimization. Simply getting VADDR from page in the form of kmap_local_*
> APIs does the job for us.  We profiled clear_huge_page() using ftrace
> and observed an improvement of 62%.

Nice!

> 
> Setup:-
> Below data has been collected on Qualcomm's SM7250 SoC THP enabled
> (kernel
> v4.19.113) with only CPU-0(Cortex-A55) and CPU-7(Cortex-A76) switched on
> and set to max frequency, also DDR set to perf governor.
> 
> FTRACE Data:-
> 
> Base data:-
> Number of iterations: 48
> Mean of allocation time: 349.5 us
> std deviation: 74.5 us
> 
> v1 data:-
> Number of iterations: 48
> Mean of allocation time: 131 us
> std deviation: 32.7 us
> 
> The following simple userspace experiment to allocate
> 100MB(BUF_SZ) of pages and writing to it gave us a good insight,
> we observed an improvement of 42% in allocation and writing timings.
> -
> Test code snippet
> -
>   clock_start();
>   buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
> 
> for(i=0; i < BUF_SZ_PAGES; i++)
> {
> *((int *)(buf + (i*PAGE_SIZE))) = 1;
> }
>   clock_end();
> -
> 
> Malloc test timings for 100MB anon allocation:-
> 
> Base data:-
> Number of iterations: 100
> Mean of allocation time: 31831 us
> std deviation: 4286 us
> 
> v1 data:-
> Number of iterations: 100
> Mean of allocation time: 18193 us
> std deviation: 4915 us
> 
> Reported-by: Chintan Pandya 
> Signed-off-by: Prathu Baronia 

Reviewed-by: Ira Weiny 

FWIW, I have the same change in a patch in my kmap() changes branch.  However,
my patch also changes clear_highpage(), zero_user_segments(),
copy_user_highpage(), and copy_highpage().
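
For example, the analogous clear_highpage() change would be (a sketch of
the pending patch, assuming the same substitution applies):

    static inline void clear_highpage(struct page *page)
    {
            void *kaddr = kmap_local_page(page);

            clear_page(kaddr);
            kunmap_local(kaddr);
    }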

Would changing those help you as well?

Ira

> ---
>  include/linux/highmem.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index d2c70d3772a3..444df139b489 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -146,9 +146,9 @@ static inline void invalidate_kernel_vmap_range(void 
> *vaddr, int size)
>  #ifndef clear_user_highpage
>  static inline void clear_user_highpage(struct page *page, unsigned long 
> vaddr)
>  {
> - void *addr = kmap_atomic(page);
> + void *addr = kmap_local_page(page);
>   clear_user_page(addr, vaddr, page);
> - kunmap_atomic(addr);
> + kunmap_local(addr);
>  }
>  #endif
>  
> -- 
> 2.17.1
> 


[PATCH] nvdimm: Trivial comment fix

2019-09-18 Thread ira . weiny
From: Ira Weiny 

Signed-off-by: Ira Weiny 
---
 include/linux/nd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/nd.h b/include/linux/nd.h
index f778f962d1b6..55c735997805 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -147,7 +147,7 @@ static inline int nvdimm_read_bytes(struct 
nd_namespace_common *ndns,
 
 /**
  * nvdimm_write_bytes() - synchronously write bytes to an nvdimm namespace
- * @ndns: device to read
+ * @ndns: device to write
  * @offset: namespace-relative starting offset
  * @buf: buffer to drain
  * @size: transfer length
-- 
2.20.1



Re: [PATCH] libnvdimm/nfit_test: Fix acpi_handle redefinition

2019-09-18 Thread Ira Weiny
On Tue, Sep 17, 2019 at 09:21:49PM -0700, Nathan Chancellor wrote:
> After commit 62974fc389b3 ("libnvdimm: Enable unit test infrastructure
> compile checks"), clang warns:
> 
> In file included from
> ../drivers/nvdimm/../../tools/testing/nvdimm/test/iomap.c:15:
> ../drivers/nvdimm/../../tools/testing/nvdimm/test/nfit_test.h:206:15:
> warning: redefinition of typedef 'acpi_handle' is a C11 feature
> [-Wtypedef-redefinition]
> typedef void *acpi_handle;
>   ^
> ../include/acpi/actypes.h:424:15: note: previous definition is here
> typedef void *acpi_handle;  /* Actually a ptr to a NS Node */
>   ^
> 1 warning generated.
> 
> The include chain:
> 
> iomap.c ->
> linux/acpi.h ->
> acpi/acpi.h ->
> acpi/actypes.h
> nfit_test.h
> 
> Avoid this by including linux/acpi.h in nfit_test.h, which allows us to
> remove both the typedef and the forward declaration of acpi_object.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/660
> Signed-off-by: Nathan Chancellor 

Reviewed-by: Ira Weiny 

> ---
> 
> I know that every maintainer has their own thing with the number of
> includes in each header file; this issue can be solved in a various
> number of ways, I went with the smallest diff stat. Please solve it in a
> different way if you see fit :)
> 
>  tools/testing/nvdimm/test/nfit_test.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/tools/testing/nvdimm/test/nfit_test.h 
> b/tools/testing/nvdimm/test/nfit_test.h
> index 448d686da8b1..0bf5640f1f07 100644
> --- a/tools/testing/nvdimm/test/nfit_test.h
> +++ b/tools/testing/nvdimm/test/nfit_test.h
> @@ -4,6 +4,7 @@
>   */
>  #ifndef __NFIT_TEST_H__
>  #define __NFIT_TEST_H__
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -202,9 +203,6 @@ struct nd_intel_lss {
>   __u32 status;
>  } __packed;
>  
> -union acpi_object;
> -typedef void *acpi_handle;
> -
>  typedef struct nfit_test_resource *(*nfit_test_lookup_fn)(resource_size_t);
>  typedef union acpi_object *(*nfit_test_evaluate_dsm_fn)(acpi_handle handle,
>const guid_t *guid, u64 rev, u64 func,
> -- 
> 2.23.0
> 


Re: [PATCH V3 07/15] arch/kunmap_atomic: Consolidate duplicate code

2020-05-18 Thread Ira Weiny
On Sun, May 17, 2020 at 09:29:32PM -0700, Guenter Roeck wrote:
> On Sun, May 17, 2020 at 08:49:39PM -0700, Ira Weiny wrote:
> > On Sat, May 16, 2020 at 03:33:06PM -0700, Guenter Roeck wrote:
> > > On Thu, May 07, 2020 at 07:59:55AM -0700, ira.we...@intel.com wrote:
> > > > From: Ira Weiny 
> > > > 
> > > > Every single architecture (including !CONFIG_HIGHMEM) calls...
> > > > 
> > > > pagefault_enable();
> > > > preempt_enable();
> > > > 
> > > > ... before returning from __kunmap_atomic().  Lift this code into the
> > > > kunmap_atomic() macro.
> > > > 
> > > > While we are at it rename __kunmap_atomic() to kunmap_atomic_high() to
> > > > be consistent.
> > > > 
> > > > Reviewed-by: Christoph Hellwig 
> > > > Signed-off-by: Ira Weiny 
> > > 
> > > This patch results in:
> > > 
> > > Starting init: /bin/sh exists but couldn't execute it (error -14)
> > > 
> > > when trying to boot microblazeel:petalogix-ml605 in qemu.
> > 
> > Thanks for the report.  I'm not readily seeing the issue.
> > 
> > Do you have a kernel config?  Specifically is CONFIG_HIGHMEM set?
> > 
> See below. Yes, CONFIG_HIGHMEM is set.
> 
> The scripts used to build and boot the image are at:
> 
> https://github.com/groeck/linux-build-test/tree/master/rootfs/microblazeel

Despite finding the obvious error earlier today I've still been trying to get
this to work.

I had to make some slight modifications to use the 0-day cross compile build
and my local qemu build.  But those were pretty minor modifications.  I'm
running on x86_64 host.

With those slight mods to the scripts I get the following error even without my
patch set on 5.7-rc4.  I have 1 cpu pegged at 100% while it is running...  Is
there anything I can do to get more debug output?  Perhaps I just need to let
it run longer?

Thanks,
Ira

16:46:54 > ../linux-build-test/rootfs/microblazeel/run-qemu-microblazeel.sh 
Build reference: v5.7-rc4-2-g7c2411d7fb6a

Building microblaze:petalogix-s3adsp1800:qemu_microblazeel_defconfig ...
running  failed (silent)

qemu log:
qemu-system-microblazeel: terminating on signal 15 from pid 3277686 (/bin/bash)

Building microblaze:petalogix-ml605:qemu_microblazeel_ml605_defconfig ...
running  failed (silent)

qemu log:
qemu-system-microblazeel: terminating on signal 15 from pid 3277686 (/bin/bash)



16:47:23 > git di
diff --git a/rootfs/microblazeel/run-qemu-microblazeel.sh 
b/rootfs/microblazeel/run-qemu-microblazeel.sh
index 68d4de39ab50..0d6a4f85308f 100755
--- a/rootfs/microblazeel/run-qemu-microblazeel.sh
+++ b/rootfs/microblazeel/run-qemu-microblazeel.sh
@@ -3,7 +3,8 @@
 dir=$(cd $(dirname $0); pwd)
 . ${dir}/../scripts/common.sh
 
-QEMU=${QEMU:-${QEMU_BIN}/qemu-system-microblazeel}
+#QEMU=${QEMU:-${QEMU_BIN}/qemu-system-microblazeel}
+QEMU=/home/iweiny/dev/qemu/microblazeel-softmmu/qemu-system-microblazeel
 PREFIX=microblazeel-linux-
 ARCH=microblaze
 PATH_MICROBLAZE=/opt/kernel/microblazeel/gcc-4.9.1/usr/bin
diff --git a/rootfs/scripts/common.sh b/rootfs/scripts/common.sh
index 8fa6a9be2b2f..c4550a27beaa 100644
--- a/rootfs/scripts/common.sh
+++ b/rootfs/scripts/common.sh
@@ -1,5 +1,9 @@
 #!/bin/bash
 
+# Set up make.cross
+export COMPILER_INSTALL_PATH=$HOME/0day
+export GCC_VERSION=6.5.0
+
 # Set the following variable to true to skip DC395/AM53C97 build tests
 __skip_dc395=0
 
@@ -569,7 +573,7 @@ doclean()
then
git clean -x -d -f -q
else
-   make ARCH=${ARCH} mrproper >/dev/null 2>&1
+   make.cross ARCH=${ARCH} mrproper >/dev/null 2>&1
fi
 }
 
@@ -669,7 +673,7 @@ __setup_config()
cp ${__progdir}/${defconfig} arch/${arch}/configs
 fi
 
-if ! make ARCH=${ARCH} CROSS_COMPILE=${PREFIX} ${defconfig} >/dev/null 
2>&1 /dev/null 2>&1 /dev/null 
2>&1 /dev/null 2>&1 /dev/null 2>${logfile}
+make.cross -j${maxload} ARCH=${ARCH} ${EXTRAS} /dev/null 
2>${logfile}
 rv=$?
 if [ ${rv} -ne 0 ]
 then




Re: [PATCH V3 07/15] arch/kunmap_atomic: Consolidate duplicate code

2020-05-19 Thread Ira Weiny
On Mon, May 18, 2020 at 07:50:36PM -0700, Guenter Roeck wrote:
> Hi Ira,
> 
> On 5/18/20 5:03 PM, Ira Weiny wrote:
> > On Sun, May 17, 2020 at 09:29:32PM -0700, Guenter Roeck wrote:
> >> On Sun, May 17, 2020 at 08:49:39PM -0700, Ira Weiny wrote:
> >>> On Sat, May 16, 2020 at 03:33:06PM -0700, Guenter Roeck wrote:
> >>>> On Thu, May 07, 2020 at 07:59:55AM -0700, ira.we...@intel.com wrote:
> >>>>> From: Ira Weiny 
> >>>>>
> >>>

Sorry for the delay; I missed this email last night...  I blame outlook...  ;-)

...

> >>> Do you have a kernel config?  Specifically is CONFIG_HIGHMEM set?
> >>>
> >> See below. Yes, CONFIG_HIGHMEM is set.
> >>
> >> The scripts used to build and boot the image are at:
> >>
> >> https://github.com/groeck/linux-build-test/tree/master/rootfs/microblazeel
> > 
> > Despite finding the obvious error earlier today I've still been trying to 
> > get
> > this to work.
> > 
> > I had to make some slight modifications to use the 0-day cross compile build
> > and my local qemu build.  But those were pretty minor modifications.  I'm
> > running on x86_64 host.
> > 
> > With those slight mods to the scripts I get the following error even 
> > without my
> > patch set on 5.7-rc4.  I have 1 cpu pegged at 100% while it is running...  
> > Is
> > there anything I can do to get more debug output?  Perhaps I just need to 
> > let
> > it run longer?
> > 
> 
> I don't think so. Try running it with "-d" parameter (run-qemu-microblazeel.sh
> -d petalogix-s3adsp1800); that gives you the qemu command line. Once it says
> "running", abort the script and execute qemu directly.

FYI, minor nit...  a simple copy/paste failed...  the printed command line did
not include quotes around the -append text:

09:06:03 > /home/iweiny/dev/qemu/microblazeel-softmmu/qemu-system-microblazeel
   -M petalogix-s3adsp1800 -m 256 -kernel arch/microblaze/boot/linux.bin
   -no-reboot -initrd /tmp/buildbot-cache/microblazeel/rootfs.cpio -append
   panic=-1 slub_debug=FZPUA rdinit=/sbin/init console=ttyUL0,115200 -monitor
   none -serial stdio -nographic

qemu-system-microblazeel: slub_debug=FZPUA: Could not open 'slub_debug=FZPUA': 
No such file or directory

> Oh, and please update
> the repository; turns out I didn't push for a while and made a number of
> changes.

Cool beans...  I've updated.

> 
> My compiler was compiled with buildroot (a long time ago). I don't recall if
> it needed something special in the configuration, unfortunately.

AFAICT the compile is working...  It is running from the command line now...  I
expected it to be slow so I have also increased the timeouts last night.  So
far it still fails.  I did notice that there is a new 'R' in the wait output.


.R. failed (silent)

qemu log:
qemu-system-microblazeel: terminating on signal 15 from pid 3357146 (/bin/bash)


I was hoping that meant it found qemu 'running', but it looks like that was
just a retry...  :-(

Last night I increased some of the timeouts I could find.


 LOOPTIME=5 # Wait time before checking status
 -MAXTIME=150# Maximum wait time for qemu session to complete
 -MAXSTIME=60# Maximum wait time for qemu session to generate output
 +#MAXTIME=150   # Maximum wait time for qemu session to complete
 +#MAXSTIME=60   # Maximum wait time for qemu session to generate output
 +MAXTIME=300# Maximum wait time for qemu session to complete
 +MAXSTIME=120   # Maximum wait time for qemu session to generate output


But thanks to the qemu command line hint I can see these were not nearly
enough...  (It has been running for > 20 minutes...  and I'm not getting
output...)  Or I've done something really wrong.  Shouldn't qemu be at least
showing something on the terminal by now?  I normally run qemu with different
display options (and my qemu foo is weak) so I'm not sure what I should be
seeing with this command line.

09:06:28 > /home/iweiny/dev/qemu/microblazeel-softmmu/qemu-system-microblazeel
  -M petalogix-s3adsp1800 -m 256 -kernel arch/microblaze/boot/linux.bin
  -no-reboot -initrd /tmp/buildbot-cache/microblazeel/rootfs.cpio -append
  "panic=-1 slub_debug=FZPUA rdinit=/sbin/init console=ttyUL0,115200" -monitor
  none -serial stdio -nographic

Maybe I just have too slow of a machine...  :-/

My qemu was built back in March.  I'm updating that now...

Sorry for being so dense...
Ira


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-19 Thread Ira Weiny
On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > enables when vaddr is not in the fixmap.
> > 
> > Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > Signed-off-by: Ira Weiny 
> 
> microblazeel works with this patch,

Awesome...  Andrew, in my rush yesterday I should have put a Reported-by tag
for Guenter on the patch as well.

Sorry about that Guenter,
Ira

> as do the nosmp sparc32 boot tests,
> but sparc32 boot tests with SMP enabled still fail with lots of messages
> such as:
> 
> BUG: Bad page state in process swapper/0  pfn:006a1
> page:f0933420 refcount:0 mapcount:1 mapping:(ptrval) index:0x1
> flags: 0x0()
> raw:  0100 0122  0001   
> page dumped because: nonzero mapcount
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Tainted: GB 
> 5.7.0-rc6-next-20200518-2-gb178d2d56f29 #1
> [f00e7ab8 :
> bad_page+0xa8/0x108 ]
> [f00e8b54 :
> free_pcppages_bulk+0x154/0x52c ]
> [f00ea024 :
> free_unref_page+0x54/0x6c ]
> [f00ed864 :
> free_reserved_area+0x58/0xec ]
> [f0527104 :
> kernel_init+0x14/0x110 ]
> [f000b77c :
> ret_from_kernel_thread+0xc/0x38 ]
> [ :
> 0x0 ]
> 
> Code path leading to that message is different but always the same
> from free_unref_page().
> 
> Still testing ppc images.
> 
> Guenter


Re: [PATCH 3/9] fs/ext4: Disallow encryption if inode is DAX

2020-05-19 Thread Ira Weiny
On Mon, May 18, 2020 at 09:24:47AM -0700, Eric Biggers wrote:
> On Sun, May 17, 2020 at 10:03:15PM -0700, Ira Weiny wrote:

First off...  OMG...

I'm seeing some possible user pitfalls which are complicating things, IMO.  It
probably does not matter because most users don't care: they have either
enabled DAX on _every_ mount or on _no_ mount, and have _not_ used verity or
encryption while using DAX.

Verity is a bit easier because verity is not inherited and we only need to
protect against setting it if DAX is on.

However, it can be confusing for the user, as follows:

1) mount _without_ DAX
2) enable verity on individual inodes
3) unmount/mount _with_ DAX

Now the verity files are not enabled for DAX, with no indication to the user...
This is still true with my patch, but at least it closes the hole of trying to
change the DAX flag after the fact (because verity was set).

Also, both this check and the verity check need to be maintained to keep the
mount option working as it did before...

For encryption it is more complicated because encryption can be set on
directories and inherited, so the IS_DAX() check does nothing when '-o dax' is
used.  Therefore users can:

1) mount _with_ DAX
2) enable encryption on a directory
3) files created in that directory will not have DAX set

And I now understand why the WARN_ON() was there...  To tell users about this
craziness.

...

> > This is, AFAICS, not going to affect correctness.  It will only be confusing
> > because the user will be able to set both DAX and encryption on the 
> > directory
> > but files there will only see encryption being used...  :-(
> > 
> > Assuming you are correct about this call path only being valid on 
> > directories.
> > It seems this IS_DAX() needs to be changed to check for EXT4_DAX_FL in
> > "fs/ext4: Introduce DAX inode flag"?  Then at that point we can prevent DAX 
> > and
> > encryption on a directory.  ...  and at this point IS_DAX() could be 
> > removed at
> > this point in the series???
> 
> I haven't read the whole series, but if you are indeed trying to prevent a
> directory with EXT4_DAX_FL from being encrypted, then it does look like you'd
> need to check EXT4_DAX_FL, not S_DAX.
> 
> The other question is what should happen when a file is created in an 
> encrypted
> directory when the filesystem is mounted with -o dax.  Actually, I think I
> missed something there.  Currently (based on reading the code) the DAX flag 
> will
> get set first, and then ext4_set_context() will see IS_DAX() && i_size == 0 
> and
> clear the DAX flag when setting the encrypt flag.

I think you are correct.

>
> So, the i_size == 0 check is actually needed.
> Your patch (AFAICS) just makes creating an encrypted file fail
> when '-o dax'.  Is that intended?

Yes, that is what I intended, but I see now that it is more complicated.

The intent is that IS_DAX() should _never_ be true on an encrypted or verity
file...  even if -o dax is specified...  because IS_DAX() should be the result
of checking the inode flags.  The order in which those flags are set is a bit
odd for the encrypted case.  I don't really like that DAX is set and then
un-set.  It is convoluted, but I'm not clear right now on how to fix it.
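
To spell out the ordering I am complaining about, here is a paraphrase of the
creation path as I read it (a sketch of my understanding, not the literal
upstream code; the call names are from my reading of the current tree):

	/*
	 * File created in an encrypted directory on an '-o dax' mount:
	 *
	 *   __ext4_new_inode()
	 *     ext4_set_inode_flags(inode);      // '-o dax' => S_DAX set
	 *     fscrypt_inherit_context(dir, inode, handle, ...)
	 *       -> ext4_set_context()
	 *            ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
	 *            ext4_set_inode_flags(inode); // flags re-derived:
	 *                                         // S_ENCRYPTED on,
	 *                                         // S_DAX dropped again
	 */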

> If not, maybe you should change it to check
> S_NEW instead of i_size == 0 to make it clearer?

The patch is completely unnecessary.

It is much easier to make (EXT4_ENCRYPT_FL | EXT4_VERITY_FL) incompatible with
EXT4_DAX_FL when the latter is introduced later in the series.  Furthermore,
this mutual exclusion can be enforced on directories in the encrypt case,
which I think will be nicer for the user: they get an error when trying to set
one flag while the other is set.
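
Roughly what I have in mind for the setflags path (a rough sketch only; the
exact placement and the error code are my guesses, not the final patch):

	/* In ext4_ioctl_setflags(), once the requested flags are known:
	 * refuse to combine DAX with encrypt or verity.
	 */
	if ((flags & EXT4_DAX_FL) &&
	    (oldflags & (EXT4_ENCRYPT_FL | EXT4_VERITY_FL))) {
		err = -EOPNOTSUPP;
		goto flags_out;
	}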

Ira



Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-19 Thread Ira Weiny
On Tue, May 19, 2020 at 12:42:15PM -0700, Guenter Roeck wrote:
> On Tue, May 19, 2020 at 11:40:32AM -0700, Ira Weiny wrote:
> > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> > > On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > > > From: Ira Weiny 
> > > > 
> > > > The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > > > enables when vaddr is not in the fixmap.
> > > > 
> > > > Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > > > Signed-off-by: Ira Weiny 
> > > 
> > > microblazeel works with this patch,
> > 
> > Awesome...  Andrew in my rush yesterday I should have put a reported by on 
> > the
> > patch for Guenter as well.
> > 
> > Sorry about that Guenter,
> 
> No worries.
> 
> > Ira
> > 
> > > as do the nosmp sparc32 boot tests,
> > > but sparc32 boot tests with SMP enabled still fail with lots of messages
> > > such as:
> > > 
> > > BUG: Bad page state in process swapper/0  pfn:006a1
> > > page:f0933420 refcount:0 mapcount:1 mapping:(ptrval) index:0x1
> > > flags: 0x0()
> > > raw:  0100 0122  0001   
> > > 
> > > page dumped because: nonzero mapcount
> > > Modules linked in:
> > > CPU: 0 PID: 1 Comm: swapper/0 Tainted: GB 
> > > 5.7.0-rc6-next-20200518-2-gb178d2d56f29 #1
> > > [f00e7ab8 :
> > > bad_page+0xa8/0x108 ]
> > > [f00e8b54 :
> > > free_pcppages_bulk+0x154/0x52c ]
> > > [f00ea024 :
> > > free_unref_page+0x54/0x6c ]
> > > [f00ed864 :
> > > free_reserved_area+0x58/0xec ]
> > > [f0527104 :
> > > kernel_init+0x14/0x110 ]
> > > [f000b77c :
> > > ret_from_kernel_thread+0xc/0x38 ]
> > > [ :
> > > 0x0 ]

I'm really not seeing how this is related to the kmap clean up.

But just to make sure, I'm trying to run your environment for sparc, and I'm
having less luck than with microblaze.

Could you give me the command which is failing above?

Ira

> > > 
> > > Code path leading to that message is different but always the same
> > > from free_unref_page().
> > > 
> > > Still testing ppc images.
> > > 
> 
> ppc image tests are passing with this patch.
> 
> Guenter


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-19 Thread Ira Weiny
On Tue, May 19, 2020 at 12:42:15PM -0700, Guenter Roeck wrote:
> On Tue, May 19, 2020 at 11:40:32AM -0700, Ira Weiny wrote:
> > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> > > On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > > > From: Ira Weiny 
> > > > 
> > > > The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > > > enables when vaddr is not in the fixmap.
> > > > 
> > > > Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > > > Signed-off-by: Ira Weiny 
> > > 
> > > microblazeel works with this patch,
> > 
> > Awesome...  Andrew in my rush yesterday I should have put a reported by on 
> > the
> > patch for Guenter as well.
> > 
> > Sorry about that Guenter,
> 
> No worries.
> 
> > Ira
> > 
> > > as do the nosmp sparc32 boot tests,
> > > but sparc32 boot tests with SMP enabled still fail with lots of messages
> > > such as:
> > > 
> > > BUG: Bad page state in process swapper/0  pfn:006a1
> > > page:f0933420 refcount:0 mapcount:1 mapping:(ptrval) index:0x1
> > > flags: 0x0()
> > > raw:  0100 0122  0001   
> > > 
> > > page dumped because: nonzero mapcount
> > > Modules linked in:
> > > CPU: 0 PID: 1 Comm: swapper/0 Tainted: GB 
> > > 5.7.0-rc6-next-20200518-2-gb178d2d56f29 #1
> > > [f00e7ab8 :
> > > bad_page+0xa8/0x108 ]
> > > [f00e8b54 :
> > > free_pcppages_bulk+0x154/0x52c ]
> > > [f00ea024 :
> > > free_unref_page+0x54/0x6c ]
> > > [f00ed864 :
> > > free_reserved_area+0x58/0xec ]
> > > [f0527104 :
> > > kernel_init+0x14/0x110 ]
> > > [f000b77c :
> > > ret_from_kernel_thread+0xc/0x38 ]
> > > [ :
> > > 0x0 ]
> > > 
> > > Code path leading to that message is different but always the same
> > > from free_unref_page().

Actually, it occurs to me that the patch consolidating kmap_prot is odd for
32-bit sparc...

It's a long shot, but could you try reverting this patch?

4ea7d2419e3f kmap: consolidate kmap_prot definitions

Alternatively, I will need to figure out how to run sparc on qemu here...

Thanks very much for all the testing though!  :-D

Ira

> > > 
> > > Still testing ppc images.
> > > 
> 
> ppc image tests are passing with this patch.
> 
> Guenter


[PATCH V3 3/8] fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS

2020-05-19 Thread ira . weiny
From: Ira Weiny 

Rename EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS in preparation for the new
tri-state mount option, which will introduce EXT4_MOUNT_DAX_NEVER.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes:
New patch
---
 fs/ext4/ext4.h  |  4 ++--
 fs/ext4/inode.c |  2 +-
 fs/ext4/super.c | 12 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 91eb4381cae5..1a3daf2d18ef 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1123,9 +1123,9 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_MINIX_DF0x00080 /* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD  0x00100 /* Don't use existing journal*/
 #ifdef CONFIG_FS_DAX
-#define EXT4_MOUNT_DAX 0x00200 /* Direct Access */
+#define EXT4_MOUNT_DAX_ALWAYS  0x00200 /* Direct Access */
 #else
-#define EXT4_MOUNT_DAX 0
+#define EXT4_MOUNT_DAX_ALWAYS  0
 #endif
 #define EXT4_MOUNT_DATA_FLAGS  0x00C00 /* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA0x00400 /* Write data to 
journal */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a4aae6acdcb..a10ff12194db 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4400,7 +4400,7 @@ int ext4_get_inode_loc(struct inode *inode, struct 
ext4_iloc *iloc)
 
 static bool ext4_should_use_dax(struct inode *inode)
 {
-   if (!test_opt(inode->i_sb, DAX))
+   if (!test_opt(inode->i_sb, DAX_ALWAYS))
return false;
if (!S_ISREG(inode->i_mode))
return false;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bf5fcb477f66..7b99c44d0a91 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1775,7 +1775,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
-   {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
+   {Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -3982,7 +3982,7 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
 "both data=journal and dioread_nolock");
goto failed_mount;
}
-   if (test_opt(sb, DAX)) {
+   if (test_opt(sb, DAX_ALWAYS)) {
ext4_msg(sb, KERN_ERR, "can't mount with "
 "both data=journal and dax");
goto failed_mount;
@@ -4092,7 +4092,7 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
goto failed_mount;
}
 
-   if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+   if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) {
if (ext4_has_feature_inline_data(sb)) {
ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
" that may contain inline data");
@@ -5412,7 +5412,7 @@ static int ext4_remount(struct super_block *sb, int 
*flags, char *data)
err = -EINVAL;
goto restore_opts;
}
-   if (test_opt(sb, DAX)) {
+   if (test_opt(sb, DAX_ALWAYS)) {
ext4_msg(sb, KERN_ERR, "can't mount with "
 "both data=journal and dax");
err = -EINVAL;
@@ -5433,10 +5433,10 @@ static int ext4_remount(struct super_block *sb, int 
*flags, char *data)
goto restore_opts;
}
 
-   if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+   if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX_ALWAYS) {
ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
"dax flag with busy inodes while remounting");
-   sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
+   sbi->s_mount_opt ^= EXT4_MOUNT_DAX_ALWAYS;
}
 
if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
2.25.1



[PATCH V3 7/8] fs/ext4: Introduce DAX inode flag

2020-05-19 Thread ira . weiny
From: Ira Weiny 

Add a flag to preserve FS_XFLAG_DAX in the ext4 inode.

Make the flag user visible and changeable, and make it inheritable.  Allow
applications to change the flag at any time except when VERITY or ENCRYPT is
set.

Disallow setting VERITY or ENCRYPT if DAX is set.

Finally, on regular files, flag the inode to not be cached to facilitate
changing S_DAX on the next creation of the inode.
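
For reference, an application would toggle the flag through the existing
fsxattr interface along these lines (an illustrative userspace sketch, not
part of this patch; per the above, the resulting S_DAX change is only seen
once the inode is reloaded):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		struct fsxattr fsx;
		int fd;

		if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
			return 1;
		if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
			return 1;
		fsx.fsx_xflags |= FS_XFLAG_DAX;	/* request DAX on this inode */
		if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
			return 1;
		return 0;
	}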

Signed-off-by: Ira Weiny 

---
Change from V2:
Add in making verity and DAX exclusive.
'Squash' in making encryption and DAX exclusive.
Add in EXT4_INODE_DAX flag definition to be compatible with
ext4_[set|test]_inode_flag() bit operations
Use ext4_[set|test]_inode_flag() bit operations to be consistent
with other code.

Change from V0:
Add FS_DAX_FL to include/uapi/linux/fs.h
to be consistent
Move ext4_dax_dontcache() to ext4_ioctl_setflags()
This ensures that it is only set when the flags are going to be
set and not if there is an error
Also this sets don't cache in the FS_IOC_SETFLAGS case

Change from RFC:
use new d_mark_dontcache()
Allow caching if ALWAYS/NEVER is set
Rebased to latest Linus master
Change flag to unused 0x0100
update ext4_should_enable_dax()
---
 fs/ext4/ext4.h  | 14 ++
 fs/ext4/inode.c |  2 +-
 fs/ext4/ioctl.c | 34 +-
 fs/ext4/super.c |  3 +++
 fs/ext4/verity.c|  2 +-
 include/uapi/linux/fs.h |  1 +
 6 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6235440e4c39..467c30a789b6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -415,13 +415,16 @@ struct flex_groups {
 #define EXT4_VERITY_FL 0x0010 /* Verity protected inode */
 #define EXT4_EA_INODE_FL   0x0020 /* Inode used for large EA */
 /* 0x0040 was formerly EXT4_EOFBLOCKS_FL */
+
+#define EXT4_DAX_FL0x0100 /* Inode is DAX */
+
 #define EXT4_INLINE_DATA_FL0x1000 /* Inode has inline data. */
 #define EXT4_PROJINHERIT_FL0x2000 /* Create with parents 
projid */
 #define EXT4_CASEFOLD_FL   0x4000 /* Casefolded file */
 #define EXT4_RESERVED_FL   0x8000 /* reserved for ext4 lib */
 
-#define EXT4_FL_USER_VISIBLE   0x705BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE0x604BC0FF /* User modifiable 
flags */
+#define EXT4_FL_USER_VISIBLE   0x715BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE0x614BC0FF /* User modifiable 
flags */
 
 /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
 #define EXT4_FL_XFLAG_VISIBLE  (EXT4_SYNC_FL | \
@@ -429,14 +432,16 @@ struct flex_groups {
 EXT4_APPEND_FL | \
 EXT4_NODUMP_FL | \
 EXT4_NOATIME_FL | \
-EXT4_PROJINHERIT_FL)
+EXT4_PROJINHERIT_FL | \
+EXT4_DAX_FL)
 
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
-  EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)
+  EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL |\
+  EXT4_DAX_FL)
 
 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | EXT4_CASEFOLD_FL 
|\
@@ -488,6 +493,7 @@ enum {
EXT4_INODE_VERITY   = 20,   /* Verity protected inode */
EXT4_INODE_EA_INODE = 21,   /* Inode used for large EA */
 /* 22 was formerly EXT4_INODE_EOFBLOCKS */
+   EXT4_INODE_DAX  = 24,   /* Inode is DAX */
EXT4_INODE_INLINE_DATA  = 28,   /* Data in inode. */
EXT4_INODE_PROJINHERIT  = 29,   /* Create with parents projid */
EXT4_INODE_RESERVED = 31,   /* reserved for ext4 lib */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 140b1930e2f4..ae61db8b8bae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4418,7 +4418,7 @@ static bool ext4_should_enable_dax(struct inode *inode)
if (test_opt(inode->i_sb, DAX_ALWAYS))
return true;
 
-   return false;
+   return ext4_test_inode_flag(inode, EXT4_INODE_DAX);
 }
 
 void ext4_set_inode_flags(struct inode *inode, bool init)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 145083e8c

[PATCH V3 2/8] fs/ext4: Disallow verity if inode is DAX

2020-05-19 Thread ira . weiny
From: Ira Weiny 

Verity and DAX are incompatible.  Changing the DAX mode due to a verity
flag change is wrong without a corresponding address_space_operations
update.

Make the 2 options mutually exclusive by returning an error if DAX was
set first.

(Setting DAX is already disabled if Verity is set first.)

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes from V2:
Remove Section title 'Verity and DAX'

Changes:
remove WARN_ON_ONCE
Add documentation for DAX/Verity exclusivity
---
 Documentation/filesystems/ext4/verity.rst | 3 +++
 fs/ext4/verity.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/filesystems/ext4/verity.rst 
b/Documentation/filesystems/ext4/verity.rst
index 3e4c0ee0e068..e99ff3fd09f7 100644
--- a/Documentation/filesystems/ext4/verity.rst
+++ b/Documentation/filesystems/ext4/verity.rst
@@ -39,3 +39,6 @@ is encrypted as well as the data itself.
 
 Verity files cannot have blocks allocated past the end of the verity
 metadata.
+
+Verity and DAX are not compatible and attempts to set both of these flags
+on a file will fail.
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dc5ec724d889..f05a09fb2ae4 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -113,6 +113,9 @@ static int ext4_begin_enable_verity(struct file *filp)
handle_t *handle;
int err;
 
+   if (IS_DAX(inode))
+   return -EINVAL;
+
if (ext4_verity_in_progress(inode))
return -EBUSY;
 
-- 
2.25.1



[PATCH V3 8/8] Documentation/dax: Update DAX enablement for ext4

2020-05-19 Thread ira . weiny
From: Ira Weiny 

Update the document to reflect that ext4 and xfs now behave the same.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes from RFC:
Update with ext2 text...
---
 Documentation/filesystems/dax.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/dax.txt 
b/Documentation/filesystems/dax.txt
index 735fb4b54117..265c4f808dbf 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -25,7 +25,7 @@ size when creating the filesystem.
 Currently 3 filesystems support DAX: ext2, ext4 and xfs.  Enabling DAX on them
 is different.
 
-Enabling DAX on ext4 and ext2
+Enabling DAX on ext2
 -----------------------------
 
 When mounting the filesystem, use the "-o dax" option on the command line or
@@ -33,8 +33,8 @@ add 'dax' to the options in /etc/fstab.  This works to enable 
DAX on all files
 within the filesystem.  It is equivalent to the '-o dax=always' behavior below.
 
 
-Enabling DAX on xfs
--------------------
+Enabling DAX on xfs and ext4
+----------------------------
 
 Summary
 ---
-- 
2.25.1



[PATCH V3 0/8] Enable ext4 support for per-file/directory DAX operations

2020-05-19 Thread ira . weiny
From: Ira Weiny 

Changes from V2:
Rework DAX exclusivity with verity and encryption based on feedback
from Eric

Enable the same per file DAX support in ext4 as was done for xfs.  This series
builds and depends on the V11 series for xfs.[1]

This passes the same xfstests test as XFS.

The only issue is that this modifies the old mount option parsing code rather
than waiting for the new parsing code to be finalized.

This series starts with 3 fixes, which include making Verity and Encrypt truly
mutually exclusive with DAX.  I think these first 3 patches should be picked up
for 5.8 regardless of what is decided regarding the mount parsing.

[1] https://lore.kernel.org/lkml/20200428002142.404144-1-ira.we...@intel.com/

To: linux-kernel@vger.kernel.org
Cc: "Darrick J. Wong" 
Cc: Dan Williams 
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: "Theodore Y. Ts'o" 
Cc: Jan Kara 
Cc: linux-e...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org


Ira Weiny (8):
  fs/ext4: Narrow scope of DAX check in setflags
  fs/ext4: Disallow verity if inode is DAX
  fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
  fs/ext4: Update ext4_should_use_dax()
  fs/ext4: Only change S_DAX on inode load
  fs/ext4: Make DAX mount option a tri-state
  fs/ext4: Introduce DAX inode flag
  Documentation/dax: Update DAX enablement for ext4

 Documentation/filesystems/dax.txt |  6 +-
 Documentation/filesystems/ext4/verity.rst |  3 +
 fs/ext4/ext4.h| 22 +--
 fs/ext4/ialloc.c  |  2 +-
 fs/ext4/inode.c   | 25 +--
 fs/ext4/ioctl.c   | 41 ++--
 fs/ext4/super.c   | 80 ++-
 fs/ext4/verity.c  |  5 +-
 include/uapi/linux/fs.h   |  1 +
 9 files changed, 148 insertions(+), 37 deletions(-)

-- 
2.25.1



[PATCH V3 5/8] fs/ext4: Only change S_DAX on inode load

2020-05-19 Thread ira . weiny
From: Ira Weiny 

To prevent complications with in-memory inodes, we only set S_DAX on inode
load.  FS_XFLAG_DAX can be changed at any time, and S_DAX will change after
inode eviction and reload.

Add init bool to ext4_set_inode_flags() to indicate if the inode is
being newly initialized.

Assert that S_DAX is not set on an inode which is just being loaded.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes from V2:
Rework based on moving the encryption patch to the end.

Changes from RFC:
Change J_ASSERT() to WARN_ON_ONCE()
Fix bug which would clear S_DAX incorrectly
---
 fs/ext4/ext4.h   |  2 +-
 fs/ext4/ialloc.c |  2 +-
 fs/ext4/inode.c  | 13 ++---
 fs/ext4/ioctl.c  |  3 ++-
 fs/ext4/super.c  |  4 ++--
 fs/ext4/verity.c |  2 +-
 6 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1a3daf2d18ef..86a0994332ce 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2692,7 +2692,7 @@ extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
 extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
-extern void ext4_set_inode_flags(struct inode *);
+extern void ext4_set_inode_flags(struct inode *, bool init);
 extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4b8c9a9bdf0c..7941c140723f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1116,7 +1116,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct 
inode *dir,
ei->i_block_group = group;
ei->i_last_alloc_group = ~0;
 
-   ext4_set_inode_flags(inode);
+   ext4_set_inode_flags(inode, true);
if (IS_DIRSYNC(inode))
ext4_handle_sync(handle);
if (insert_inode_locked(inode) < 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d3a4c2ed7a1c..23e42a223235 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4419,11 +4419,13 @@ static bool ext4_should_enable_dax(struct inode *inode)
return false;
 }
 
-void ext4_set_inode_flags(struct inode *inode)
+void ext4_set_inode_flags(struct inode *inode, bool init)
 {
unsigned int flags = EXT4_I(inode)->i_flags;
unsigned int new_fl = 0;
 
+   WARN_ON_ONCE(IS_DAX(inode) && init);
+
if (flags & EXT4_SYNC_FL)
new_fl |= S_SYNC;
if (flags & EXT4_APPEND_FL)
@@ -4434,8 +4436,13 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (ext4_should_enable_dax(inode))
+
+   /* Because of the way inode_set_flags() works we must preserve S_DAX
+* here if already set. */
+   new_fl |= (inode->i_flags & S_DAX);
+   if (init && ext4_should_enable_dax(inode))
new_fl |= S_DAX;
+
if (flags & EXT4_ENCRYPT_FL)
new_fl |= S_ENCRYPTED;
if (flags & EXT4_CASEFOLD_FL)
@@ -4649,7 +4656,7 @@ struct inode *__ext4_iget(struct super_block *sb, 
unsigned long ino,
 * not initialized on a new filesystem. */
}
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
-   ext4_set_inode_flags(inode);
+   ext4_set_inode_flags(inode, true);
inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
if (ext4_has_feature_64bit(sb))
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 5813e5e73eab..145083e8cd1e 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -381,7 +381,8 @@ static int ext4_ioctl_setflags(struct inode *inode,
ext4_clear_inode_flag(inode, i);
}
 
-   ext4_set_inode_flags(inode);
+   ext4_set_inode_flags(inode, false);
+
inode->i_ctime = current_time(inode);
 
err = ext4_mark_iloc_dirty(handle, inode, );
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7b99c44d0a91..3cb9b48d3cc4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1348,7 +1348,7 @@ static int ext4_set_context(struct inode *inode, const 
void *ctx, size_t len,
 * Update inode->i_flags - S_ENCRYPTED will be enabled,
 * S_DAX may be disabled
 */
-   ext4_set_inode_flags(inode);
+   ext4_set_inode_flags(inode, false);
}
return res;
}
@@ -1375,7 +1375,7 @@ static int ext4_set_context(struct inode *inode, const 
void *ctx, size_t len,
 * Update inode->i_flags - S_ENCRYPTED will be enabled,
 * S_DAX may be disabled
 */
-   ext4_set_inode_flags(inode);
+   

[PATCH V3 6/8] fs/ext4: Make DAX mount option a tri-state

2020-05-19 Thread ira . weiny
From: Ira Weiny 

We add 'always', 'never', and 'inode' (default).  '-o dax' continues to
operate the same and is equivalent to 'always'.  This new functionality is
limited to ext4 only.

Specifically, we introduce a second DAX mount flag, EXT4_MOUNT2_DAX_NEVER, and
set it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode.

We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX.

Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user
specified that option for printing.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes from V1:
Fix up mounting options to only show an option if specified
Fix remount to prevent dax changes
Isolate behavior to ext4 only

Changes from RFC:
Combine remount check for DAX_NEVER with DAX_ALWAYS
Update ext4_should_enable_dax()
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/inode.c |  2 ++
 fs/ext4/super.c | 67 +
 3 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 86a0994332ce..6235440e4c39 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1168,6 +1168,8 @@ struct ext4_inode_info {
  blocks */
 #define EXT4_MOUNT2_HURD_COMPAT0x0004 /* Support 
HURD-castrated
  file systems */
+#define EXT4_MOUNT2_DAX_NEVER  0x0008 /* Do not allow Direct 
Access */
+#define EXT4_MOUNT2_DAX_INODE  0x0010 /* For printing options only 
*/
 
 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM  0x0008 /* User explicitly
specified journal checksum */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23e42a223235..140b1930e2f4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4400,6 +4400,8 @@ int ext4_get_inode_loc(struct inode *inode, struct 
ext4_iloc *iloc)
 
 static bool ext4_should_enable_dax(struct inode *inode)
 {
+   if (test_opt2(inode->i_sb, DAX_NEVER))
+   return false;
if (!S_ISREG(inode->i_mode))
return false;
if (ext4_should_journal_data(inode))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cb9b48d3cc4..5ba65eb0e2ef 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1512,7 +1512,8 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-   Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version, Opt_dax,
+   Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
+   Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
Opt_nowarn_on_error, Opt_mblk_io_submit,
Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
@@ -1579,6 +1580,9 @@ static const match_table_t tokens = {
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
{Opt_dax, "dax"},
+   {Opt_dax_always, "dax=always"},
+   {Opt_dax_inode, "dax=inode"},
+   {Opt_dax_never, "dax=never"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_warn_on_error, "warn_on_error"},
@@ -1726,6 +1730,7 @@ static int clear_qf_name(struct super_block *sb, int 
qtype)
 #define MOPT_NO_EXT3   0x0200
 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING0x0400
+#define MOPT_SKIP  0x0800
 
 static const struct mount_opts {
int token;
@@ -1775,7 +1780,13 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
-   {Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET},
+   {Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET | MOPT_SKIP},
+   {Opt_dax_always, EXT4_MOUNT_DAX_ALWAYS,
+   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
+   {Opt_dax_inode, EXT4_MOUNT2_DAX_INODE,
+   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
+   {Opt_dax_never, EXT4_MOUNT2_DAX_NEVER,
+   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -2084,13 +2095,32 @@ static int handle_mount_opt(struct super_block *sb, 
char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
 #endif
-   } else if (token == Opt_dax) {
+   } else if (token == Opt_dax || token == Opt_dax_always ||
+  token == Opt_dax_inode || token == Opt_dax_never) {
 #ifdef CONFIG_FS_DAX
-   ext4_msg(sb, KERN_WARNING,
-   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
-   sbi->s_mount_opt 

[PATCH V3 4/8] fs/ext4: Update ext4_should_use_dax()

2020-05-19 Thread ira . weiny
From: Ira Weiny 

S_DAX should only be enabled when the underlying block device supports
dax.

Change ext4_should_use_dax() to check for device support before checking the
overriding mount option.

While we are at it, rename the function to ext4_should_enable_dax(), as this
better reflects the intent and matches xfs.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 

---
Changes from RFC
Change function name to 'should enable'
Clean up bool conversion
Reorder this for better bisect-ability
---
 fs/ext4/inode.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a10ff12194db..d3a4c2ed7a1c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4398,10 +4398,8 @@ int ext4_get_inode_loc(struct inode *inode, struct 
ext4_iloc *iloc)
!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
 }
 
-static bool ext4_should_use_dax(struct inode *inode)
+static bool ext4_should_enable_dax(struct inode *inode)
 {
-   if (!test_opt(inode->i_sb, DAX_ALWAYS))
-   return false;
if (!S_ISREG(inode->i_mode))
return false;
if (ext4_should_journal_data(inode))
@@ -4412,7 +4410,13 @@ static bool ext4_should_use_dax(struct inode *inode)
return false;
if (ext4_test_inode_flag(inode, EXT4_INODE_VERITY))
return false;
-   return true;
+   if (!bdev_dax_supported(inode->i_sb->s_bdev,
+   inode->i_sb->s_blocksize))
+   return false;
+   if (test_opt(inode->i_sb, DAX_ALWAYS))
+   return true;
+
+   return false;
 }
 
 void ext4_set_inode_flags(struct inode *inode)
@@ -4430,7 +4434,7 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (ext4_should_use_dax(inode))
+   if (ext4_should_enable_dax(inode))
new_fl |= S_DAX;
if (flags & EXT4_ENCRYPT_FL)
new_fl |= S_ENCRYPTED;
-- 
2.25.1



[PATCH V3 1/8] fs/ext4: Narrow scope of DAX check in setflags

2020-05-19 Thread ira . weiny
From: Ira Weiny 

When preventing DAX and journaling on an inode, use the effective DAX check
rather than the mount option.

This will be required to support per inode DAX flags.

Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
---
 fs/ext4/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index bfc1281fc4cb..5813e5e73eab 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -393,9 +393,9 @@ static int ext4_ioctl_setflags(struct inode *inode,
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
/*
 * Changes to the journaling mode can cause unsafe changes to
-* S_DAX if we are using the DAX mount option.
+* S_DAX if the inode is DAX
 */
-   if (test_opt(inode->i_sb, DAX)) {
+   if (IS_DAX(inode)) {
err = -EBUSY;
goto flags_out;
}
-- 
2.25.1



Re: [RESEND PATCH v7 3/5] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-05-20 Thread Ira Weiny
On Wed, May 20, 2020 at 12:30:56AM +0530, Vaibhav Jain wrote:
> Implement support for fetching nvdimm health information via
> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
> of 64-bit big-endian integers, bitwise-and of which is then stored in
> 'struct papr_scm_priv' and subsequently partially exposed to
> user-space via newly introduced dimm specific attribute
> 'papr/flags'. Since the hcall is costly, the health information is
> cached and only re-queried, 60s after the previous successful hcall.
> 
> The patch also adds documentation describing the flags reported by the new
> sysfs attribute 'papr/flags', introduced at
> Documentation/ABI/testing/sysfs-bus-papr-scm.
> 
> [1] commit 58b278f568f0 ("powerpc: Provide initial documentation for
> PAPR hcalls")
> 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: "Aneesh Kumar K . V" 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
> 
> Resend:
> * None
> 
> v6..v7 :
> * Used the exported buf_seq_printf() function to generate content for
>   'papr/flags'
> * Moved the PAPR_SCM_DIMM_* bit-flags macro definitions to papr_scm.c
>   and removed the papr_scm.h file [Mpe]
> * Some minor consistency issued in sysfs-bus-papr-scm
>   documentation. [Mpe]
> * s/dimm_mutex/health_mutex/g [Mpe]
> * Split drc_pmem_query_health() into two function one of which takes
>   care of caching and locking. [Mpe]
> * Fixed a local copy creation of dimm health information using
>   READ_ONCE(). [Mpe]
> 
> v5..v6 :
> * Change the flags sysfs attribute from 'papr_flags' to 'papr/flags'
>   [Dan Williams]
> * Include documentation for 'papr/flags' attr [Dan Williams]
> * Change flag 'save_fail' to 'flush_fail' [Dan Williams]
> * Caching of health bitmap to reduce expensive hcalls [Dan Williams]
> * Removed usage of PPC_BIT from 'papr-scm.h' header [Mpe]
> * Replaced two __be64 integers from papr_scm_priv to a single u64
>   integer [Mpe]
> * Updated patch description to reflect the changes made in this
>   version.
> * Removed avoidable usage of 'papr_scm_priv.dimm_mutex' from
>   flags_show() [Dan Williams]
> 
> v4..v5 :
> * None
> 
> v3..v4 :
> * None
> 
> v2..v3 :
> * Removed PAPR_SCM_DIMM_HEALTH_NON_CRITICAL as a condition for
>NVDIMM unarmed [Aneesh]
> 
> v1..v2 :
> * New patch in the series.
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-scm |  27 +++
>  arch/powerpc/platforms/pseries/papr_scm.c| 169 ++-
>  2 files changed, 194 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-papr-scm
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-scm 
> b/Documentation/ABI/testing/sysfs-bus-papr-scm
> new file mode 100644
> index ..6143d06072f1
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-scm
> @@ -0,0 +1,27 @@
> +What:/sys/bus/nd/devices/nmemX/papr/flags
> +Date:Apr, 2020
> +KernelVersion:   v5.8
> +Contact: linuxppc-dev , 
> linux-nvd...@lists.01.org,
> +Description:
> + (RO) Report flags indicating various states of a
> + papr-scm NVDIMM device. Each flag maps to a one or
> + more bits set in the dimm-health-bitmap retrieved in
> + response to H_SCM_HEALTH hcall. The details of the bit
> + flags returned in response to this hcall is available
> + at 'Documentation/powerpc/papr_hcalls.rst' . Below are
> + the flags reported in this sysfs file:
> +
> + * "not_armed"   : Indicates that NVDIMM contents will not
> +   survive a power cycle.
> + * "flush_fail"  : Indicates that NVDIMM contents
> +   couldn't be flushed during last
> +   shut-down event.
> + * "restore_fail": Indicates that NVDIMM contents
> +   couldn't be restored during NVDIMM
> +   initialization.
> + * "encrypted"   : NVDIMM contents are encrypted.
> + * "smart_notify": There is health event for the NVDIMM.
> + * "scrubbed": Indicating that contents of the
> +   NVDIMM have been scrubbed.
> + * "locked"  : Indicating that NVDIMM contents cant
> +   be modified until next power cycle.
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index f35592423380..142636e1a59f 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -22,6 +23,44 @@
>(1ul << ND_CMD_GET_CONFIG_DATA) | \
>(1ul << ND_CMD_SET_CONFIG_DATA))
>  
> +/* DIMM health bitmap bitmap indicators */
> +/* SCM device is unable to persist memory contents */
> +#define PAPR_SCM_DIMM_UNARMED 
