Re: [PATCH v5 06/30] arm64: context switch POR_EL0 register

2024-09-11 Thread Dave Hansen
On 9/11/24 08:01, Kevin Brodsky wrote:
> On 22/08/2024 17:10, Joey Gouly wrote:
>> @@ -371,6 +382,9 @@ int copy_thread(struct task_struct *p, const struct 
>> kernel_clone_args *args)
>>  if (system_supports_tpidr2())
>>  p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
>>  
>> +if (system_supports_poe())
>> +p->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
> Here we are only reloading POR_EL0's value if the target is a user
> thread. However, as this series stands, POR_EL0 is also relevant to
> kthreads, because any uaccess or GUP done from a kthread will also be
> checked against POR_EL0. This is especially important in cases like the
> io_uring kthread, which accesses the memory of the user process that
> spawned it. To prevent such a kthread from inheriting a stale value of
> POR_EL0, it seems that we should reload POR_EL0's value in all cases
> (user and kernel thread).

The problem with this is trying to figure out which POR_EL0 to use.  The
kthread could have been spawned ages ago and might have a POR_EL0 that
is very different from the current value of any of the threads in the
process right now.

There's also no great way for a kthread to reach out and grab an updated
value.  It's all completely inherently racy.

> Other approaches could also be considered (e.g. resetting POR_EL0 to
> unrestricted when creating a kthread), see my reply on v4 [1].

I kinda think this is the only way to go.  It's the only sensible,
predictable way.  I _think_ it's what x86 will end up doing with PKRU,
but there's been enough churn there that I'd need to go double check
what happens in practice.
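
Roughly, something like this in copy_thread() (completely untested
sketch -- "POR_EL0_INIT" stands in for whatever unrestricted/default
value the series defines, and the kthread check is illustrative):

	/*
	 * Sketch only: reset POR_EL0 for kernel threads instead of letting
	 * them inherit a possibly-stale snapshot from the forking task.
	 */
	if (system_supports_poe()) {
		if (args->kthread)
			p->thread.por_el0 = POR_EL0_INIT;	/* unrestricted */
		else
			p->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
	}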

Either way, it would be nice to get an io_uring test in here that
actually spawns kthreads:

tools/testing/selftests/mm/protection_keys.c



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-30 Thread Dave Hansen
On 8/29/24 01:42, Lorenzo Stoakes wrote:
>> These applications work on x86 because x86 does an implicit 47-bit
>> restriction of mmap() address that contain a hint address that is less
>> than 48 bits.
> You mean x86 _has_ to limit to physically available bits in a canonical
> format 🙂 this will not be the case for 5-page table levels though...

By "physically available bits" are you referring to the bits that can be
used as a part of the virtual address?  "Physically" may not have been
the best choice of words. ;)

There's a canonical hole in 4-level paging and 5-level paging on x86.
The 5-level canonical hole is just smaller.

Also, I should probably say that the >47-bit mmap() access hint was more
of a crutch than something that we wanted to make ABI forever.  We knew
that high addresses might break some apps and we hoped that the list of
things it would break would go down over time so that we could
eventually just let mmap() access the whole address space by default.
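
For concreteness, the opt-in looks roughly like this from userspace
(illustrative snippet only):

#include <sys/mman.h>

int main(void)
{
	/* Default hint (NULL or anything below 2^47): stays below 2^47. */
	void *lo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/*
	 * Hint at/above 2^47: this mapping may use the wider address space
	 * (when the hardware/kernel support 57-bit virtual addresses).
	 */
	void *hi = mmap((void *)(1UL << 47), 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return (lo == MAP_FAILED || hi == MAP_FAILED);
}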

That optimism may have been misplaced.



Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Dave Hansen
On 8/28/24 13:15, Charlie Jenkins wrote:
> A way to restrict mmap() to return LAM compliant addresses in an entire
> address space also doesn't have to be mutually exclusive with this flag.
> This flag allows for the greatest degree of control from applications.
> I don't believe there is additionally performance saving that could be
> achieved by having this be on a per address space basis.

I agree with you in general.  The MAP_BELOW_HINT _is_ the most flexible.
 But it's also rather complicated.

My _hope_ would be that a per-address-space property could share at
least some infrastructure with what x86/LAM and arm/TBI do to the
address space.  Basically put the restrictions in place for purely
software reasons instead of the mostly hardware reasons for LAM/TBI.

Lorenzo also raised some very valid points about a having a generic
address-restriction ABI.  I'm certainly not discounting those concerns.
It's not something that can be done lightly.



Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT

2024-08-28 Thread Dave Hansen
On 8/27/24 22:49, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the maximum address space,
> unless the hint address is greater than this value.

Which applications are these, btw?

Is this the same crowd as the folks who are using the address tagging
features like X86_FEATURE_LAM?

Even if they are different, I also wonder if a per-mmap() thing
MAP_BELOW_HINT is really what we want.  Or should the applications
you're trying to service here use a similar mechanism to how LAM affects
the *whole* address space as opposed to an individual mmap().



Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs

2024-07-10 Thread Dave Hansen
On 7/10/24 12:22, Liam R. Howlett wrote:
> The arch_unmap call was previously moved above the rbtree modifications
> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> corruption").  The move was motivated by an issue with calling
> arch_unmap() after the rbtree was modified.
> 
> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> arch_unmap() prior to modifying the vma tree no longer exists
> (regardless of rbtree or maple tree implementations).
> 
> Furthermore, the powerpc implementation is also no longer needed as per
> [1] and [2].  So the arch_unmap() function can be completely removed.

Thanks for doing this cleanup, Liam!

Acked-by: Dave Hansen 


Re: [PATCH 6/7] mm/x86: Add missing pud helpers

2024-06-21 Thread Dave Hansen
On 6/21/24 08:45, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote:
...
>> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
>> to make it Dirty=1,Write=0?  What prevents that from being
>> misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> 
> Thanks for pointing that out.  I think I was thinking it will only take
> effect on VM_SHADOW_STACK first, so it's not?
> 
> I was indeed trying to find more information on shadow stack at that time
> but I can't find as much on the pgtable implications, on e.g. whether "D=1
> + W=0" globally will be recognized as shadow stack.  At least on SDM March
> 2024 version Vol3 Chap4 pgtable entries still don't explain these details,
> or maybe I missed it.  Please let me know if there's suggestion on what I
> can read before I post a v2.

It's in the "Determination of Access Rights" section.

A linear address is a shadow-stack address if the following are
true of the translation of the linear address: (1) the R/W flag
(bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging-
structure entry that maps the page containing the linear
address; and (2) the R/W flag is 1 in every other paging-
structure entry controlling the translation of the linear
address.

> So if it's globally taking effect, indeed we'll need to handle them in PUDs
> too.
> 
> Asides, not sure whether it's off-topic to ask here, but... why shadow
> stack doesn't reuse an old soft-bit to explicitly mark "this is shadow
> stack ptes" when designing the spec?  Now it consumed bit 58 anyway for
> caching dirty. IIUC we can avoid all these "move back and forth" issue on
> dirty bit if so.

The design accommodates "other" OSes that are using all the software
bits for other things.

For Linux, you're right, we just ended up consuming a software bit
_anyway_ so we got all the complexity of the goofy permissions *AND*
lost a bit in the end.  Lose, lose.

>>>  /*
>>>   * mprotect needs to preserve PAT and encryption bits when updating
>>>   * vm_page_prot
>>> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct 
>>> vm_area_struct *vma,
>>>  }
>>>  #endif
>>>  
>>> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
>>> +   unsigned long address, pud_t *pudp, pud_t pud)
>>> +{
>>> +   if (IS_ENABLED(CONFIG_SMP)) {
>>> +   return xchg(pudp, pud);
>>> +   } else {
>>> +   pud_t old = *pudp;
>>> +   WRITE_ONCE(*pudp, pud);
>>> +   return old;
>>> +   }
>>> +}
>>
>> Why is there no:
>>
>>  page_table_check_pud_set(vma->vm_mm, pudp, pud);
>>
>> ?  Sure, it doesn't _do_ anything today.  But the PMD code has it today.
>>  So leaving it out creates a divergence that honestly can only serve to
>> bite us in the future and will create a head-scratching delta for anyone
>> that is comparing PUD and PMD implementations in the future.
> 
> Good question, I really don't remember why I didn't have that, since I
> should have referenced the pmd helper.  I'll add them and see whether I'll
> hit something otherwise.
> 
> Thanks for the review.

One big thing I did in this review was make sure that the PMD and PUD
helpers were doing the same thing.  Would you mind circling back and
double-checking the same before you repost this?



Re: [PATCH 6/7] mm/x86: Add missing pud helpers

2024-06-21 Thread Dave Hansen
On 6/21/24 07:25, Peter Xu wrote:
> These new helpers will be needed for pud entry updates soon.  Namely:
> 
> - pudp_invalidate()
> - pud_modify()

I think it's also definitely worth noting where you got this code from.
Presumably you copied, pasted and modified the PMD code.  That's fine,
but it should be called out.

...
> +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> +{
> + pudval_t val = pud_val(pud), oldval = val;
> +
> + /*
> +  * NOTE: no need to consider shadow stack complexities because it
> +  * doesn't support 1G mappings.
> +  */
> + val &= _HPAGE_CHG_MASK;
> + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> +
> + return __pud(val);
> +}

First of all, the comment to explain what you didn't do here is as many
lines as the code to _actually_ implement it.

Second, I believe this might have missed the purpose of the "shadow
stack complexities".  The pmd/pte code is there not to support modifying
shadow stack mappings, it's there to avoid inadvertent shadow stack
mapping creation.

That "NOTE:" is ambiguous as to whether the shadow stacks aren't
supported on 1G mappings in Linux or the hardware (I just checked the
hardware docs and don't see anything making 1G mappings special, btw).

But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
to make it Dirty=1,Write=0?  What prevents that from being
misinterpreted by the hardware as being a valid 1G shadow stack mapping?
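
For comparison, the PMD side deals with this roughly like so.  This is a
simplified sketch, not the exact upstream pmd_modify() logic, and a PUD
version would need an analogous pud_mksaveddirty() helper, which does
not exist today:

static inline pmd_t pmd_modify_sketch(pmd_t pmd, pgprot_t newprot)
{
	pmdval_t val = pmd_val(pmd), oldval = val;
	pmd_t result;

	val &= _HPAGE_CHG_MASK;
	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
	result = __pmd(val);

	/*
	 * Don't leave a Write=0,Dirty=1 entry behind by accident: move the
	 * hardware Dirty bit into the software "saved dirty" bit instead.
	 */
	if (pmd_dirty(result) && !pmd_write(result))
		result = pmd_mksaveddirty(result);

	return result;
}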

>  /*
>   * mprotect needs to preserve PAT and encryption bits when updating
>   * vm_page_prot
> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct 
> vm_area_struct *vma,
>  }
>  #endif
>  
> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> + unsigned long address, pud_t *pudp, pud_t pud)
> +{
> + if (IS_ENABLED(CONFIG_SMP)) {
> + return xchg(pudp, pud);
> + } else {
> + pud_t old = *pudp;
> + WRITE_ONCE(*pudp, pud);
> + return old;
> + }
> +}

Why is there no:

page_table_check_pud_set(vma->vm_mm, pudp, pud);

?  Sure, it doesn't _do_ anything today.  But the PMD code has it today.
 So leaving it out creates a divergence that honestly can only serve to
bite us in the future and will create a head-scratching delta for anyone
that is comparing PUD and PMD implementations in the future.
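
Concretely, mirroring pmdp_establish(), something like this (sketch):

static inline pud_t pudp_establish(struct vm_area_struct *vma,
				   unsigned long address, pud_t *pudp, pud_t pud)
{
	/* A no-op today, but keeps PUD and PMD paths symmetrical: */
	page_table_check_pud_set(vma->vm_mm, pudp, pud);

	if (IS_ENABLED(CONFIG_SMP)) {
		return xchg(pudp, pud);
	} else {
		pud_t old = *pudp;

		WRITE_ONCE(*pudp, pud);
		return old;
	}
}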


Re: [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit

2024-06-21 Thread Dave Hansen
On 6/21/24 07:25, Peter Xu wrote:
> An entry should be reported as PUD leaf even if it's PROT_NONE, in which
> case PRESENT bit isn't there. I hit bad pud without this when testing dax
> 1G on zapping a PROT_NONE PUD.

Yeah, looks like pmd_leaf() got fixed up because of its involvement in
THP, but PUDs got missed.  This patch also realigns pmd_leaf() and
pud_leaf() behavior, which is important.

Acked-by: Dave Hansen 



Re: [PATCH v4 03/29] mm: use ARCH_PKEY_BITS to define VM_PKEY_BITN

2024-05-03 Thread Dave Hansen
On 5/3/24 06:01, Joey Gouly wrote:
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> -# define VM_PKEY_SHIFT   VM_HIGH_ARCH_BIT_0
> -# define VM_PKEY_BIT0VM_HIGH_ARCH_0  /* A protection key is a 4-bit 
> value */
> -# define VM_PKEY_BIT1VM_HIGH_ARCH_1  /* on x86 and 5-bit value on 
> ppc64   */
> -# define VM_PKEY_BIT2VM_HIGH_ARCH_2
> -# define VM_PKEY_BIT3VM_HIGH_ARCH_3
> -#ifdef CONFIG_PPC
> +# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> +# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
> +# define VM_PKEY_BIT1  VM_HIGH_ARCH_1
> +# define VM_PKEY_BIT2  VM_HIGH_ARCH_2
> +#if CONFIG_ARCH_PKEY_BITS > 3
> +# define VM_PKEY_BIT3  VM_HIGH_ARCH_3
> +#else
> +# define VM_PKEY_BIT3  0
> +#endif
> +#if CONFIG_ARCH_PKEY_BITS > 4

It's certainly not pretty, but it does get the arch #ifdef out of
generic code.  We might need to rethink this if we get another
architecture or two, but this seems manageable for now.

Acked-by: Dave Hansen 


Re: [PATCH v4 02/29] x86/mm: add ARCH_PKEY_BITS to Kconfig

2024-05-03 Thread Dave Hansen
On 5/3/24 06:01, Joey Gouly wrote:
> The new config option specifies how many bits are in each PKEY.

Acked-by: Dave Hansen 


Re: [PATCH v4 09/15] x86/fpu: Fix asm/fpu/types.h include guard

2024-03-29 Thread Dave Hansen
On 3/29/24 00:18, Samuel Holland wrote:
> The include guard should match the filename, or it will conflict with
> the newly-added asm/fpu.h.

Acked-by: Dave Hansen 


Re: [PATCH v4 10/15] x86: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2024-03-29 Thread Dave Hansen
On 3/29/24 00:18, Samuel Holland wrote:
> +#
> +# CFLAGS for compiling floating point code inside the kernel.
> +#
> +CC_FLAGS_FPU := -msse -msse2
> +ifdef CONFIG_CC_IS_GCC
> +# Stack alignment mismatch, proceed with caution.
> +# GCC < 7.1 cannot compile code using `double` and 
> -mpreferred-stack-boundary=3
> +# (8B stack alignment).
> +# See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
> +#
> +# The "-msse" in the first argument is there so that the
> +# -mpreferred-stack-boundary=3 build error:
> +#
> +#  -mpreferred-stack-boundary=3 is not between 4 and 12
> +#
> +# can be triggered. Otherwise gcc doesn't complain.
> +CC_FLAGS_FPU += -mhard-float
> +CC_FLAGS_FPU += $(call cc-option,-msse 
> -mpreferred-stack-boundary=3,-mpreferred-stack-boundary=4)
> +endif

I was expecting to see this (now duplicate) hunk come _out_ of
lib/Makefile somewhere in the series.

Did I miss that, or is there something keeping the duplicate there?


Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-03-14 Thread Dave Hansen
On 3/14/24 09:45, John Baldwin wrote:
> On 3/14/24 8:37 AM, Dave Hansen wrote:
>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>> Add a new .note section containing type, size, offset and flags of
>>> every xfeature that is present.
>>
>> Mechanically, I'd much rather have all of that info in the cover letter
>> in the actual changelog instead.
>>
>> I'd also love to see a practical example of what an actual example core
>> dump looks like on two conflicting systems:
>>
>>     * Total XSAVE size
>>     * XCR0 value
>>     * XSTATE_BV from the core dump
>>     * XFEATURE offsets for each feature
> 
> I noticed this when I bought an AMD Ryzen 9 5900X based system for
> my desktop running FreeBSD and found that the XSAVE core dump notes
> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
> that matches the same layout of NT_X86_XSTATE used by Linux).

I just want to make sure that you heard what I asked.  I'd like to see a
practical example of how the real-world enumeration changes between two
real world systems.

Is that possible?

Here's the raw CPUID data from the XSAVE region on my laptop:

> 0x000d 0x00: eax=0x02e7 ebx=0x0a88 ecx=0x0a88 edx=0x
> 0x000d 0x01: eax=0x000f ebx=0x0998 ecx=0x3900 edx=0x
> 0x000d 0x02: eax=0x0100 ebx=0x0240 ecx=0x edx=0x
> 0x000d 0x05: eax=0x0040 ebx=0x0440 ecx=0x edx=0x
> 0x000d 0x06: eax=0x0200 ebx=0x0480 ecx=0x edx=0x
> 0x000d 0x07: eax=0x0400 ebx=0x0680 ecx=0x edx=0x
> 0x000d 0x08: eax=0x0080 ebx=0x ecx=0x0001 edx=0x
> 0x000d 0x09: eax=0x0008 ebx=0x0a80 ecx=0x edx=0x
> 0x000d 0x0b: eax=0x0010 ebx=0x ecx=0x0001 edx=0x
> 0x000d 0x0c: eax=0x0018 ebx=0x ecx=0x0001 edx=0x
> 0x000d 0x0d: eax=0x0008 ebx=0x ecx=0x0001 edx=0x

Could we get that for an impacted AMD system, please?

cpuid -1 --raw | grep "   0x000d "

should do it.

> In particular, as the cover letter notes, on this AMD processor,
> there is no "gap" for MPX, so the PKRU registers it provides are at a
> different offset than on Intel CPUs.  Furthermore, my reading of the
> SDM is that there is no guarantee of architectural offsets of a given
> XSAVE feature and that software should be querying CPUID to determine
> the layout.

I'd say the SDM is an aspirational document.  Intel _aspires_ to be able
to change the layouts whenever it wants.  That doesn't mean that it
could actually pull it off in practice.

In practice, the offsets are fixed and Intel can't change them.

> FWIW, the relevant CPUID leaves for my AMD system:
> 
>    XSAVE features (0xd/0):
>   XCR0 valid bit field mask   = 0x0207
>   x87 state    = true
...

So, those actually aren't the relevant ones.  We need EAX (size) and EBX
(user offset) from all of the sub-leaves.

>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
> 
> So the current note I initially proposed and implemented for FreeBSD
> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
> do indeed dump a raw set of CPUID leaves.  The version I have for FreeBSD
> only dumps the raw leaf values for leaf 0x0d though the note format is
> extensible should additional leaves be needed in the future.  One of the
> questions if we wanted to use a CPUID leaf note is which leaves to dump
> (e.g. do you dump all of them, or do you just dump the subset that is
> currently needed).

You dump what is needed and add to the dump over time.

> Another quirky question is what to do about systems with hetergeneous
> cores (E vs P for example).

That's irrelevant for now.  The cores may be heterogeneous but the
userspace ISA (and thus the XSAVE format) is identical.  If they're
not, then we have bigger problems on our hands.

> Currently those systems use the same XSAVE layout across all cores,
> but other CPUID leaves do already vary across cores on those systems.

There shouldn't be any CPUID leaves that differ _and_ matter to
userspace and thus core dumps.

> However, t

Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-03-14 Thread Dave Hansen
On 3/14/24 09:29, Borislav Petkov wrote:
> 
>> That argument breaks down a bit on the flags though:
>>
>>  xc.xfeat_flags = xstate_flags[i];
>>
>> Because it comes _directly_ from CPUID with zero filtering:
>>
>>  cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
>>  ...
>>  xstate_flags[i] = ecx;
>>
>> So this layout is quite dependent on what's in x86's CPUID.
> Yeah, no, this should not be copying CPUID flags - those flags should be
> *translated* to independently defined flags which describe those
> buffers.

Ditto for:

xc.xfeat_type = i;

Right now, that's bound to CPUID and XSAVE.  "feat_type==10" can only
ever be PKRU and that's derived from the XSAVE architecture.

If you want this to be extensible to things outside of the XSAVE
architecture, it needs to be something actually extensible and not
entangled with XSAVE.

In other words "xc.xfeat_type" can enumerate XSAVE state components
being in the dump, but it should not be limited to XSAVE.  Just as an
example:

	enum feat_type {
		FEATURE_XSAVE_PKRU,
		FEATURE_XSAVE__YMM,
		FEATURE_XSAVE_BNDREGS,
		FEATURE_XSAVE_BNDCSR,
		...
		RANDOM_STATE_NOT_XSAVE
	};

See how feat_type==1 is PKRU and *NOT* feat_type==10?  That opens the
door to RANDOM_STATE_NOT_XSAVE or anything else you want.  This would be
_actually_ extensible.


Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-03-14 Thread Dave Hansen
On 3/14/24 09:08, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 08:37:09AM -0700, Dave Hansen wrote:
>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
> 
> Funny you should say that. This was what they had done originally but if
> you dump CPUID and you want to add another component in the future which
> is *not* described by CPUID, your scheme breaks.

Are you envisioning an *XSAVE* state component that's not described by
CPUID?

Or some _other_ (non-XSAVE) component in a core dump that isn't
described by CPUID?

> So the idea is to have a self-describing buffers layout, independent
> from any x86-ism. You can extend this in a straight-forward way then
> later.

That argument breaks down a bit on the flags though:

xc.xfeat_flags = xstate_flags[i];

Because it comes _directly_ from CPUID with zero filtering:

cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
...
xstate_flags[i] = ecx;

So this layout is quite dependent on what's in x86's CPUID.


Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

2024-03-14 Thread Dave Hansen
On 3/14/24 04:23, Vignesh Balasubramanian wrote:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.

Mechanically, I'd much rather have all of that info in the cover letter
in the actual changelog instead.

I'd also love to see a practical example of what an actual example core
dump looks like on two conflicting systems:

   * Total XSAVE size
   * XCR0 value
   * XSTATE_BV from the core dump
   * XFEATURE offsets for each feature

Do you have any information about what other OSes are doing in this
area?  I thought Windows, for instance, was even less flexible about the
XSAVE format than Linux is.

Why didn't LWP cause this problem?

From the cover letter:

> But this patch series depends on heuristics based on the total XSAVE
> register set size and the XCR0 mask to infer the layouts of the
> various register blocks for core dumps, and hence, is not a foolproof
> mechanism to determine the layout of the XSAVE area.

It may not be theoretically foolproof.  But I'm struggling to think of a
case where it would matter in practice.  Is there any CPU from any
vendor where this is actually _needed_?

Sure, it's ugly as hell, but these notes aren't going to be available
universally _ever_, so it's not like the crummy heuristic code gets to
go away.

Have you seen the APX spec?

> https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

It makes this even more fun because it adds a new XSAVE state component,
but reuses the MPX offsets.

> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.

This is pretty close to just a raw dump of the XSAVE CPUID leaves.
Rather than come up with an XSAVE-specific ABI that depends on CPUID
*ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
should just bite the bullet and dump out (some of) the raw CPUID space.




Re: [PATCH 3/8] arch/x86: Remove sentinel elem from ctl_table arrays

2023-09-06 Thread Dave Hansen
On 9/6/23 03:03, Joel Granados via B4 Relay wrote:
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel element from sld_sysctl and itmt_kern_table.

There's a *LOT* of content to read for a reviewer to figure out what's
going on here between all the links.  I would have appreciated one more
sentence here, maybe:

This is now safe because the sysctl registration code
(register_sysctl()) implicitly uses ARRAY_SIZE() in addition
to checking for a sentinel.

That needs to be more prominent _somewhere_.  Maybe here, or maybe in
the cover letter, but _somewhere_.
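
Just to spell out what that buys (illustrative only, made-up names):
once register_sysctl() derives the table size itself, a table looks
like this with no terminating entry:

#include <linux/sysctl.h>

static int example_knob;

static struct ctl_table example_table[] = {
	{
		.procname	= "example_knob",
		.data		= &example_knob,
		.maxlen		= sizeof(example_knob),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	/* no empty {} sentinel needed any more */
};

/* ... and registration stays the same: */
/* register_sysctl("kernel", example_table); */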

That said, feel free to add this to the two x86 patches:

Acked-by: Dave Hansen  # for x86


Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-26 Thread Dave Hansen
On 6/26/23 07:36, ypode...@redhat.com wrote:
> On Thu, 2023-06-22 at 06:37 -0700, Dave Hansen wrote:
>> On 6/22/23 06:14, ypode...@redhat.com wrote:
>>> I will send a new version with the local variable as you suggested
>>> soon.
>>> As for the config name, what about CONFIG_ARCH_HAS_MM_CPUMASK?
>>
>> The confusing part about that name is that mm_cpumask() and
>> mm->cpu_bitmap[] are defined unconditionally.  So, they're *around*
>> unconditionally but just aren't updated.
>>
> I think your right about the config name,
> How about the
> CONFIG_ARCH_USE_MM_CPUMASK?
> This has the right semantic as these archs use the cpumask field of the
> mm struct.

"USE" is still a command.  It should, at worst, be "USES".  But that's
still kinda generic.  How about:

CONFIG_ARCH_UPDATES_MM_CPUMASK

?

>> BTW, it would also be nice to have _some_ kind of data behind this
>> patch.
>>
>> Fewer IPIs are better I guess, but it would still be nice if you
>> could say:
>>
>>  Before this patch, /proc/interrupts showed 123 IPIs/hour for an
>>  isolated CPU.  After the approach here, it was 0.
>>
>> ... or something.
> 
> This is part of an ongoing effort to remove IPIs and this one was found
> via code inspection.

OK, so it should be something more like:

This was found via code inspection, but fixing it isn't very
important so we didn't bother to test it any more than just
making sure the thing still boots when it is applied.

Does that cover it?



Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-22 Thread Dave Hansen
On 6/22/23 06:14, ypode...@redhat.com wrote:
> I will send a new version with the local variable as you suggested
> soon.
> As for the config name, what about CONFIG_ARCH_HAS_MM_CPUMASK?

The confusing part about that name is that mm_cpumask() and
mm->cpu_bitmap[] are defined unconditionally.  So, they're *around*
unconditionally but just aren't updated.

BTW, it would also be nice to have _some_ kind of data behind this patch.

Fewer IPIs are better I guess, but it would still be nice if you could say:

Before this patch, /proc/interrupts showed 123 IPIs/hour for an
isolated CPU.  After the approach here, it was 0.

... or something.


Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-21 Thread Dave Hansen
On 6/20/23 07:46, Yair Podemsky wrote:
> -void tlb_remove_table_sync_one(void)
> +#ifdef CONFIG_ARCH_HAS_CPUMASK_BITS
> +#define REMOVE_TABLE_IPI_MASK mm_cpumask(mm)
> +#else
> +#define REMOVE_TABLE_IPI_MASK cpu_online_mask
> +#endif /* CONFIG_ARCH_HAS_CPUMASK_BITS */
> +
> +void tlb_remove_table_sync_one(struct mm_struct *mm)
>  {
>   /*
>* This isn't an RCU grace period and hence the page-tables cannot be
> @@ -200,7 +206,8 @@ void tlb_remove_table_sync_one(void)
>* It is however sufficient for software page-table walkers that rely on
>* IRQ disabling.
>*/
> - smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> + on_each_cpu_mask(REMOVE_TABLE_IPI_MASK, tlb_remove_table_smp_sync,
> + NULL, true);
>  }

That "REMOVE_TABLE_IPI_MASK" thing is pretty confusing.  It *looks* like
a constant.  It does *NOT* look at all like it consumes 'mm'.  Worst
case, just create a local variable:

	if (IS_ENABLED(CONFIG_ARCH_HAS_CPUMASK_BITS))
		ipi_mask = mm_cpumask(mm);
	else
		ipi_mask = cpu_online_mask;

	on_each_cpu_mask(ipi_mask, ...);

That's a billion times more clear and it'll compile down to the same thing.

I do think the CONFIG_ARCH_HAS_CPUMASK_BITS naming is also pretty
confusing, but I don't have any better suggestions.  Maybe something
with "MM_CPUMASK" in it?


Re: [PATCH] procfs: consolidate arch_report_meminfo declaration

2023-05-16 Thread Dave Hansen
On 5/16/23 12:57, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The arch_report_meminfo() function is provided by four architectures,
> with a __weak fallback in procfs itself. On architectures that don't
> have a custom version, the __weak version causes a warning because
> of the missing prototype.
> 
> Remove the architecture specific prototypes and instead add one
> in linux/proc_fs.h.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/parisc/include/asm/pgtable.h| 3 ---
>  arch/powerpc/include/asm/pgtable.h   | 3 ---
>  arch/s390/include/asm/pgtable.h  | 3 ---
>  arch/s390/mm/pageattr.c  | 1 +
>  arch/x86/include/asm/pgtable.h   | 1 +
>  arch/x86/include/asm/pgtable_types.h | 3 ---

Looks sane.  Thanks Arnd!

Acked-by: Dave Hansen  # for arch/x86


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread Dave Hansen
On 4/11/23 04:35, Mark Rutland wrote:
> I agree it'd be nice to have performance figures, but I think those would only
> need to demonstrate a lack of a regression rather than a performance
> improvement, and I think it's fairly clear from eyeballing the generated
> instructions that a regression isn't likely.

Thanks for the additional context.

I totally agree that there's zero burden here to show a performance
increase.  If anyone can think of a quick way to do _some_ kind of
benchmark on the code being changed and just show that it's free of
brown paper bags, it would be appreciated.  Nothing crazy, just think of
one workload (synthetic or not) that will stress the paths being changed
and run it with and without these changes.  Make sure there are not
surprises.

I also agree that it's unlikely to be brown paper bag material.


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-05 Thread Dave Hansen
On 4/5/23 07:17, Uros Bizjak wrote:
> Add generic and target specific support for local{,64}_try_cmpxchg
> and wire up support for all targets that use local_t infrastructure.

I feel like I'm missing some context.

What are the actual end user visible effects of this series?  Is there a
measurable decrease in perf overhead?  Why go to all this trouble for
perf?  Who else will use local_try_cmpxchg()?

I'm all for improving things, and perf is an important user.  But, if
the goal here is improving performance, it would be nice to see at least
a stab at quantifying the performance delta.


Re: [PATCH 0/3] COVER: Remove memcpy_page_flushcache()

2023-03-15 Thread Dave Hansen
On 3/15/23 16:20, Ira Weiny wrote:
> Commit 21b56c847753 ("iov_iter: get rid of separate bvec and xarray 
> callbacks") removed the calls to memcpy_page_flushcache().
> 
> kmap_atomic() is deprecated and used in the x86 version of
> memcpy_page_flushcache().
> 
> Remove the unnecessary memcpy_page_flushcache() call from all arch's.

Hi Ira,

Since the common code user is already gone these three patches seem
quite independent.  It seems like the right thing to do is have
individual arch maintainers cherry pick their arch patch and carry it
independently.

Is there a compelling reason to have someone pick up and carry these all
together that I'm missing?


Re: [RFC PATCH v2 0/3] powerpc/pseries: add support for local secure storage called Platform KeyStore(PKS)

2022-06-27 Thread Dave Hansen
On 6/22/22 14:56, Nayna Jain wrote:
> * Renamed PKS driver to PLPKS to avoid naming conflict as mentioned by
> Dave Hanson.

Thank you for doing this!  The new naming looks much less likely to
cause confusion.


Re: [RFC PATCH 3/6] testing/pkeys: Add additional test for pkey_alloc()

2022-06-16 Thread Dave Hansen
On 6/16/22 12:25, Sohil Mehta wrote:
> Should we have different return error codes when compile support is 
> disabled vs when runtime support is missing?

It doesn't *really* matter.  Programs have to be able to run on old
kernels which will return ENOSYS.  So, _when_ new kernels return ENOSYS
or ENOSPC is pretty immaterial.


Re: [PATCH 5/5] x86/pkeys: Standardize on u8 for pkey type

2022-03-15 Thread Dave Hansen
On 3/15/22 08:53, Ira Weiny wrote:
> On Mon, Mar 14, 2022 at 04:49:12PM -0700, Dave Hansen wrote:
>> On 3/10/22 16:57, ira.we...@intel.com wrote:
>>> From: Ira Weiny 
>>>
>>> The number of pkeys supported on x86 and powerpc are much smaller than a
>>> u16 value can hold.  It is desirable to standardize on the type for
>>> pkeys.  powerpc currently supports the most pkeys at 32.  u8 is plenty
>>> large for that.
>>>
>>> Standardize on the pkey types by changing u16 to u8.
>>
>> How widely was this intended to "standardize" things?  Looks like it may
>> have missed a few spots.
> 
> Sorry I think the commit message is misleading you.  The justification of u8 
> as
> the proper type is that no arch has a need for more than 255 pkeys.
> 
> This specific patch was intended to only change x86.  Per that goal I don't 
> see
> any other places in x86 which uses u16 after this patch.
> 
> $ git grep u16 arch/x86 | grep key
> arch/x86/events/intel/uncore_discovery.c: const u16 *type_id = key;
> arch/x86/include/asm/intel_pconfig.h: u16 keyid;
> arch/x86/include/asm/mmu.h:   u16 pkey_allocation_map;
> arch/x86/include/asm/pkeys.h: u16 all_pkeys_mask = ((1U << arch_max_pkey()) - 
> 1);

I was also looking at the generic mm code.

>> Also if we're worried about the type needing to change or with the wrong
>> type being used, I guess we could just do a pkey_t typedef.
> 
> I'm not 'worried' about it.  But I do think it makes the code cleaner and more
> self documenting.

Yeah, consistency is good.  Do you mind taking a look at how a pkey_t
would look, and also seeing how much core mm code should use it?
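
Just as a strawman (pkey_t does not exist today; names are made up):

/*
 * Hypothetical sketch: no architecture needs more than 255 pkeys
 * (powerpc tops out at 32), so u8 is wide enough, and a dedicated type
 * keeps the width consistent between arch and core-mm code.
 */
typedef u8 pkey_t;

static inline bool example_pkey_is_valid(pkey_t pkey)
{
	return pkey < arch_max_pkey();
}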


Re: [PATCH 5/5] x86/pkeys: Standardize on u8 for pkey type

2022-03-14 Thread Dave Hansen
On 3/10/22 16:57, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> The number of pkeys supported on x86 and powerpc are much smaller than a
> u16 value can hold.  It is desirable to standardize on the type for
> pkeys.  powerpc currently supports the most pkeys at 32.  u8 is plenty
> large for that.
> 
> Standardize on the pkey types by changing u16 to u8.

How widely was this intended to "standardize" things?  Looks like it may
have missed a few spots.

Also if we're worried about the type needing to change or with the wrong
type being used, I guess we could just do a pkey_t typedef.


Re: [RFC PATCH 0/2] powerpc/pseries: add support for local secure storage called Platform Keystore(PKS)

2022-02-01 Thread Dave Hansen
On 1/21/22 16:56, Nayna Jain wrote:
> Nayna Jain (2):
>   pseries: define driver for Platform Keystore
>   pseries: define sysfs interface to expose PKS variables

Hi Folks,

There another feature that we might want to consider in the naming here:

> https://lore.kernel.org/all/20220127175505.851391-1-ira.we...@intel.com/

Protection Keys for Supervisor pages is also called PKS.  It's also not
entirely impossible that powerpc might want to start using this code at
some point, just like what happened with the userspace protection keys[1].

I don't think it's the end of the world either way, but it might save a
hapless user or kernel developer some confusion if we can avoid
including two "PKS" features in the kernel.  I just wanted to make sure
we were aware of the other's existence. :)

1.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/include/asm/pkeys.h


Re: [PATCH v2 3/3] x86: Support huge vmalloc mappings

2022-01-18 Thread Dave Hansen
On 1/17/22 6:46 PM, Nicholas Piggin wrote:
>> This all sounds very fragile to me.  Every time a new architecture would
>> get added for huge vmalloc() support, the developer needs to know to go
>> find that architecture's module_alloc() and add this flag.
> This is documented in the Kconfig.
> 
>  #
>  #  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>  #  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>  #  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP 
> flag
>  #  can be used to prohibit arch-specific allocations from using hugepages to
>  #  help with this (e.g., modules may require it).
>  #
>  config HAVE_ARCH_HUGE_VMALLOC
>  depends on HAVE_ARCH_HUGE_VMAP
>  bool
> 
> Is it really fair to say it's *very* fragile? Surely it's reasonable to 
> read the (not very long) documentation ad understand the consequences for
> the arch code before enabling it.

Very fragile or not, I think folks are likely to get it wrong.  It would
be nice to have it default *everyone* to safe and slow and make *sure*
they go look at the architecture modules code itself before enabling
this for modules.

Just from that Kconfig text, I don't think I'd know off the top of my
head what to do for x86, or what code I needed to go touch.


Re: [PATCH v2 3/3] x86: Support huge vmalloc mappings

2021-12-28 Thread Dave Hansen
On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>> There are some disadvantages about this feature[2], one of the main
>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>> also archs must ensure that any arch specific vmalloc allocations that
>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>> subsequent permission changes just fragment the 2M mapping?
> 
> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
> 
> When module alloc with STRICT_MODULE_RWX on x86, it calls
> __change_page_attr()
> 
> from set_memory_ro/rw/nx which will split large page, so there is no
> need to make
> 
> module alloc with HUGE_VMALLOC.

This all sounds very fragile to me.  Every time a new architecture would
get added for huge vmalloc() support, the developer needs to know to go
find that architecture's module_alloc() and add this flag.  They next
guy is going to forget, just like you did.

Considering that this is not a hot path, a weak function would be a nice
choice:

/* vmalloc() flags used for all module allocations. */
unsigned long __weak arch_module_vm_flags(void)
{
	/*
	 * Modules use a single, large vmalloc().  Different
	 * permissions are applied later and will fragment
	 * huge mappings.  Avoid using huge pages for modules.
	 */
	return VM_NO_HUGE_VMAP;
}

Stick that in some the common module code, next to:

> void * __weak module_alloc(unsigned long size)
> {
> return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
...

Then, put arch_module_vm_flags() in *all* of the module_alloc()
implementations, including the generic one.  That way (even with a new
architecture) whoever copies-and-pastes their module_alloc()
implementation is likely to get it right.  The next guy who just does a
"select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.

VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.


Re: [PATCH v2 3/3] x86: Support huge vmalloc mappings

2021-12-27 Thread Dave Hansen
On 12/27/21 6:59 AM, Kefeng Wang wrote:
> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
> support huge vmalloc mappings.

In general, this seems interesting and the diff is simple.  But, I don't
see _any_ x86-specific data.  I think the bare minimum here would be a
few kernel compiles and some 'perf stat' data for some TLB events.

> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 95fa745e310a..6bf5cb7d876a 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size)
>  
>   p = __vmalloc_node_range(size, MODULE_ALIGN,
>   MODULES_VADDR + get_module_load_offset(),
> - MODULES_END, gfp_mask,
> - PAGE_KERNEL, VM_DEFER_KMEMLEAK, 
> NUMA_NO_NODE,
> + MODULES_END, gfp_mask, PAGE_KERNEL,
> + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, 
> NUMA_NO_NODE,
>   __builtin_return_address(0));
>   if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
>   vfree(p);

To figure out what's going on in this hunk, I had to look at the cover
letter (which I wasn't cc'd on).  That's not great and it means that
somebody who stumbles upon this in the code is going to have a really
hard time figuring out what is going on.  Cover letters don't make it
into git history.

This desperately needs a comment and some changelog material in *this*
patch.

But, even the description from the cover letter is sparse:

> There are some disadvantages about this feature[2], one of the main
> concerns is the possible memory fragmentation/waste in some scenarios,
> also archs must ensure that any arch specific vmalloc allocations that
> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.

That just says that x86 *needs* PAGE_SIZE allocations.  But, what
happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
subsequent permission changes just fragment the 2M mapping?


Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

2021-12-18 Thread Dave Hansen
On 12/18/21 6:31 AM, Nikita Yushchenko wrote:
>>> This allows archs to optimize it, by
>>> freeing multiple tables in a single release_pages() call. This is
>>> faster than individual put_page() calls, especially with memcg
>>> accounting enabled.
>>
>> Could we quantify "faster"?  There's a non-trivial amount of code being
>> added here and it would be nice to back it up with some cold-hard
>> numbers.
> 
> I currently don't have numbers for this patch taken alone. This patch
> originates from work done some years ago to reduce cost of memory
> accounting, and x86-only version of this patch was in virtuozzo/openvz
> kernel since then. Other patches from that work have been upstreamed,
> but this one was missed.
> 
> Still it's obvious that release_pages() shall be faster that a loop
> calling put_page() - isn't that exactly the reason why release_pages()
> exists and is different from a loop calling put_page()?

Yep, but this patch does a bunch of stuff to some really hot paths.  It
would be greatly appreciated if you could put in the effort to actually
put some numbers behind this.  Plenty of weird stuff happens on
computers that we suck at predicting.

I'd be happy with even a quick little micro.  My favorite is:

https://github.com/antonblanchard/will-it-scale

Although, I do wonder if anything will even be measurable.  Please at
least try.

...
>> But, even more than that, do all the architectures even need the
>> free_swap_cache()?
> 
> I was under impression that process page tables are a valid target for
> swapping out. Although I can be wrong here.

It's not out of the realm of possibilities.  But, last I checked, the
only path we free page tables in was when VMAs are being torn down.  I
have a longstanding TODO item to reclaim them if they're empty (all
zeros) or to zero them out if they're mapping page cache.


Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

2021-12-17 Thread Dave Hansen
On 12/17/21 12:19 AM, Nikita Yushchenko wrote:
> When batched page table freeing via struct mmu_table_batch is used, the
> final freeing in __tlb_remove_table_free() executes a loop, calling
> arch hook __tlb_remove_table() to free each table individually.
> 
> Shift that loop down to archs. This allows archs to optimize it, by
> freeing multiple tables in a single release_pages() call. This is
> faster than individual put_page() calls, especially with memcg
> accounting enabled.

Could we quantify "faster"?  There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.

> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -95,11 +95,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
> page *page, int page_
>  
>  static void __tlb_remove_table_free(struct mmu_table_batch *batch)
>  {
> - int i;
> -
> - for (i = 0; i < batch->nr; i++)
> - __tlb_remove_table(batch->tables[i]);
> -
> + __tlb_remove_tables(batch->tables, batch->nr);
>   free_page((unsigned long)batch);
>  }

This leaves a single call-site for __tlb_remove_table():

> static void tlb_remove_table_one(void *table)
> {
> tlb_remove_table_sync_one();
> __tlb_remove_table(table);
> }

Is that worth it, or could it just be:

__tlb_remove_tables(&table, 1);

?

> -void free_pages_and_swap_cache(struct page **pages, int nr)
> +static void __free_pages_and_swap_cache(struct page **pages, int nr,
> + bool do_lru)
>  {
> - struct page **pagep = pages;
>   int i;
>  
> - lru_add_drain();
> + if (do_lru)
> + lru_add_drain();
>   for (i = 0; i < nr; i++)
> - free_swap_cache(pagep[i]);
> - release_pages(pagep, nr);
> + free_swap_cache(pages[i]);
> + release_pages(pages, nr);
> +}
> +
> +void free_pages_and_swap_cache(struct page **pages, int nr)
> +{
> + __free_pages_and_swap_cache(pages, nr, true);
> +}
> +
> +void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
> +{
> + __free_pages_and_swap_cache(pages, nr, false);
>  }

This went unmentioned in the changelog.  But, it seems like there's a
specific optimization here.  In the existing code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU.  It doesn't need the lru_add_drain().

Any code that knows it is freeing all non-LRU pages can call
free_pages_and_swap_cache_nolru() which should perform better than
free_pages_and_swap_cache().

Should we add this to the for loop in __free_pages_and_swap_cache()?

	for (i = 0; i < nr; i++) {
		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pagep[i]), pagep[i]);
		free_swap_cache(...);
	}

But, even more than that, do all the architectures even need the
free_swap_cache()?  PageSwapCache() will always be false on x86, which
makes the loop kinda silly.  x86 could, for instance, just do:

static inline void __tlb_remove_tables(void **tables, int nr)
{
	release_pages((struct page **)tables, nr);
}

I _think_ this will work everywhere that has whole pages as page tables.
 Taking that one step further, what if we only had one generic:

static inline void tlb_remove_tables(void **tables, int nr)
{
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}

Architectures that set ARCH_PAGE_TABLES_ARE_FULL_PAGE (or whatever)
don't need to implement __tlb_remove_table() at all *and* can do
release_pages() directly.

This avoids all the  confusion with the swap cache and LRU naming.


Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()

2021-03-17 Thread Dave Hansen
On 3/16/21 6:52 PM, Kefeng Wang wrote:
> mem_init_print_info() is called in mem_init() on each architecture,
> and pass NULL argument, so using void argument and move it into mm_init().
> 
> Acked-by: Dave Hansen 

It's not a big deal but you might want to say something like:

Acked-by: Dave Hansen  # x86 bits

Just to make it clear that I didn't look at the alpha bits at all. :)


Re: [PATCH] mm: Move mem_init_print_info() into mm_init()

2021-03-16 Thread Dave Hansen
On 3/16/21 7:26 AM, Kefeng Wang wrote:
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5430c81eefc9..aa8387aab9c1 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1350,8 +1350,6 @@ void __init mem_init(void)
>   kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, 
> KCORE_USER);
>  
>   preallocate_vmalloc_pages();
> -
> - mem_init_print_info(NULL);
>  }

Ignoring any issues with the printk...

Looks harmless enough on x86.  The 32-bit code has some cruft in
mem_init() after mem_init_print_info(), so this patch will change the
location of the mem_init_print_info(), but I think it's actually for the
better, since it will be pushed later in boot.  As long as the x86
pieces stay the same:

Acked-by: Dave Hansen 


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-16 Thread Dave Hansen
On 11/16/20 8:32 AM, Matthew Wilcox wrote:
>>
>> That's really the best we can do from software without digging into
>> microarchitecture-specific events.
> I mean this is perf.  Digging into microarch specific events is what it
> does ;-)

Yeah, totally.

But, if we see a bunch of 4k TLB hit events, it's still handy to know
that those 4k TLB hits originated from a 2M page table entry.  This
series just makes sure that perf has the data about the page table
mapping sizes regardless of what the microarchitecture does with it.

I'm just saying we need to make the descriptions in this perf feature
specifically about the page tables, not the TLB.


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-16 Thread Dave Hansen
On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> It gets even more complicated with CPUs with multiple levels of TLB
> which support different TLB entry sizes.  My CPU reports:
> 
> TLB info
>  Instruction TLB: 2M/4M pages, fully associative, 8 entries
>  Instruction TLB: 4K pages, 8-way associative, 64 entries
>  Data TLB: 1GB pages, 4-way set associative, 4 entries
>  Data TLB: 4KB pages, 4-way associative, 64 entries
>  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries

It's even "worse" on recent AMD systems.  Those will coalesce multiple
adjacent PTEs into a single TLB entry.  I think Alphas did something
like this back in the day with an opt-in.

Anyway, the changelog should probably replace:

> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> page sizes.

with something more like:

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
table mapping sizes.

That's really the best we can do from software without digging into
microarchitecture-specific events.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Dave Hansen
On 10/12/20 9:19 AM, Eric Biggers wrote:
> On Sun, Oct 11, 2020 at 11:56:35PM -0700, Ira Weiny wrote:
>>> And I still don't really understand.  After this patchset, there is still 
>>> code
>>> nearly identical to the above (doing a temporary mapping just for a memcpy) 
>>> that
>>> would still be using kmap_atomic().
>> I don't understand.  You mean there would be other call sites calling:
>>
>> kmap_atomic()
>> memcpy()
>> kunmap_atomic()
> Yes, there are tons of places that do this.  Try 'git grep -A6 kmap_atomic'
> and look for memcpy().
> 
> Hence why I'm asking what will be the "recommended" way to do this...
> kunmap_thread() or kmap_atomic()?

kmap_atomic() is always preferred over kmap()/kmap_thread().
kmap_atomic() is _much_ more lightweight since its TLB invalidation is
always CPU-local and never broadcast.

So, basically, unless you *must* sleep while the mapping is in place,
kmap_atomic() is preferred.
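
In other words, the short-lived copy pattern stays exactly as it is
today (illustrative only):

#include <linux/highmem.h>
#include <linux/string.h>

static void copy_into_page(struct page *page, const void *src, size_t len)
{
	/* CPU-local mapping, no broadcast TLB shootdown... */
	void *dst = kmap_atomic(page);

	memcpy(dst, src, len);

	/* ...but no sleeping is allowed between map and unmap. */
	kunmap_atomic(dst);
}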



Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Dave Hansen
On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> This only works well as long there are real pagetable pointers involved,
> that can also be used for iteration. For gup_fast, or any other future
> pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> There are pointers involved to local pXd values on the stack, because of
> the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> over such stack pointers instead of pagetable pointers.

By "There are pointers involved to local pXd values on the stack", did
you mean "locate" instead of "local"?  That sentence confused me.

Which code is it, exactly, that allocates these troublesome on-stack pXd
values, btw?

> This will be addressed by making the pXd_addr_end() dynamic, for which
> we need to see the pXd value in order to determine its level / type.

Thanks for the explanation!


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> x86:
> add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
> Function old new   delta
> vmemmap_populate 587 592  +5
> munlock_vma_pages_range  556 561  +5
> Total: Before=15534694, After=15534704, chg +0.00%
...
>  arch/x86/mm/init_64.c| 15 -
>  arch/x86/mm/kasan_init_64.c  | 16 +-

I didn't do a super thorough review on this, but it generally looks OK
and the benefits of sharing more code between arches certainly outweigh
a few bytes of binary growth.  For the x86 bits at least, feel free to
add my ack.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.

Would it be fair to say that the "fake" page table entries s390
allocates on the stack are what's causing the trouble here?  That might
be a nice thing to open up with here.  "Dynamic page table folding"
really means nothing to me.

> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
> unsigned long end,
>   do {
>   pmd_t pmd = READ_ONCE(*pmdp);
>  
> - next = pmd_addr_end(addr, end);
> + next = pmd_addr_end_folded(pmd, addr, end);
>   if (!pmd_present(pmd))
>   return 0;

It looks like you fix this up later, but this would be a problem if left
this way.  There's no documentation for whether I use
pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.



Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dave Hansen
On 4/30/20 8:52 AM, David Hildenbrand wrote:
>> Justifying behavior by documentation that does not consider memory
>> hotplug is bad thinking.
> Are you maybe confusing this patch series with the arm64 approach? This
> is not about ordinary hotplugged DIMMs.
> 
> I'd love to get Dan's, Dave's and Michal's opinion.

The impact on kexec from the DAX "kmem" driver's use of hotplug was
inadvertent and unfortunate.

The problem statement and solution seem pretty sane to me.


Re: [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dave Hansen
On 4/30/20 3:29 AM, David Hildenbrand wrote:
> Currently, when adding memory, we create entries in /sys/firmware/memmap/
> as "System RAM". This does not reflect the reality and will lead to
> kexec-tools to add that memory to the fixed-up initial memmap for a
> kexec kernel (loaded via kexec_load()). The memory will be considered
> initial System RAM by the kexec kernel.
> 
> We should let the kexec kernel decide how to use that memory - just as
> we do during an ordinary reboot.
...
> - rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res),
> + MHP_NO_FIRMWARE_MEMMAP);

Looks fine.  But, if you send another revision, could you add a comment
about the actual goal of MHP_NO_FIRMWARE_MEMMAP?  Maybe:

	/*
	 * MHP_NO_FIRMWARE_MEMMAP ensures that future
	 * kexec'd kernels will not treat this as RAM.
	 */

Not a biggie, though.

Acked-by: Dave Hansen 


Re: [PATCH 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-03-26 Thread Dave Hansen
On 3/26/20 2:56 PM, Mike Kravetz wrote:
> Perhaps it would be best to check hugepages_supported() when parsing
> hugetlb command line options.  If not enabled, throw an error.  This
> will be much easier to do after moving all command line parsing to
> arch independent code.

Yeah, that sounds sane.

> Is that a sufficient way to address this concern?  I think it is a good
> change in any case.

(Thanks to Kirill for pointing this out.)

So, it turns out the x86 huge page enumeration is totally buggered.
X86_FEATURE_PSE is actually meaningless on 64-bit (and 32-bit PAE).  All
CPUs architecturally support 2MB pages regardless of X86_FEATURE_PSE and
the state of CR4.PSE.

So, on x86_64 at least, hugepages_supported() should *always* return 1.

1GB page support can continue to be dependent on X86_FEATURE_GBPAGES.


Re: [PATCH 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-03-18 Thread Dave Hansen
On 3/18/20 3:52 PM, Mike Kravetz wrote:
> Sounds good.  I'll incorporate those changes into a v2, unless someone
> else with has a different opinion.
> 
> BTW, this patch should not really change the way the code works today.
> It is mostly a movement of code.  Unless I am missing something, the
> existing code will always allow setup of PMD_SIZE hugetlb pages.

Hah, I totally skipped over the old code in the diff.

It looks like we'll disable hugetlbfs *entirely* if PSE isn't supported.
I think this is actually wrong, but nobody ever noticed.  I think you'd
have to be running as a guest under a hypervisor that's lying about PSE
not being supported *and* care about 1GB pages.  Nobody does that.


Re: [PATCH 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-03-18 Thread Dave Hansen
Hi Mike,

The series looks like a great idea to me.  One nit on the x86 bits,
though...

> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 5bfd5aef5378..51e6208fdeec 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -181,16 +181,25 @@ hugetlb_get_unmapped_area(struct file *file, unsigned 
> long addr,
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
>  #ifdef CONFIG_X86_64
> +bool __init arch_hugetlb_valid_size(unsigned long long size)
> +{
> + if (size == PMD_SIZE)
> + return true;
> + else if (size == PUD_SIZE && boot_cpu_has(X86_FEATURE_GBPAGES))
> + return true;
> + else
> + return false;
> +}

I'm pretty sure it's possible to have a system without 2M/PMD page
support.  We even have a handy-dandy comment about it in
arch/x86/include/asm/required-features.h:

#ifdef CONFIG_X86_64
#ifdef CONFIG_PARAVIRT
/* Paravirtualized systems may not have PSE or PGE available */
#define NEED_PSE	0
...

I *think* you need an X86_FEATURE_PSE check here to be totally correct.

if (size == PMD_SIZE && cpu_feature_enabled(X86_FEATURE_PSE))
return true;

BTW, I prefer cpu_feature_enabled() to boot_cpu_has() because it
includes disabled-features checking.  I don't think any of it matters
for these specific features, but I generally prefer it on principle.



Re: [PATCH] treewide: Rename "unencrypted" to "decrypted"

2020-03-17 Thread Dave Hansen
On 3/17/20 2:06 PM, Borislav Petkov wrote:
> On Tue, Mar 17, 2020 at 01:35:12PM -0700, Dave Hansen wrote:
>> On 3/17/20 4:18 AM, Borislav Petkov wrote:
>>> Back then when the whole SME machinery started getting mainlined, it
>>> was agreed that for simplicity, clarity and sanity's sake, the terms
>>> denoting encrypted and not-encrypted memory should be "encrypted" and
>>> "decrypted". And the majority of the code sticks to that convention
>>> except those two. So rename them.
>> Don't "unencrypted" and "decrypted" mean different things?
>>
>> Unencrypted to me means "encryption was never used for this data".
>>
>> Decrypted means "this was/is encrypted but here is a plaintext copy".
> Maybe but linguistical semantics is not the point here.
> 
> The idea is to represent a "binary" concept of memory being encrypted
> or memory being not encrypted. And at the time we decided to use
> "encrypted" and "decrypted" for those two things.

Yeah, agreed.  We're basically trying to name "!encrypted".

> Do you see the need to differentiate a third "state", so to speak, of
> memory which was never encrypted?

No, there are just two states.  I just think the "!encrypted" case
should not be called "decrypted".


Re: [PATCH] treewide: Rename "unencrypted" to "decrypted"

2020-03-17 Thread Dave Hansen
On 3/17/20 4:18 AM, Borislav Petkov wrote:
> Back then when the whole SME machinery started getting mainlined, it
> was agreed that for simplicity, clarity and sanity's sake, the terms
> denoting encrypted and not-encrypted memory should be "encrypted" and
> "decrypted". And the majority of the code sticks to that convention
> except those two. So rename them.

Don't "unencrypted" and "decrypted" mean different things?

Unencrypted to me means "encryption was never used for this data".

Decrypted means "this was/is encrypted but here is a plaintext copy".

This, for instance:

> +++ b/kernel/dma/direct.c
> @@ -26,7 +26,7 @@ unsigned int zone_dma_bits __ro_after_init = 24;
>  static inline dma_addr_t phys_to_dma_direct(struct device *dev,
>   phys_addr_t phys)
>  {
> - if (force_dma_unencrypted(dev))
> + if (force_dma_decrypted(dev))
>   return __phys_to_dma(dev, phys);

is referring to DMA that is not and never was encrypted.  It's skipping
the encryption altogether.  There's no act of "decryption" anywhere.

This, on the other hand, seems named wrong to me:

> /*
>  * Macros to add or remove encryption attribute
>  */
> #define pgprot_encrypted(prot)  __pgprot(__sme_set(pgprot_val(prot)))
> #define pgprot_decrypted(prot)  __pgprot(__sme_clr(pgprot_val(prot)))

This seems like it would be better named pgprot_unencrypted().


Re: [PATCH v18 00/24] selftests, powerpc, x86: Memory Protection Keys

2020-01-30 Thread Dave Hansen
On 1/29/20 10:36 PM, Sandipan Das wrote:
> v18:
>   (1) Fixed issues with x86 multilib builds based on
>   feedback from Dave.
>   (2) Moved patch 2 to the end of the series.

These (finally) build and run successfully for me on an x86 system with
protection keys.  Feel free to add my Tested-by, and Acked-by.

FWIW, I don't think these look perfect, but my standards are lower for
selftests/ than for normal kernel code. :)


Re: [PATCH v16 00/23] selftests, powerpc, x86: Memory Protection Keys

2020-01-29 Thread Dave Hansen
On 1/28/20 1:38 AM, Sandipan Das wrote:
> On 27/01/20 9:12 pm, Dave Hansen wrote:
>> How have you tested this patch (and the whole series for that matter)?
>>
> I replaced the second patch with this one and did a build test.
> Till v16, I had tested the whole series (build + run) on both a POWER8
> system (with 4K and 64K page sizes) and a Skylake SP system but for
> x86_64 only.

Do you have any idea why I was seeing x86 build errors and you were not?


Re: [PATCH v16 00/23] selftests, powerpc, x86: Memory Protection Keys

2020-01-27 Thread Dave Hansen
On 1/27/20 2:11 AM, Sandipan Das wrote:
> Hi Dave,
> 
> On 23/01/20 12:15 am, Dave Hansen wrote:
>> Still doesn't build for me:
>>
> I have this patch that hopefully fixes this. My understanding was
> that the vm tests are supposed to be generic but this has quite a
> bit of x86-specific conditional code which complicates things even
> though it is not used by any of the other tests.
> 
> I'm not sure if we should keep x86 multilib build support for these
> selftests but I'll let the maintainers take a call.

How have you tested this patch (and the whole series for that matter)?


Re: [PATCH v16 00/23] selftests, powerpc, x86: Memory Protection Keys

2020-01-22 Thread Dave Hansen
Still doesn't build for me:

> # make
> make --no-builtin-rules ARCH=x86_64 -C ../../../.. headers_install
> make[1]: Entering directory '/home/dave/linux.git'
>   INSTALL ./usr/include
> make[1]: Leaving directory '/home/dave/linux.git'
> make: *** No rule to make target 
> '/home/dave/linux.git/tools/testing/selftests/vm/protection_keys_32', needed 
> by 'all'.  Stop.




Re: [PATCH v16 00/23] selftests, powerpc, x86: Memory Protection Keys

2020-01-17 Thread Dave Hansen
On 1/17/20 4:49 AM, Sandipan Das wrote:
> Memory protection keys enables an application to protect its address
> space from inadvertent access by its own code.
> 
> This feature is now enabled on powerpc and has been available since
> 4.16-rc1. The patches move the selftests to arch neutral directory
> and enhance their test coverage.
> 
> Tested on powerpc64 and x86_64 (Skylake-SP).
I also tested the series.  The 64-bit binary works fine.  But,

This is failing to build the x86 selftests:

make: *** No rule to make target 'protection_keys.c', needed by
'/home/daveh/linux/tools/testing/selftests/x86/protection_keys_32'.  Stop.

I think you just forgot to remove the binary from the x86 Makefile.

Which reminds me: This removes the 32-bit binary.  x86 32-bit binaries
exercise different paths than the 64-bit ones, so we like to have both.
 Although it isn't *essential* it would really be nice to keep the
32-bit binary.


Re: [PATCH v15 00/24] selftests, powerpc, x86: Memory Protection Keys

2020-01-10 Thread Dave Hansen
On 1/10/20 9:38 AM, Aneesh Kumar K.V wrote:
>> v15:
>>  (1) Rebased on top of latest master.
>>  (2) Addressed review comments from Dave Hansen.
>>  (3) Moved code for getting or setting pkey bits to new
>>  helpers. These changes replace patch 7 of v14.
>>  (4) Added a fix which ensures that the correct count of
>>  reserved keys is used across different platforms.
>>  (5) Added a fix which ensures that the correct page size
>>  is used as powerpc supports both 4K and 64K pages.
>>
> Any update on merging this series? Can Intel help with testing this
> series on Skylake server? Possibly merging to -next will result in
> automated 01.org tests?

Could you dump these in a git tree, please?  It will make it a wee bit
easier for me to ship the resulting tree around to a couple different
systems.


Re: [PATCH v15 06/23] selftests/vm/pkeys: Typecast the pkey register

2019-12-18 Thread Dave Hansen
On 12/18/19 12:59 PM, Michal Suchánek wrote:
>> I'd really just rather do %016lx *everywhere* than sprinkle the
>> PKEY_REG_FMTs around.
> Does lx work with u32 without warnings?

Either way, I'd be happy to just make the x86 one u64 to make the whole
thing look more sane.


Re: [PATCH v15 00/24] selftests, powerpc, x86: Memory Protection Keys

2019-12-18 Thread Dave Hansen
On 12/17/19 11:51 PM, Sandipan Das wrote:
> Testing
> ---
> Verified for correctness on powerpc. Need help with x86 testing as I
> do not have access to a Skylake server. Client platforms like Coffee
> Lake do not have the required feature bits set in CPUID.

FWIW, you can get a Skylake Server instance from cloud providers.  I
spooled up an Amazon EC2 instance once to run these tests.  I think it
cost me $0.08.


Re: [PATCH v15 06/23] selftests/vm/pkeys: Typecast the pkey register

2019-12-18 Thread Dave Hansen
On 12/17/19 11:51 PM, Sandipan Das wrote:
>   write_pkey_reg(pkey_reg);
> - dprintf4("pkey_reg now: %08x\n", read_pkey_reg());
> + dprintf4("pkey_reg now: "PKEY_REG_FMT"\n", read_pkey_reg());
>  }
>  
>  #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
> diff --git a/tools/testing/selftests/vm/pkey-x86.h 
> b/tools/testing/selftests/vm/pkey-x86.h
> index 2f04ade8ca9c..5f40901219d3 100644
> --- a/tools/testing/selftests/vm/pkey-x86.h
> +++ b/tools/testing/selftests/vm/pkey-x86.h
> @@ -46,6 +46,8 @@
>  #define HPAGE_SIZE   (1UL<<21)
>  #define PAGE_SIZE4096
>  #define MB   (1<<20)
> +#define pkey_reg_t   u32
> +#define PKEY_REG_FMT "%016x"

How big is the ppc one?

I'd really just rather do %016lx *everywhere* than sprinkle the
PKEY_REG_FMTs around.

BTW, why are you doing a %016lx for a u32?


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Dave Hansen
On 9/3/19 1:01 AM, Anshuman Khandual wrote:
> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.

This looks really cool.  The "only" complication on x86 is the large
number of compile and runtime options that we have.  When this gets
merged, it would be really nice to make sure that the 0day guys have
good coverage of all the configurations.

I'm not _quite_ sure what kind of bugs it will catch on x86 and I
suspect it'll have more value for the other architectures, but it seems
harmless enough.


Re: [RFC V3] mm: Generalize and rename notify_page_fault() as kprobe_page_fault()

2019-06-09 Thread Dave Hansen
On 6/9/19 9:34 PM, Anshuman Khandual wrote:
>> Do you really think this is easier to read?
>>
>> Why not just move the x86 version to include/linux/kprobes.h, and replace
>> the int with bool?
> Will just return bool directly without an additional variable here as 
> suggested
> before. But for the conditional statement, I guess the proposed one here is 
> more
> compact than the x86 one.

FWIW, I don't think "compact" is generally a good goal for code.  Being
readable is 100x more important than being compact and being un-compact
is only a problem when it hurts readability.

For a function like the one in question, having the individual return
conditions clearly commented is way more important than saving 10 lines
of code.


Re: [RFC V3] mm: Generalize and rename notify_page_fault() as kprobe_page_fault()

2019-06-07 Thread Dave Hansen
On 6/7/19 3:34 AM, Anshuman Khandual wrote:
> +static nokprobe_inline bool kprobe_page_fault(struct pt_regs *regs,
> +   unsigned int trap)
> +{
> + int ret = 0;
> +
> + /*
> +  * To be potentially processing a kprobe fault and to be allowed
> +  * to call kprobe_running(), we have to be non-preemptible.
> +  */
> + if (kprobes_built_in() && !preemptible() && !user_mode(regs)) {
> + if (kprobe_running() && kprobe_fault_handler(regs, trap))
> + ret = 1;
> + }
> + return ret;
> +}

Nits: Other than taking the nice, readable, x86 one and globbing it onto
a single line, this looks OK to me.  It does seem a _bit_ silly to go to the
trouble of converting to 'bool' and then using 0/1 and an 'int'
internally instead of true/false and a bool, though.  It's also not a
horrible thing to add a single line comment to this sucker to say:

/* returns true if kprobes handled the fault */

In any case, and even if you don't clean any of this up:

Reviewed-by: Dave Hansen 


[PATCH] [v2] x86/mpx: fix recursive munmap() corruption

2019-04-19 Thread Dave Hansen


Changes from v1:
 * Fix compile errors on UML and non-x86 arches
 * Clarify commit message and Fixes about the origin of the
   bug and add the impact to powerpc / uml / unicore32

--

This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have shown up in 4.20 but wasn't noticed
until now because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

To fix this, call arch_unmap() before __do_munmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.



The long story:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
   freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
   free the VMA itself.

This specific ordering was actually introduced by:

dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

[1216548.787498] BUG: Bad rss-counter state mm:17ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following kernel bugzilla entry:

https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that dd2283f2605 ("mm: mmap: zap pages
with read mmap_sem in munmap") triggered this issue.  While that
commit certainly made the issues easier to hit, I believe the
fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.

Reported-by: Richard Biener 
Reported-by: H.J. Lu 
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Cc: Yang Shi 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Andy Lutomirski 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: linux-ke

Re: [PATCH 0/2] x86, numa: always initialize all possible nodes

2019-04-15 Thread Dave Hansen
On 4/15/19 4:42 AM, Michal Hocko wrote:
>> Friendly ping. I haven't heard any complains so can we route this via
>> tip/x86/mm or should we go via mmotm.
> It seems that Dave is busy. Let's add Andrew. Can we get this [1] merged
> finally, please?

Sorry these slipped through the cracks.

These look sane to me.  Because it pokes around mm/page_alloc.c a bit,
and could impact other architectures, my preference would be for Andrew
to pick these up for -mm.  But, I don't feel that strongly about it.

Reviewed-by: Dave Hansen 


Re: [PATCH v6 0/4] Fix free/allocation of runtime gigantic pages

2019-03-13 Thread Dave Hansen
On 3/7/19 5:20 AM, Alexandre Ghiti wrote:
> This series fixes sh and sparc that did not advertise their gigantic page
> support and then were not able to allocate and free those pages at runtime.
> It renames MEMORY_ISOLATION && COMPACTION || CMA condition into the more
> accurate CONTIG_ALLOC, since it allows the definition of alloc_contig_range
> function.
> Finally, it then fixes the wrong definition of ARCH_HAS_GIGANTIC_PAGE config
> that, without MEMORY_ISOLATION && COMPACTION || CMA defined, did not allow
> architectures to free boottime allocated gigantic pages although unrelated.

Looks good, thanks for all the changes.  For everything generic in the
set, plus the x86 bits:

Acked-by: Dave Hansen 


Re: [PATCH v5 4/4] hugetlb: allow to free gigantic pages regardless of the configuration

2019-03-06 Thread Dave Hansen
On 3/6/19 12:08 PM, Alex Ghiti wrote:
>>>
>>> +    /*
>>> + * Gigantic pages allocation depends on the capability for large
>>> page
>>> + * range allocation. If the system cannot provide
>>> alloc_contig_range,
>>> + * allow users to free gigantic pages.
>>> + */
>>> +    if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
>>> +    spin_lock(&hugetlb_lock);
>>> +    if (count > persistent_huge_pages(h)) {
>>> +    spin_unlock(&hugetlb_lock);
>>> +    return -EINVAL;
>>> +    }
>>> +    goto decrease_pool;
>>> +    }
>> We talked about it during the last round and I don't seen any mention of
>> it here in comments or the changelog: Why is this a goto?  Why don't we
>> just let the code fall through to the "decrease_pool" label?  Why is
>> this new block needed at all?  Can't we just remove the old check and
>> let it be?
> 
> I'll get rid of the goto, I don't know how to justify it properly in a
> comment,
> maybe because it is not necessary.
> This is not a new block, this means exactly the same as before (remember
> gigantic_page_supported() actually meant CONTIG_ALLOC before this series),
> except that now we allow a user to free boottime allocated gigantic pages.
> And no we cannot just remove the check and let it be since it would modify
> the current behaviour, which is to return an error when trying to allocate
> gigantic pages whereas alloc_contig_range is not defined. I thought it was
> clearly commented above, I can try to make it more explicit.

OK, that makes sense.  Could you get some of this in the changelog,
please?  Otherwise this all looks good to me.


Re: [PATCH v5 4/4] hugetlb: allow to free gigantic pages regardless of the configuration

2019-03-06 Thread Dave Hansen
On 3/6/19 11:00 AM, Alexandre Ghiti wrote:
> +static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +   nodemask_t *nodes_allowed)
>  {
>   unsigned long min_count, ret;
>  
> - if (hstate_is_gigantic(h) && !gigantic_page_supported())
> - return h->max_huge_pages;
> + /*
> +  * Gigantic pages allocation depends on the capability for large page
> +  * range allocation. If the system cannot provide alloc_contig_range,
> +  * allow users to free gigantic pages.
> +  */
> + if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
> + spin_lock(&hugetlb_lock);
> + if (count > persistent_huge_pages(h)) {
> + spin_unlock(&hugetlb_lock);
> + return -EINVAL;
> + }
> + goto decrease_pool;
> + }

We talked about it during the last round and I don't see any mention of
it here in comments or the changelog: Why is this a goto?  Why don't we
just let the code fall through to the "decrease_pool" label?  Why is
this new block needed at all?  Can't we just remove the old check and
let it be?


[PATCH 1/5] mm/resource: return real error codes from walk failures

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

walk_system_ram_range() can return an error code either because
*it* failed, or because the 'func' that it calls returned an
error.  The memory hotplug code does the following:

ret = walk_system_ram_range(..., func);
if (ret)
return ret;

and 'ret' makes it out to userspace, eventually.  The problem
is, walk_system_ram_range() failures that result from *it* failing
(as opposed to 'func') return -1.  That leads to a very odd
-EPERM (-1) return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.

This return code is compatible with all the callers that I
audited.

This changes both the generic mm/ and powerpc-specific
implementations to have the same return value.

Signed-off-by: Dave Hansen 
Reviewed-by: Bjorn Helgaas 
Acked-by: Michael Ellerman  (powerpc)
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvd...@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch 
---

 b/arch/powerpc/mm/mem.c |2 +-
 b/kernel/resource.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff -puN 
arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
arch/powerpc/mm/mem.c
--- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2019-02-25 10:56:47.452908034 -0800
+++ b/arch/powerpc/mm/mem.c 2019-02-25 10:56:47.458908034 -0800
@@ -189,7 +189,7 @@ walk_system_ram_range(unsigned long star
struct memblock_region *reg;
unsigned long end_pfn = start_pfn + nr_pages;
unsigned long tstart, tend;
-   int ret = -1;
+   int ret = -EINVAL;
 
for_each_memblock(memory, reg) {
tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2019-02-25 10:56:47.454908034 -0800
+++ b/kernel/resource.c 2019-02-25 10:56:47.459908034 -0800
@@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resourc
 int (*func)(struct resource *, void *))
 {
struct resource res;
-   int ret = -1;
+   int ret = -EINVAL;
 
while (start < end &&
   !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
-   int ret = -1;
+   int ret = -EINVAL;
 
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_


Re: [PATCH v3] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-15 Thread Dave Hansen
> -#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || 
> defined(CONFIG_CMA)
> +#ifdef CONFIG_CONTIG_ALLOC
>  /* The below functions must be run on a range from a single zone. */
>  extern int alloc_contig_range(unsigned long start, unsigned long end,
> unsigned migratetype, gfp_t gfp_mask);
> -extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
>  #endif
> +extern void free_contig_range(unsigned long pfn, unsigned int nr_pages);

There's a lot of stuff going on in this patch.  Adding/removing config
options.  Please get rid of these superfluous changes or at least break
them out.

>  #ifdef CONFIG_CMA
>  /* CMA stuff */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 25c71eb8a7db..138a8df9b813 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -252,12 +252,17 @@ config MIGRATION
> pages as migration can relocate pages to satisfy a huge page
> allocation instead of reclaiming.
>  
> +
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>   bool

Like this. :)

>  config ARCH_ENABLE_THP_MIGRATION
>   bool
>  
> +config CONTIG_ALLOC
> + def_bool y
> + depends on (MEMORY_ISOLATION && COMPACTION) || CMA
> +
>  config PHYS_ADDR_T_64BIT
>   def_bool 64BIT

Please think carefully through the Kconfig dependencies.  'select' is
*not* the same as 'depends on'.

This replaces a bunch of arch-specific "select ARCH_HAS_GIGANTIC_PAGE"
with a 'depends on'.  I *think* that ends up being OK, but it absolutely
needs to be addressed in the changelog about why *you* think it is OK
and why it doesn't change the functionality of any of the patched
architectures.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index afef61656c1e..e686c92212e9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1035,7 +1035,6 @@ static int hstate_next_node_to_free(struct hstate *h, 
> nodemask_t *nodes_allowed)
>   ((node = hstate_next_node_to_free(hs, mask)) || 1); \
>   nr_nodes--)
>  
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void destroy_compound_gigantic_page(struct page *page,
>   unsigned int order)
>  {

What's the result of this #ifdef removal?  A universally larger kernel
even for architectures that do not support runtime gigantic page
alloc/free?  That doesn't seem like a good thing.

> @@ -1058,6 +1057,12 @@ static void free_gigantic_page(struct page *page, 
> unsigned int order)
>   free_contig_range(page_to_pfn(page), 1 << order);
>  }
>  
> +static inline bool gigantic_page_runtime_allocation_supported(void)
> +{
> + return IS_ENABLED(CONFIG_CONTIG_ALLOC);
> +}

Why bother having this function?  Why don't the callers just check the
config option directly?
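The call site can just say what it means directly:

	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC))
		...

which is no harder to read than the wrapper name.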

> +#ifdef CONFIG_CONTIG_ALLOC
>  static int __alloc_gigantic_page(unsigned long start_pfn,
>   unsigned long nr_pages, gfp_t gfp_mask)
>  {
> @@ -1143,22 +1148,15 @@ static struct page *alloc_gigantic_page(struct hstate 
> *h, gfp_t gfp_mask,
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
>  static void prep_compound_gigantic_page(struct page *page, unsigned int 
> order);
>  
> -#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
> -static inline bool gigantic_page_supported(void) { return false; }
> +#else /* !CONFIG_CONTIG_ALLOC */
>  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>   int nid, nodemask_t *nodemask) { return NULL; }
> -static inline void free_gigantic_page(struct page *page, unsigned int order) 
> { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> - unsigned int order) { }
>  #endif
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>   int i;
>  
> - if (hstate_is_gigantic(h) && !gigantic_page_supported())
> - return;

I don't get the point of removing this check.  Logically, this reads as
checking if the architecture supports gigantic hstates and has nothing
to do with allocation.

>   h->nr_huge_pages--;
>   h->nr_huge_pages_node[page_to_nid(page)]--;
>   for (i = 0; i < pages_per_huge_page(h); i++) {
> @@ -2276,13 +2274,20 @@ static int adjust_pool_surplus(struct hstate *h, 
> nodemask_t *nodes_allowed,
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long 
> count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count,
>   nodemask_t *nodes_allowed)
>  {
>   unsigned long min_count, ret;
>  
> - if (hstate_is_gigantic(h) && !gigantic_page_supported())
> - return h->max_huge_pages;
> + if (hstate_is_gigantic(h) &&
> + !gigantic_page_runtime_allocation_supported()) {

The indentation here is wrong and reduces readability.  Needs to be like
this:

if (hstate_is_gi

Re: [PATCH v2] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-13 Thread Dave Hansen
> -#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || 
> defined(CONFIG_CMA)
> +#ifdef CONFIG_COMPACTION_CORE
>  static __init int gigantic_pages_init(void)
>  {
>   /* With compaction or CMA we can allocate gigantic pages at runtime */
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ac474a61be37..8fecd3ea5563 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -207,8 +207,9 @@ config HUGETLB_PAGE
>  config MEMFD_CREATE
>   def_bool TMPFS || HUGETLBFS
>  
> -config ARCH_HAS_GIGANTIC_PAGE
> +config COMPACTION_CORE
>   bool
> + default y if (MEMORY_ISOLATION && MIGRATION) || CMA

This takes a hard dependency (#if) and turns it into a Kconfig *default*
that can be overridden.  That seems like trouble.

Shouldn't it be:

config COMPACTION_CORE
def_bool y
depends on (MEMORY_ISOLATION && MIGRATION) || CMA

?


Re: [PATCH v3 2/2] mm: be more verbose about zonelist initialization

2019-02-13 Thread Dave Hansen
On 2/13/19 1:43 AM, Michal Hocko wrote:
> 
> We have seen several bugs where zonelists have not been initialized
> properly and it is not really straightforward to track those bugs down.
> One way to help a bit at least is to dump zonelists of each node when
> they are (re)initialized.

Were you thinking of boot-time bugs and crashes, or just stuff going
wonky after boot?

We don't have the zonelists dumped in /proc anywhere, do we?  Would that
help?


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Dave Hansen
On 1/24/19 6:17 AM, Michal Hocko wrote:
> and nr_cpus set to 4. The underlying reason is tha the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restrics.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.

This looks OK to me.

Could we add a few DEBUG_VM checks that *look* for these invalid
zonelists?  Or, would our existing list debugging have caught this?

Basically, is this bug also a sign that we need better debugging around
this?
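For example, something roughly like this at the end of zonelist setup
(just a sketch, field names from memory):

	#ifdef CONFIG_DEBUG_VM
		/* Every node we just built should have a usable fallback zone: */
		VM_WARN_ON(!zonelist_zone(&pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs[0]));
	#endif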


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-27 Thread Dave Hansen
On 11/27/18 3:57 AM, Florian Weimer wrote:
> I would have expected something that translates PKEY_DISABLE_WRITE |
> PKEY_DISABLE_READ into PKEY_DISABLE_ACCESS, and also accepts
> PKEY_DISABLE_ACCESS | PKEY_DISABLE_READ, for consistency with POWER.
> 
> (My understanding is that PKEY_DISABLE_ACCESS does not disable all
> access, but produces execute-only memory.)

Correct, it disables all data access, but not execution.



Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data

2018-10-29 Thread Dave Hansen
On 10/29/18 2:55 PM, Michael Sammler wrote:
>> PKRU getting reset on signals, and the requirement now that it *can't*
>> be changed if you make syscalls probably needs to get thought about very
>> carefully before we do this, though.
> I am not sure, whether I follow you. Are you saying, that PKRU is
> currently not guaranteed to be preserved across system calls?
> This would make it very hard to use protection keys if libc does not
> save and restore the PKRU before/after systemcalls (and I am not aware
> of this).

It's preserved *across* system calls, but you have to be a bit careful
using it _inside_ the kernel.  We could context switch off to something
else, and not think that we need to restore PKRU until _just_ before we
return to userspace.

> Or do you mean, that the kernel might want to use the PKRU register for
> its own purposes while it is executing?

That, or we might keep another process's PKRU state in the register if
we don't think anyone is using it.  Now that I think about it, I think
Rik (cc'd, who was working on those patches) *had* to explicitly restore
PKRU because it's hard to tell where we might do a copy_to/from_user()
and need it.

> Then the solution you proposed in another email in this thread would
> work: instead of providing the seccomp filter with the current value of
> the PKRU (which might be different from what the user space expects) use
> the user space value which must have been saved somewhere (otherwise it
> would not be possible to restore it).

Yep, that's the worst-case scenario: either fetch PKRU out of the XSAVE
buffer (current->fpu->something), or just restore it using an existing
API before doing RDPKRU.

But, that's really an implementation detail.  The effect on the ABI and
how this might constrain future pkeys use is my bigger worry.

I'd also want to make sure that your specific use-case is compatible
with all the oddities of pkeys, like the 'clone' and signal behavior.
Some of that is spelled out here:

http://man7.org/linux/man-pages/man7/pkeys.7.html

One thing that's a worry is that we have never said that you *can't*
write to arbitrary permissions in PKRU.  I can totally see some really
paranoid code saying, "I'm about to do something risky, so I'll turn off
access to *all* pkeys", or "turn off all access except my current
stack".  If they did that, they might also inadvertently disable access
to certain seccomp-restricted syscalls.

We can fix that up by documenting restrictions like "code should never
change the access rights of any pkey other than those that it
allocated", but that doesn't help any old code (of which I hope there is
relatively little).



Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data

2018-10-29 Thread Dave Hansen
On 10/29/18 9:48 AM, Jann Horn wrote:
> On Mon, Oct 29, 2018 at 5:37 PM Dave Hansen  wrote:
>> I'm not sure this is a great use for PKRU.  I *think* the basic problem
>> is that you want to communicate some rights information down into a
>> filter, and you want to communicate it with PKRU.  While it's handy to
>> have an extra register that nobody (generally) mucks with, I'm not quite
>> convinced that we want to repurpose it this way.
> 
> That's not how I understand it; I believe that the context is probably
> https://arxiv.org/pdf/1801.06822.pdf ?
> My understanding is that PKRU is used for lightweight in-process
> sandboxing, and to extend this sandbox protection to the syscall
> interface, it is necessary to expose PKRU state to seccomp filters.
> In other words, this isn't using PKRU exclusively for passing rights
> into a filter, but it has to use PKRU anyway.

PKRU gives information about rights to various bits of application data.
 From that, a seccomp filter can infer the context, and thus the ability
for the code to call a given syscall at a certain point in time.

This makes PKRU an opt-in part of the syscall ABI, which is pretty
interesting.  We _could_ do the same kind of thing with any callee-saved
general purpose register, but PKRU is particularly attractive because
there is only one instruction that writes to it (well, outside of
XSAVE*), and random library code is very unlikely at this point to be
using it.

PKRU getting reset on signals, and the requirement now that it *can't*
be changed if you make syscalls probably needs to get thought about very
carefully before we do this, though.


Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data

2018-10-29 Thread Dave Hansen
On 10/29/18 10:02 AM, Michael Sammler wrote:
>>> Also, I'm not sure the kernel provides the PKRU guarantees you want at
>>> the moment.  Our implementation *probably* works, but it's mostly by
>>> accident.
> I don't know, which guarantees about the PKRU are provided at the
> moment, but the only guarantee needed for this patch is, that the kernel
> does not change the bits of the PKRU register, which belong to pkeys
> allocated by the user program, between the syscall entry and the call to
> secure_computing(). Is there are use case where the kernel would like to
> modify these bits of the PKRU?

We've been talking about doing more lax save/restore of the XSAVE
content (PKRU is part of this content).  We would, for instance, only
restore it when returning to userspace, but PKRU might not be up-to-date
with the value in current->fpu.

It's not a deal-breaker with your approach, it's just something to be
careful of and make sure PKRU is up-to-date before you go use it.


Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data

2018-10-29 Thread Dave Hansen
On 10/29/18 9:25 AM, Kees Cook wrote:
> On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler  wrote:
>> Add the current value of an architecture specific protection keys
>> register (currently PKRU on x86) to data available for seccomp-bpf
>> programs to work on. This allows filters based on the currently
>> enabled protection keys.

How does the current "assignment" of protection keys to the various uses
get communicated to the filter?

I'm not sure this is a great use for PKRU.  I *think* the basic problem
is that you want to communicate some rights information down into a
filter, and you want to communicate it with PKRU.  While it's handy to
have an extra register that nobody (generally) mucks with, I'm not quite
convinced that we want to repurpose it this way.

Also, I'm not sure the kernel provides the PKRU guarantees you want at
the moment.  Our implementation *probably* works, but it's mostly by
accident.


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Dave Hansen
On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> It is more than just memmaps (e.g. forking udev process doing memory
> onlining also needs memory) but yes, the main idea is to make the
> onlining synchronous with hotplug.

That's a good theoretical concern.

But, is it a problem we need to solve in practice?



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread Dave Hansen
> How should a policy in user space look like when new memory gets added
> - on s390x? Not onlining paravirtualized memory is very wrong.

Because we're going to balloon it away in a moment anyway?

We have auto-onlining.  Why isn't that being used on s390?


> So the type of memory is very important here to have in user space.
> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
> to decide whether to online memory and how to online memory is wrong.
> Only some specific memory types (which I call "normal") are to be
> handled by user space.
> 
> For the other ones, we exactly know what to do:
> - standby? don't online

I think you're horribly conflating the software's desire for what the
state should be with the hardware itself.

>> As for the OOM issues, that sounds like something we need to fix by
>> refusing to do (or delaying) hot-add operations once we consume too much
>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>> userspace to hurry thing along.
> 
> That is a moving target and doing that automatically is basically
> impossible.

Nah.  We know how much metadata we've allocated.  We know how much
ZONE_NORMAL we are eating.  We can *easily* add something to
add_memory() that just sleeps until the ratio is not out-of-whack.
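Conceptually something like this in the hot-add path (the helpers and
the 2% ratio are made up, just to show the shape of it):

	/* Sketch: throttle hot-add while memmap[] has eaten too much ZONE_NORMAL. */
	static void wait_for_memmap_headroom(void)
	{
		while (nr_memmap_pages() > nr_normal_zone_pages() / 50)
			msleep(100);
	}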

> You can add a lot of memory to the movable zone and
> everything is fine. Suddenly a lot of processes are started - boom.
> MOVABLE should only every be used if you expect an unplug. And for
> paravirtualized devices, a "typical" unplug does not exist.

No, it's more complicated than that.  People use MOVABLE, for instance,
to allow more consistent huge page allocations.  It's certainly not just
hot-remove.


Re: [PATCH v9 0/3] powerpc: Detection and scheduler optimization for POWER9 bigcore

2018-10-01 Thread Dave Hansen
On 10/01/2018 06:16 AM, Gautham R. Shenoy wrote:
> 
> Patch 3: Creates a pair of sysfs attributes named
> /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings
> and
> /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings_list
> exposing the small-core siblings that share the L1 cache
> to the userspace.

I really don't think you've justified the existence of a new user/kernel
interface here.  We already have information about which threads share
L1 caches in here:

/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list

The only question would be if anything would break because it assumes
that all SMT siblings share all caches.  But, that breaks whether your new
interface is there or not; it's old software that we care about.


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-09-28 Thread Dave Hansen
It's really nice if these kinds of things are broken up.  First, replace
the old want_memblock parameter, then add the parameter to the
__add_pages() calls.

> +/*
> + * NONE: No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
> + *   (e.g. ACPI DIMMs) that should be onlined either automatically
> + *   (memhp_auto_online) or manually by user space to select a
> + *   specific zone.
> + *   Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should only
> + *   be onlined on demand by user space (e.g. standby memory on
> + *   s390x), but never automatically by the kernel.
> + *   Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *   paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *   always automatically get onlined. Memory will be unplugged
> + *   using ballooning, not by relying on the MOVABLE ZONE.
> + *   Not applicable to memhp_auto_online.
> + */
> +enum {
> + MEMORY_BLOCK_NONE,
> + MEMORY_BLOCK_NORMAL,
> + MEMORY_BLOCK_STANDBY,
> + MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a check
against running on s390 in userspace to implement a _typical_ s390
policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too much
ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry thing along.

So, to my eye, we need:

 +enum {
 +  MEMORY_BLOCK_NONE,
 +  MEMORY_BLOCK_STANDBY, /* the default */
 +  MEMORY_BLOCK_AUTO_ONLINE,
 +};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.

>  struct memory_block {
>   unsigned long start_section_nr;
>   unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
>   int (*phys_callback)(struct memory_block *);
>   struct device dev;
>   int nid;/* NID for this memory block */
> + int type;   /* type of this memory block */
>  };

Shouldn't we just be creating and using an actual named enum type?


Re: [PATCH v8 0/3] powerpc: Detection and scheduler optimization for POWER9 bigcore

2018-09-25 Thread Dave Hansen
On 09/22/2018 04:03 AM, Gautham R Shenoy wrote:
> Without this patchset, the SMT domain would be defined as the group of
> threads that share L2 cache.

Could you try to make a more clear, concise statement about the current
state of the art vs. what you want it to be?  Right now, the sched
domains do something like this in terms of ordering:

1. SMT siblings
2. Caches
3. NUMA

It sounds like you don't want SMT siblings to be the things that we use,
right?  Because some siblings share caches and some do not.  Right?  You
want something like this:

1. SMT siblings (sharing L1)
2. SMT siblings (sharing L2)
3. Other caches
4. NUMA


Re: [PATCH v8 0/3] powerpc: Detection and scheduler optimization for POWER9 bigcore

2018-09-20 Thread Dave Hansen
On 09/20/2018 10:22 AM, Gautham R. Shenoy wrote:
>  -
>  |L1 Cache   |
>--
>|L2| | | |  |
>|  |  0  |  2  |  4  |  6   |Small Core0
>|C | | | |  |
> Big|a --
> Core   |c | | | |  |
>|h |  1  |  3  |  5  |  7   | Small Core1
>|e | | | |  |
>-
> | L1 Cache   |
> --

The scheduler already knows about shared caches.  Could you elaborate on
how this is different from the situation today where we have multiple
cores sharing an L2/L3?

Adding the new sysfs stuff seems like overkill if that's all that you
are trying to do.


Re: [PATCH v14 22/22] selftests/vm: test correct behavior of pkey-0

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> Ensure pkey-0 is allocated on start.  Ensure pkey-0 can be attached
> dynamically in various modes, without failures.  Ensure pkey-0 can be
> freed and allocated.
> 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |   66 
> +-
>  1 files changed, 64 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 569faf1..156b449 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -999,6 +999,67 @@ void close_test_fds(void)
>   return *ptr;
>  }
>  
> +void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey)
> +{
> + int i, err;
> + int max_nr_pkey_allocs;
> + int alloced_pkeys[NR_PKEYS];
> + int nr_alloced = 0;
> + int newpkey;
> + long size;
> +
> + assert(pkey_last_malloc_record);
> + size = pkey_last_malloc_record->size;
> + /*
> +  * This is a bit of a hack.  But mprotect() requires
> +  * huge-page-aligned sizes when operating on hugetlbfs.
> +  * So, make sure that we use something that's a multiple
> +  * of a huge page when we can.
> +  */
> + if (size >= HPAGE_SIZE)
> + size = HPAGE_SIZE;
> +
> +
> + /* allocate every possible key and make sure key-0 never got allocated 
> */
> + max_nr_pkey_allocs = NR_PKEYS;
> + for (i = 0; i < max_nr_pkey_allocs; i++) {
> + int new_pkey = alloc_pkey();
> + assert(new_pkey != 0);

Missed these earlier.  This needs to be pkey_assert().  We don't want
these tests to ever _actually_ crash.
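i.e.:

	pkey_assert(new_pkey != 0);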

> + /* attach key-0 in various modes */
> + err = sys_mprotect_pkey(ptr, size, PROT_READ, 0);
> + pkey_assert(!err);
> + err = sys_mprotect_pkey(ptr, size, PROT_WRITE, 0);
> + pkey_assert(!err);
> + err = sys_mprotect_pkey(ptr, size, PROT_EXEC, 0);
> + pkey_assert(!err);
> + err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE, 0);
> + pkey_assert(!err);
> + err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE|PROT_EXEC, 0);
> + pkey_assert(!err);

This is all fine.

> + /* free key-0 */
> + err = sys_pkey_free(0);
> + pkey_assert(!err);

This part is called out as undefined behavior in the manpage:

>An application should not call pkey_free() on any protection key
>which has been assigned to an address range by pkey_mprotect(2) and
>which is still in use.  The behavior in this case is undefined and
>may result in an error.

I don't think we should be testing for undefined behavior.

> + newpkey = sys_pkey_alloc(0, 0x0);
> + assert(newpkey == 0);
> +}
> +
>  void test_read_of_write_disabled_region(int *ptr, u16 pkey)
>  {
>   int ptr_contents;
> @@ -1144,10 +1205,10 @@ void 
> test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
>  void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
>  {
>   int err;
> - int i = get_start_key();
> + int i;
>  
>   /* Note: 0 is the default pkey, so don't mess with it */
> - for (; i < NR_PKEYS; i++) {
> + for (i=1; i < NR_PKEYS; i++) {
>   if (pkey == i)
>   continue;

This seems to be randomly reverting earlier changes.


Re: [PATCH v14 20/22] selftests/vm: testcases must restore pkey-permissions

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> Generally the signal handler restores the state of the pkey register
> before returning. However there are times when the read/write operation
> can legitamely fail without invoking the signal handler.  Eg: A
> sys_read() operaton to a write-protected page should be disallowed.  In
> such a case the state of the pkey register is not restored to its
> original state.  Test cases may not remember to restoring the key
> register state. During cleanup generically restore the key permissions.

This would, indeed be a good thing to do for a well-behaved application.

But, for selftests, why does it matter what state we leave the key in?
Doesn't the test itself need to establish permissions?  Don't we *do*
that at pkey_alloc() anyway?

What problem does this solve?

> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 8a6afdd..ea3cf04 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -1476,8 +1476,13 @@ void run_tests_once(void)
>   pkey_tests[test_nr](ptr, pkey);
>   dprintf1("freeing test memory: %p\n", ptr);
>   free_pkey_malloc(ptr);
> +
> + /* restore the permission on the key after use */
> + pkey_access_allow(pkey);
> + pkey_write_allow(pkey);
>   sys_pkey_free(pkey);
>  
> +
>   dprintf1("pkey_faults: %d\n", pkey_faults);
>   dprintf1("orig_pkey_faults: %d\n", orig_pkey_faults);





Re: [PATCH v14 16/22] selftests/vm: fix an assertion in test_pkey_alloc_exhaust()

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> The maximum number of keys that can be allocated has to
> take into consideration, that some keys are reserved by
> the architecture for   specific   purpose. Hence cannot
> be allocated.

Back to incomplete sentences, I see. :)

How about:

Some pkeys which are valid to the hardware are not available
for application use.  Those can not be allocated.

test_pkey_alloc_exhaust() tries to account for these but
___FILL_IN_WHAT_IT_DID_WRONG.  We fix this by
___FILL_IN_WAY_IT_WAS_FIXED.

> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index d27fa5e..67d841e 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -1171,15 +1171,11 @@ void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
>   pkey_assert(i < NR_PKEYS*2);
>  
>   /*
> -  * There are 16 pkeys supported in hardware.  Three are
> -  * allocated by the time we get here:
> -  *   1. The default key (0)
> -  *   2. One possibly consumed by an execute-only mapping.
> -  *   3. One allocated by the test code and passed in via
> -  *  'pkey' to this function.
> -  * Ensure that we can allocate at least another 13 (16-3).
> +  * There are NR_PKEYS pkeys supported in hardware. arch_reserved_keys()
> +  * are reserved. And one key is allocated by the test code and passed
> +  * in via 'pkey' to this function.
>*/
> - pkey_assert(i >= NR_PKEYS-3);
> + pkey_assert(i >= (NR_PKEYS-arch_reserved_keys()-1));
>  
>   for (i = 0; i < nr_allocated_pkeys; i++) {
>   err = sys_pkey_free(allocated_pkeys[i]);

You also killed my nice, shiny, new comment.  You made an attempt to
make up for it two patches ago, but it pales in comparison to mine.  The
fact that I wrote it only a few short week ago makes me very attached to
it, kinda like a new puppy.  I don't want to throw it to the wolves
quite yet.  So, please preserve as much of it as possible, even if it
has to live in the x86 header.

For bonus points, axe this comment in the same patch where you create the
x86 header comment, to make review easier.


Re: [PATCH v14 15/22] selftests/vm: powerpc implementation to check support for pkey

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> -static inline int cpu_has_pku(void)
> +static inline bool is_pkey_supported(void)
>  {
> - return 1;
> + /*
> +  * No simple way to determine this.
> +  * Lets try allocating a key and see if it succeeds.
> +  */
> + int ret = sys_pkey_alloc(0, 0);
> +
> + if (ret > 0) {
> + sys_pkey_free(ret);
> + return true;
> + }
> + return false;
>  }

This actually works on x86 too.

>  static inline int arch_reserved_keys(void)
> diff --git a/tools/testing/selftests/vm/pkey-x86.h 
> b/tools/testing/selftests/vm/pkey-x86.h
> index f5d0ff2..887acf2 100644
> --- a/tools/testing/selftests/vm/pkey-x86.h
> +++ b/tools/testing/selftests/vm/pkey-x86.h
> @@ -105,7 +105,7 @@ static inline void __cpuid(unsigned int *eax, unsigned 
> int *ebx,
>  #define X86_FEATURE_PKU(1<<3) /* Protection Keys for Userspace */
>  #define X86_FEATURE_OSPKE  (1<<4) /* OS Protection Keys Enable */
>  
> -static inline int cpu_has_pku(void)
> +static inline bool is_pkey_supported(void)
>  {
>   unsigned int eax;
>   unsigned int ebx;
> @@ -118,13 +118,13 @@ static inline int cpu_has_pku(void)
>  
>   if (!(ecx & X86_FEATURE_PKU)) {
>   dprintf2("cpu does not have PKU\n");
> - return 0;
> + return false;
>   }
>   if (!(ecx & X86_FEATURE_OSPKE)) {
>   dprintf2("cpu does not have OSPKE\n");
> - return 0;
> + return false;
>   }
> - return 1;
> + return true;
>  }



>  #define XSTATE_PKEY_BIT  (9)
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 18e1bb7..d27fa5e 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -1389,8 +1389,8 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, 
> u16 pkey)
>   int size = PAGE_SIZE;
>   int sret;
>  
> - if (cpu_has_pku()) {
> - dprintf1("SKIP: %s: no CPU support\n", __func__);
> + if (is_pkey_supported()) {
> + dprintf1("SKIP: %s: no CPU/kernel support\n", __func__);
>   return;
>   }

I actually wanted a kernel-independent check, based entirely on CPUID.
That's specifically why I said "no CPU support".

If you want to do this, please do:

/* powerpc has no enumeration, just assume it has support: */
static inline bool cpu_has_pku(void) { return true; }

if (cpu_has_pku()) {
dprintf1("SKIP: %s: no CPU support\n", __func__);
return
}

if (kernel_pkey_supported()) {
dprintf1("SKIP: %s: no kernel support\n", __func__);
return;
}




Re: [PATCH v14 14/22] selftests/vm: Introduce generic abstractions

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> Introduce generic abstractions and provide architecture
> specific implementation for the abstractions.

I really wanted to see these two things separated:
1. introduce abstractions
2. introduce ppc implementation

But, I guess most of it is done except for the siginfo stuff.

>  #if defined(__i386__) || defined(__x86_64__) /* arch */
>  #include "pkey-x86.h"
> +#elif defined(__powerpc64__) /* arch */
> +#include "pkey-powerpc.h"
>  #else /* arch */
>  #error Architecture not supported
>  #endif /* arch */
> @@ -186,7 +191,16 @@ static inline int open_hugepage_file(int flag)
>  
>  static inline int get_start_key(void)
>  {
> - return 1;
> + return 0;
> +}

How does this not now break x86?
>  #endif /* _PKEYS_X86_H */
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 304f74f..18e1bb7 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -197,17 +197,18 @@ void dump_mem(void *dumpme, int len_bytes)
>  
>  int pkey_faults;
>  int last_si_pkey = -1;
> +void pkey_access_allow(int pkey);

Please just move the function.

>  void signal_handler(int signum, siginfo_t *si, void *vucontext)
>  {
>   ucontext_t *uctxt = vucontext;
>   int trapno;
>   unsigned long ip;
>   char *fpregs;
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
>   pkey_reg_t *pkey_reg_ptr;
> - u64 siginfo_pkey;
> +#endif /* defined(__i386__) || defined(__x86_64__) */
> + u32 siginfo_pkey;
>   u32 *si_pkey_ptr;
> - int pkey_reg_offset;
> - fpregset_t fpregset;
>  
>   dprint_in_signal = 1;
>   dprintf1("===SIGSEGV\n");
> @@ -217,12 +218,14 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>  
>   trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
>   ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
> - fpregset = uctxt->uc_mcontext.fpregs;
> - fpregs = (void *)fpregset;
> + fpregs = (char *) uctxt->uc_mcontext.fpregs;
>  
>   dprintf2("%s() trapno: %d ip: 0x%016lx info->si_code: %s/%d\n",
>   __func__, trapno, ip, si_code_str(si->si_code),
>   si->si_code);
> +
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
> +
>  #ifdef __i386__
>   /*
>* 32-bit has some extra padding so that userspace can tell whether
> @@ -230,20 +233,21 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>* state.  We just assume that it is here.
>*/
>   fpregs += 0x70;
> -#endif
> - pkey_reg_offset = pkey_reg_xstate_offset();
> - pkey_reg_ptr = (void *)(&fpregs[pkey_reg_offset]);
> +#endif /* __i386__ */
>  
> - dprintf1("siginfo: %p\n", si);
> - dprintf1(" fpregs: %p\n", fpregs);
> + pkey_reg_ptr = (void *)(&fpregs[pkey_reg_xstate_offset()]);

There are unnecessary parentheses here.

Also, why are you bothering to mess with this?  This is inside the x86
#ifdef, right?

>   /*
> -  * If we got a PKEY fault, we *HAVE* to have at least one bit set in
> +  * If we got a key fault, we *HAVE* to have at least one bit set in
>* here.
>*/
>   dprintf1("pkey_reg_xstate_offset: %d\n", pkey_reg_xstate_offset());
>   if (DEBUG_LEVEL > 4)
>   dump_mem(pkey_reg_ptr - 128, 256);
>   pkey_assert(*pkey_reg_ptr);
> +#endif /* defined(__i386__) || defined(__x86_64__) */
> +
> + dprintf1("siginfo: %p\n", si);
> + dprintf1(" fpregs: %p\n", fpregs);
>  
>   if ((si->si_code == SEGV_MAPERR) ||
>   (si->si_code == SEGV_ACCERR) ||
> @@ -252,22 +256,28 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>   exit(4);
>   }
>  
> - si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
> + si_pkey_ptr = siginfo_get_pkey_ptr(si);
>   dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
> - dump_mem((u8 *)si_pkey_ptr - 8, 24);
> + dump_mem(si_pkey_ptr - 8, 24);

You removed the cast here, why?  That changes the pointer math.
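
To spell out the difference (a sketch, with si_pkey_ptr staying the u32 *
it is above): pointer arithmetic scales by the pointee size, so

	dump_mem((u8 *)si_pkey_ptr - 8, 24);	/* old: backs up 8 bytes */
	dump_mem(si_pkey_ptr - 8, 24);		/* new: backs up 8 u32s, i.e. 32 bytes */

and the dump now starts at a different address.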

Can we merge this as-is?  No, I do not think we can.  If it were _just_
the #ifdefs, we could let it pass, but there are a bunch of rough spots,
not just the #ifdefs.



Re: [PATCH v14 13/22] selftests/vm: generic cleanup

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> cleanup the code to satisfy coding styles.
> 
> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |   64 +
>  1 files changed, 43 insertions(+), 21 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index f50cce8..304f74f 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -4,7 +4,7 @@
>   *
>   * There are examples in here of:
>   *  * how to set protection keys on memory
> - *  * how to set/clear bits in pkey registers (the rights register)
> + *  * how to set/clear bits in Protection Key registers (the rights register)

Huh?  Which coding style says that we can't say "pkey"?

>   *  * how to handle SEGV_PKUERR signals and extract pkey-relevant
>   *information from the siginfo
>   *
> @@ -13,13 +13,18 @@
>   *   prefault pages in at malloc, or not
>   *   protect MPX bounds tables with protection keys?
>   *   make sure VMA splitting/merging is working correctly
> - *   OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune 
> to pkeys
> - *   look for pkey "leaks" where it is still set on a VMA but "freed" back 
> to the kernel
> - *   do a plain mprotect() to a mprotect_pkey() area and make sure the pkey 
> sticks
> + *   OOMs can destroy mm->mmap (see exit_mmap()),
> + *   so make sure it is immune to pkeys
> + *   look for pkey "leaks" where it is still set on a VMA
> + *but "freed" back to the kernel
> + *   do a plain mprotect() to a mprotect_pkey() area and make
> + *sure the pkey sticks

This makes it substantially harder to read.  That's not acceptable, even if
you did move it under 80 columns.

>   * Compile like this:
> - *   gcc  -o protection_keys-O2 -g -std=gnu99 -pthread -Wall 
> protection_keys.c -lrt -ldl -lm
> - *   gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall 
> protection_keys.c -lrt -ldl -lm
> + *   gcc  -o protection_keys-O2 -g -std=gnu99
> + *-pthread -Wall protection_keys.c -lrt -ldl -lm
> + *   gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99
> + *-pthread -Wall protection_keys.c -lrt -ldl -lm
>   */

Why was this on one line?  Because it was easier to copy and paste.
Please leave it on one line, CodingStyle be damned.

>  #define _GNU_SOURCE
>  #include 
> @@ -263,10 +268,12 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>   __read_pkey_reg());
>   dprintf1("pkey from siginfo: %jx\n", siginfo_pkey);
>   *(u64 *)pkey_reg_ptr = 0x;
> - dprintf1("WARNING: set PRKU=0 to allow faulting instruction to 
> continue\n");
> + dprintf1("WARNING: set PKEY_REG=0 to allow faulting instruction "
> + "to continue\n");

It's actually totally OK to let printk strings go over 80 columns.

>   pkey_faults++;
>   dprintf1("<<<<==\n");
>   dprint_in_signal = 0;
> + return;
>  }

Now we're just being silly.

>  
>  int wait_all_children(void)
> @@ -384,7 +391,7 @@ void pkey_disable_set(int pkey, int flags)
>  {
>   unsigned long syscall_flags = 0;
>   int ret;
> - int pkey_rights;
> + u32 pkey_rights;

This is not a CodingStyle cleanup.  Shouldn't this be the pkey_reg_t that you
introduced earlier in the series?

> -int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
> +int sys_pkey_alloc(unsigned long flags, u64 init_val)
>  {

Um, this is actually an 'unsigned long' in the ABI.

Can you go back through this and actually make sure that these are real
coding style cleanups?


Re: [PATCH v14 12/22] selftests/vm: pkey register should match shadow pkey

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> expected_pkey_fault() is comparing the contents of pkey
> register with 0. This may not be true all the time. There
> could be bits set by default by the architecture
> which can never be changed. Hence compare the value against
> shadow pkey register, which is supposed to track the bits
> accurately all throughout

This is getting dangerously close to full sentences that actually
describe the patch.  You forgot a period, but much this is a substantial
improvement over earlier parts of the series.  Thanks for writing this,
seriously.

> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 2e448e0..f50cce8 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -913,10 +913,10 @@ void expected_pkey_fault(int pkey)
>   pkey_assert(last_si_pkey == pkey);
>  
>   /*
> -  * The signal handler shold have cleared out PKEY register to let the
> +  * The signal handler should have cleared out pkey-register to let the
>* test program continue.  We now have to restore it.
>*/

... while I appreciate the spelling corrections, and I would totally ack
a patch that fixed them in one fell swoop, could we please segregate the
random spelling corrections from code fixes unless you touch those lines
otherwise?

> - if (__read_pkey_reg() != 0)
> + if (__read_pkey_reg() != shadow_pkey_reg)
>   pkey_assert(0);
>  
>   __write_pkey_reg(shadow_pkey_reg);

I know this is a one-line change, but I don't fully understand it.

On x86, if we take a pkey fault, we clear PKRU entirely (via the
on-stack XSAVE state that is restored at sigreturn) which allows the
faulting instruction to resume and execute normally.  That's what this
check is looking for: Did the signal handler clear PKRU?

Now, you're saying that powerpc might not clear it.  That makes sense.

While PKRU's state here is obvious, it isn't patently obvious to me what
shadow_pkey_reg's state is.  In fact, looking at it, I don't see the
signal handler manipulating the shadow.  So, how can this patch work?
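
If the intent is for this assertion to hold, I'd expect the handler to
update the shadow whenever it rewrites the register, roughly like the
sketch below (x86 naming; whatever value powerpc resets the register to,
the shadow would have to follow it):

	/* in signal_handler(), next to the existing write: */
	*(u64 *)pkey_reg_ptr = 0;	/* let the faulting access continue */
	shadow_pkey_reg = 0;		/* keep the shadow in sync with that write */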


Re: [PATCH v14 11/22] selftests/vm: introduce two arch independent abstraction

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> open_hugepage_file() <- opens the huge page file

Folks, a sentence here would be nice:

Different architectures have different huge page sizes and thus
have different sysfs files to manipulate when allocating huge
pages.

> get_start_key() <--  provides the first non-reserved key.

Does powerpc not start on key 0?  Why do you need this?

> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> Signed-off-by: Thiago Jung Bauermann 
> Reviewed-by: Dave Hansen 
> ---
>  tools/testing/selftests/vm/pkey-helpers.h|   10 ++
>  tools/testing/selftests/vm/pkey-x86.h|1 +
>  tools/testing/selftests/vm/protection_keys.c |6 +++---
>  3 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
> b/tools/testing/selftests/vm/pkey-helpers.h
> index ada0146..52a1152 100644
> --- a/tools/testing/selftests/vm/pkey-helpers.h
> +++ b/tools/testing/selftests/vm/pkey-helpers.h
> @@ -179,4 +179,14 @@ static inline void __pkey_write_allow(int pkey, int 
> do_allow_write)
>  #define __stringify_1(x...) #x
>  #define __stringify(x...)   __stringify_1(x)
>  
> +static inline int open_hugepage_file(int flag)
> +{
> + return open(HUGEPAGE_FILE, flag);
> +}

open_nr_hugepages_file(), if you revise this, please.
> +
> +static inline int get_start_key(void)
> +{
> + return 1;
> +}

get_first_user_pkey(), please.

>  #endif /* _PKEYS_HELPER_H */
> diff --git a/tools/testing/selftests/vm/pkey-x86.h 
> b/tools/testing/selftests/vm/pkey-x86.h
> index 2b3780d..d5fa299 100644
> --- a/tools/testing/selftests/vm/pkey-x86.h
> +++ b/tools/testing/selftests/vm/pkey-x86.h
> @@ -48,6 +48,7 @@
>  #define MB   (1<<20)
>  #define pkey_reg_t   u32
>  #define PKEY_REG_FMT "%016x"
> +#define HUGEPAGE_FILE
> "/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
>  
>  static inline u32 pkey_bit_position(int pkey)
>  {
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 2565b4c..2e448e0 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -788,7 +788,7 @@ void setup_hugetlbfs(void)
>* Now go make sure that we got the pages and that they
>* are 2M pages.  Someone might have made 1G the default.
>*/
> - fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", 
> O_RDONLY);
> + fd = open_hugepage_file(O_RDONLY);
>   if (fd < 0) {
>   perror("opening sysfs 2M hugetlb config");
>   return;

This is fine, and obviously necessary.

> @@ -1075,10 +1075,10 @@ void 
> test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
>  void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
>  {
>   int err;
> - int i;
> + int i = get_start_key();
>  
>   /* Note: 0 is the default pkey, so don't mess with it */
> - for (i = 1; i < NR_PKEYS; i++) {
> + for (; i < NR_PKEYS; i++) {
>   if (pkey == i)
>   continue;

Grumble, grumble, you moved the code away from the comment connected to
it.



Re: [PATCH v14 10/22] selftests/vm: fix alloc_random_pkey() to make it really random

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> alloc_random_pkey() was allocating the same pkey every time.
> Not all pkeys were geting tested. fixed it.

This fixes a real issue but also unnecessarily munges whitespace.  If
you rev these again, please fix the munging.  Otherwise:

Acked-by: Dave Hansen 


Re: [PATCH v14 09/22] selftests/vm: fixed bugs in pkey_disable_clear()

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> instead of clearing the bits, pkey_disable_clear() was setting
> the bits. Fixed it.

Again, I know these are just selftests, but I'm seeing a real lack of
attention to detail with these.  Please at least go to the trouble of
writing actual changelogs with full sentences and actual capitalization.
 I think it's the least you can do if I'm going to spend time reviewing
these.

I'll go one better.  Can you just use this as a changelog?

pkey_disable_clear() does not work.  Instead of clearing bits,
it sets them, probably because it originated as a copy-and-paste
of pkey_disable_set().

This has been dead code up to now because the only callers are
pkey_access/write_allow() and _those_ are currently unused.

> Also fixed a wrong assertion in that function. When bits are
> cleared, the resulting bit value will be less than the original.
> 
> This hasn't been a problem so far because this code isn't currently
> used.
> 
> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c 
> b/tools/testing/selftests/vm/protection_keys.c
> index 2dd94c3..8fa4f74 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -433,7 +433,7 @@ void pkey_disable_clear(int pkey, int flags)
>   pkey, pkey, pkey_rights);
>   pkey_assert(pkey_rights >= 0);
>  
> - pkey_rights |= flags;
> + pkey_rights &= ~flags;
>
>   ret = hw_pkey_set(pkey, pkey_rights, 0);
>   shadow_pkey_reg &= clear_pkey_flags(pkey, flags);
> @@ -446,7 +446,7 @@ void pkey_disable_clear(int pkey, int flags)
>   dprintf1("%s(%d) pkey_reg: 0x"PKEY_REG_FMT"\n", __func__,
>   pkey, read_pkey_reg());

If you add a better changelog:

Acked-by: Dave Hansen 

>   if (flags)
> - assert(read_pkey_reg() > orig_pkey_reg);
> + assert(read_pkey_reg() <= orig_pkey_reg);
>  }

This really is an orthogonal change that would have been better placed
with the other patch that did the exact same thing.  But I'd be OK with
it here, as long as it has an actual comment.


Re: [PATCH v14 08/22] selftests/vm: fix the wrong assert in pkey_disable_set()

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> If the flag is 0, no bits will be set. Hence we cant expect
> the resulting bitmap to have a higher value than what it
> was earlier.
...
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -415,7 +415,7 @@ void pkey_disable_set(int pkey, int flags)
>   dprintf1("%s(%d) pkey_reg: 0x"PKEY_REG_FMT"\n",
>   __func__, pkey, read_pkey_reg());
>   if (flags)
> - pkey_assert(read_pkey_reg() > orig_pkey_reg);
> + pkey_assert(read_pkey_reg() >= orig_pkey_reg);
>   dprintf1("END<---%s(%d, 0x%x)\n", __func__,
>   pkey, flags);
>  }

I know these are just selftests, but this change makes zero sense
without the context of how powerpc works.  It's also totally
non-obvious from the patch itself what is going on, even though I
specifically called this out in a previous review.

Please add a comment here that either specifically calls out powerpc or
talks about "an architecture that does this ..."



Re: [PATCH v14 07/22] selftests/vm: generic function to handle shadow key register

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> - shifted_pkey_reg = (pkey_reg >> (pkey * PKEY_BITS_PER_PKEY));
> + shifted_pkey_reg = right_shift_bits(pkey, pkey_reg);
>   dprintf2("%s() shifted_pkey_reg: "PKEY_REG_FMT"\n", __func__,
>   shifted_pkey_reg);
>   masked_pkey_reg = shifted_pkey_reg & mask;

I'm not a fan of how this looks.  Someone is almost certainly going to get
the argument order mixed up at some point.

Why do we need this?  The description doesn't say.
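
If a helper is wanted at all, one that takes the register first and hands
back the masked bits directly would be harder to misuse.  A sketch (the
name is mine, reusing PKEY_BITS_PER_PKEY from the test):

	static inline pkey_reg_t get_pkey_bits(pkey_reg_t reg, int pkey)
	{
		return (reg >> (pkey * PKEY_BITS_PER_PKEY)) &
			((1UL << PKEY_BITS_PER_PKEY) - 1);
	}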


Re: [PATCH v14 06/22] selftests/vm: typecast the pkey register

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> This is in preparation to accomadate a differing size register
> across architectures.

This is pretty fugly, and reading it again, I wonder whether we should
have just made it 64-bit, at least in all the printk's.  Or even

printk("pk reg: %*llx\n", PKEY_FMT_LEN, pkey_reg);
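
In the selftest that would boil down to something like this (sketch;
always widening to u64 so the format is the same on every arch):

	dprintf1("pkey_reg: %016llx\n",
		 (unsigned long long)read_pkey_reg());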

But, I don't _really_ care in the end.

Acked-by: Dave Hansen 


Re: [PATCH v14 04/22] selftests/vm: move arch-specific definitions to arch-specific header

2018-07-18 Thread Dave Hansen
On 07/17/2018 06:49 AM, Ram Pai wrote:
> In preparation for multi-arch support, move definitions which have
> arch-specific values to x86-specific header.

Acked-by: Dave Hansen 


Re: [PATCH v14 03/22] selftests/vm: move generic definitions to header file

2018-07-18 Thread Dave Hansen
Acked-by: Dave Hansen 


Re: [PATCH v14 01/22] selftests/x86: Move protecton key selftest to arch neutral directory

2018-07-18 Thread Dave Hansen
Acked-by: Dave Hansen 



Re: [PATCH v13 19/24] selftests/vm: associate key on a mapped page and detect access violation

2018-07-17 Thread Dave Hansen
On 07/17/2018 09:13 AM, Ram Pai wrote:
> I have incorporated almost all of your comments. But there are some
> comments that take some effort to implement. Shall we get the patches
> merged in the current form?  This code has been sitting out for a while.
> 
> In the current form its tested and works on powerpc and on x86, and
> incorporates about 95% of your suggestions. The rest I will take care
> as we go.

What constitutes the remaining 5%?


Re: [PATCH v13 08/24] selftests/vm: fix the wrong assert in pkey_disable_set()

2018-07-17 Thread Dave Hansen
On 07/17/2018 08:58 AM, Ram Pai wrote:
> On Wed, Jun 20, 2018 at 07:47:02AM -0700, Dave Hansen wrote:
>> On 06/13/2018 05:44 PM, Ram Pai wrote:
>>> If the flag is 0, no bits will be set. Hence we cant expect
>>> the resulting bitmap to have a higher value than what it
>>> was earlier
>> ...
>>> if (flags)
>>> -   pkey_assert(read_pkey_reg() > orig_pkey_reg);
>>> +   pkey_assert(read_pkey_reg() >= orig_pkey_reg);
>>> dprintf1("END<---%s(%d, 0x%x)\n", __func__,
>>> pkey, flags);
>>>  }
>> This is the kind of thing where I'd love to hear the motivation and
>> background.  This "disable a key that was already disabled" operation
>> obviously doesn't happen today.  What motivated you to change it now?
> On powerpc, hardware supports READ_DISABLE and WRITE_DISABLE.
> ACCESS_DISABLE is basically READ_DISABLE|WRITE_DISABLE on powerpc.
> 
> If access disable is called on a key followed by a write disable, the
> second operation becomes a nop. In such cases, 
>read_pkey_reg() == orig_pkey_reg
> 
> Hence the code above is modified to 
>   pkey_assert(read_pkey_reg() >= orig_pkey_reg);

Makes sense.  Do we have a comment for that now?

