Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dave Hansen
On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows a priori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here.  Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

# perf stat -e faults /bin/ls

 Performance counter stats for '/bin/ls':

   104  faults


   0.001499862 seconds time elapsed

   0.001490000 seconds user
   0.000000000 seconds sys
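If we went down that path, the plumbing is small.  A hypothetical sketch
(the event name, its placement, and the call site are made up here just to
show the shape, not taken from any actual patch):

```c
/* include/uapi/linux/perf_event.h: add a new software event id */
enum perf_sw_ids {
	/* ... PERF_COUNT_SW_PAGE_FAULTS and the other existing ids ... */
	PERF_COUNT_SW_HUGE_FAULTS,	/* hypothetical: PMD/PUD-sized faults */
	PERF_COUNT_SW_MAX,		/* non-ABI */
};

/* ... and a bump in the huge fault path, mirroring how the arch fault
 * handlers already count PERF_COUNT_SW_PAGE_FAULTS: */
perf_sw_event(PERF_COUNT_SW_HUGE_FAULTS, 1, regs, address);
```

After which something like `perf stat -e faults,huge_faults ./test` would
tell a unit test whether it got the PMD faults it expected, with no
filesystem-specific counters needed.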


Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dave Hansen
On 2/23/24 09:42, John Groves wrote:
> One of the key requirements for famfs is that it service vma faults
> efficiently. Our metadata helps - the search order is n for n extents,
> and n is usually 1. But we can still observe gnarly lock contention
> in mm if PTE faults are happening. This commit introduces fault counters
> that can be enabled and read via /sys/fs/famfs/...
> 
> These counters have proved useful in troubleshooting situations where
> PTE faults were happening instead of PMD. No performance impact when
> disabled.

This seems kinda wonky.  Why does _this_ specific filesystem need its
own fault counters?  Seems like something we'd want to do much more
generically, if it is needed at all.

Was the issue here just that vm_ops->fault() was getting called instead
of ->huge_fault()?  Or something more subtle?



Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

2023-09-19 Thread Dave Hansen
On 9/19/23 08:48, Matteo Rizzo wrote:
>> I think the whole "make it one single compile-time option" model is
>> completely and fundamentally broken.
> Wouldn't making this toggleable at boot time or runtime make performance
> even worse?

Maybe.

But you can tolerate even more of a performance impact from a feature if
the people that don't care can actually disable it.

There are also plenty of ways to minimize the overhead of switching it
on and off at runtime.  Static branches are your best friend here.
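For anyone unfamiliar, the usual shape is something like this (a fragment,
with names made up rather than taken from this series):

```c
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(slab_virtual_enabled);

/* Hot path: compiles to a NOP when the feature is off, and is patched
 * to a jump only when someone flips the key at runtime. */
static inline bool slab_virtual_active(void)
{
	return static_branch_unlikely(&slab_virtual_enabled);
}

/* Boot-param or sysfs toggle paths pay the text-patching cost once,
 * instead of the hot path paying a load+test on every allocation. */
static void slab_virtual_set(bool on)
{
	if (on)
		static_branch_enable(&slab_virtual_enabled);
	else
		static_branch_disable(&slab_virtual_enabled);
}
```

The hot-path check costs a NOP when the feature is off, which is what lets
distros ship it compiled in but default-disabled.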


Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

2023-09-19 Thread Dave Hansen
On 9/19/23 06:42, Matteo Rizzo wrote:
> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar  wrote:
>> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
>> user-space execution and kernel-space execution?
>>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):

Thanks for running those.  One more situation that comes to mind is how
this will act under memory pressure.  Will some memory pressure make
contention on 'slub_kworker_lock' visible or make the global TLB flushes
less bearable?

In any case, none of this looks _catastrophic_.  It's surely a cost that
some folks will pay.

But I really do think it needs to be more dynamic.  There are a _couple_
of reasons for this.  If it's only a compile-time option, it's never
going to get turned on except for maybe ChromeOS and the datacenter
folks that are paranoid.  I suspect the distros will never turn it on.

A lot of questions get easier if you can disable/enable it at runtime.
For instance, what do you do if the virtual area fills up?  You _could_
just go back to handing out direct map addresses.  Less secure?  Yep.
But better than crashing (for some folks).

It also opens up the door to do this per-slab.  That alone would be a
handy debugging option.


Re: [RFC PATCH 11/14] mm/slub: allocate slabs from virtual memory

2023-09-15 Thread Dave Hansen
On 9/15/23 03:59, Matteo Rizzo wrote:
> + spin_lock_irqsave(&slub_kworker_lock, irq_flags);
> + list_splice_init(&slub_tlbflush_queue, &local_queue);
> + list_for_each_entry(slab, &local_queue, flush_list_elem) {
> + unsigned long start = (unsigned long)slab_to_virt(slab);
> + unsigned long end = start + PAGE_SIZE *
> + (1UL << oo_order(slab->oo));
> +
> + if (start < addr_start)
> + addr_start = start;
> + if (end > addr_end)
> + addr_end = end;
> + }
> + spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
> +
> + if (addr_start < addr_end)
> + flush_tlb_kernel_range(addr_start, addr_end);

I assume that the TLB flushes in the queue are going to be pretty sparse
on average.

At least on x86, flush_tlb_kernel_range() falls back pretty quickly from
individual address invalidation to just doing a full flush.  It might
not even be worth tracking the address ranges, and just do a full flush
every time.

I'd be really curious to see how often actual ranged flushes are
triggered from this code.  I expect it would be very near zero.


Re: [RFC PATCH 10/14] x86: Create virtual memory region for SLUB

2023-09-15 Thread Dave Hansen
On 9/15/23 14:13, Kees Cook wrote:
> On Fri, Sep 15, 2023 at 10:59:29AM +, Matteo Rizzo wrote:
>> From: Jann Horn 
>>
>> SLAB_VIRTUAL reserves 512 GiB of virtual memory and uses them for both
>> struct slab and the actual slab memory. The pointers returned by
>> kmem_cache_alloc will point to this range of memory.
> 
> I think the 512 GiB limit may be worth mentioning in the Kconfig help
> text.

Yes, please.

> And in the "640K is enough for everything" devil's advocacy, why is 512
> GiB enough here? Is there any greater risk of a pathological allocation
> pattern breaking a system any more (or less) than is currently possible?

I have the feeling folks just grabbed the first big-ish chunk they saw
free in the memory map and stole that one.  Not a horrible approach,
mind you, but I have the feeling it didn't go through the most rigorous
sizing procedure. :)

My laptop memory is ~6% consumed by slab, 90% of which is reclaimable.
If a 64TB system had the same ratio, it would bump into this 512GB
limit.  But it _should_ just reclaim things earlier rather than falling over.

That said, we still have gobs of actual vmalloc() space.  It's ~30TiB in
size and I'm not aware of anyone consuming anywhere near that much.  If
the 512GB fills up somehow, there are other places to steal the space.

One minor concern is that the virtual area is the same size on 4 and
5-level paging systems.  It might be a good idea to pick one of the
holes that actually gets bigger on 5-level systems.



Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

2023-09-15 Thread Dave Hansen
On 9/15/23 03:59, Matteo Rizzo wrote:
> The goal of this patch series is to deterministically prevent cross-cache
> attacks in the SLUB allocator.

What's the cost?


Re: [PATCH 8/8] x86, kcsan: Enable KCSAN for x86

2019-10-16 Thread Dave Hansen
On 10/16/19 1:39 AM, Marco Elver wrote:
> This patch enables KCSAN for x86, with updates to build rules to not use
> KCSAN for several incompatible compilation units.

First of all KCSAN looks really interesting!

For the x86 code, though, I'd really appreciate some specific notes on
why individual compilation units are incompatible.  There might be some
that were missed, and we have to figure out what we do for any future
work.  Knowing the logic used on these would be really helpful in the
future.


Re: [PATCH] Documentation/process: Clarify disclosure rules

2019-09-25 Thread Dave Hansen
On 9/25/19 1:29 AM, Thomas Gleixner wrote:
> The role of the contact list provided by the disclosing party and how it
> affects the disclosure process and the ability to include experts into
> the development process is not really well explained.
> 
> Neither is it entirely clear when the disclosing party will be informed
> about the fact that a developer who is not covered by an employer NDA needs
> to be brought in and disclosed.
> 
> Explain the role of the contact list and the information policy along with
> an eventual conflict resolution better.

This addresses the concerns I had about individuals who the community
needs to participate but which might be encumbered by an agreement
between their employer and the disclosing party.

To be explicit, this ack does not represent any official Intel position
and has not been reviewed by anyone at Intel other than me.

Acked-by: Dave Hansen 


Re: [PATCH 2/4] Documentation/process: describe relaxing disclosing party NDAs

2019-09-11 Thread Dave Hansen
On 9/11/19 8:44 AM, Greg KH wrote:
> Intel had months of review time for this document before this was
> published.  Your lawyers had it and never objected to this lack of
> inclusion at all, and explicitly said that the document as written was
> fine with them.  So I'm sorry, but it is much too late to add something
> like this to the document at this point in time.

Hi Greg,

I'll personally take 100% of the blame for this patch.  I intended for
it to show our commitment to work *with* our colleagues in the
community, not to dictate demands.  Please consider this as you would
any other patch: a humble suggestion to address what I see as a gap.

Just to be clear: this addition came from me and only me.  It did not
come from any Intel lawyers and does not represent any kind of objection
to the process.  Intel's support for this process is unconditional and
not dependent on any of these patches.

> Oh, and cute use of the term, "timely manner", as if we are going to
> fall for that one again... 

Oh, I think that was actually a quote from an email from Thomas
explaining how he wanted these things dealt with.  If you change your
mind and are open to improvements to this process in the future, I'd be
happy to change this to some kind of explicit deadline if that's preferred.


Re: [PATCH 2/4] Documentation/process: describe relaxing disclosing party NDAs

2019-09-11 Thread Dave Hansen
On 9/11/19 3:11 AM, Sasha Levin wrote:
>> +Disclosing parties may have shared information about an issue under a
>> +non-disclosure agreement with third parties.  In order to ensure that
>> +these agreements do not interfere with the mitigation development
>> +process, the disclosing party must provide explicit permission to
>> +participate to any response team members affected by a non-disclosure
>> +agreement.  Disclosing parties must resolve requests to do so in a
>> +timely manner.
> 
> Can giving the permission be made explicitly along with the disclosure?
> If it's disclosed with Microsoft under NDA, it makes it tricky for me to
> participate in the "response team" context here unless premission is
> given to do so.

Hi Sasha,

It is probably possible to do in advance.  But, probably only if we list
the folks for which it needs to be done in advance in the process file.
It makes a lot of sense to have the stable maintainers as permanent
members of any response team.

But, I was hoping what I described above would be a bit more flexible
than needing to have a list.  The downside is that the response team
needs to explicitly ask every time for folks like you to be included.


[PATCH 4/4] Documentation/process: add transparency promise to list subscription

2019-09-10 Thread Dave Hansen


From: Dave Hansen 

Transparency is good.  It is essential for everyone working under an
embargo to know who is involved and who else is a "knower".  Being
transparent allows everyone to always make informed decisions about
ongoing participation in a mitigation effort.

Add a step to the subscription process which will notify existing
subscribers when a new one is added.

While I think this is good for everyone, this patch represents my
personal opinion and not that of my employer.

Cc: Jonathan Corbet 
Cc: Greg Kroah-Hartman 
Cc: Sasha Levin 
Cc: Ben Hutchings 
Cc: Thomas Gleixner 
Cc: Laura Abbott 
Cc: Andrew Cooper 
Cc: Trilok Soni 
Cc: Kees Cook 
Cc: Tony Luck 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Acked-by: Dan Williams 
Signed-off-by: Dave Hansen 
---

 b/Documentation/process/embargoed-hardware-issues.rst |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff -puN Documentation/process/embargoed-hardware-issues.rst~hw-sec-2 Documentation/process/embargoed-hardware-issues.rst
--- a/Documentation/process/embargoed-hardware-issues.rst~hw-sec-2	2019-09-10 09:58:47.989476197 -0700
+++ b/Documentation/process/embargoed-hardware-issues.rst	2019-09-10 09:58:47.992476197 -0700
@@ -276,10 +276,11 @@ certificate. If a PGP key is used, it mu
 server and is ideally connected to the Linux kernel's PGP web of trust. See
 also: https://www.kernel.org/signature.html.
 
-The response team verifies that the subscriber request is valid and adds
-the subscriber to the list. After subscription the subscriber will receive
-email from the mailing-list which is signed either with the list's PGP key
-or the list's S/MIME certificate. The subscriber's email client can extract
-the PGP key or the S/MIME certificate from the signature so the subscriber
-can send encrypted email to the list.
+The response team verifies that the subscriber request is valid, adds the
+subscriber to the list, and notifies the existing list subscribers
+including the disclosing party. After subscription the subscriber will
+receive email from the mailing-list which is signed either with the list's
+PGP key or the list's S/MIME certificate. The subscriber's email client can
+extract the PGP key or the S/MIME certificate from the signature so the
+subscriber can send encrypted email to the list.
 
_


[PATCH 2/4] Documentation/process: describe relaxing disclosing party NDAs

2019-09-10 Thread Dave Hansen


From: Dave Hansen 

Hardware companies like Intel have lots of information which they
want to disclose to some folks but not others.  Non-disclosure
agreements are a tool of choice for helping to ensure that the
flow of information is controlled.

But, they have caused problems in mitigation development.  It
can be hard for individual developers employed by companies to
figure out how they can participate, especially if their
employer is under an NDA.

To make this easier for developers, make it clear to disclosing
parties that they are expected to give permission for individuals
to participate in mitigation efforts.

Cc: Jonathan Corbet 
Cc: Greg Kroah-Hartman 
Cc: Sasha Levin 
Cc: Ben Hutchings 
Cc: Thomas Gleixner 
Cc: Laura Abbott 
Cc: Andrew Cooper 
Cc: Trilok Soni 
Cc: Kees Cook 
Cc: Tony Luck 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Acked-by: Dan Williams 
Signed-off-by: Dave Hansen 
---

 b/Documentation/process/embargoed-hardware-issues.rst |7 +++
 1 file changed, 7 insertions(+)

diff -puN Documentation/process/embargoed-hardware-issues.rst~hw-sec-0 Documentation/process/embargoed-hardware-issues.rst
--- a/Documentation/process/embargoed-hardware-issues.rst~hw-sec-0	2019-09-10 08:39:02.835488131 -0700
+++ b/Documentation/process/embargoed-hardware-issues.rst	2019-09-10 08:39:02.838488131 -0700
@@ -74,6 +74,13 @@ unable to enter into any non-disclosure
 is aware of the sensitive nature of such issues and offers a Memorandum of
 Understanding instead.
 
+Disclosing parties may have shared information about an issue under a
+non-disclosure agreement with third parties.  In order to ensure that
+these agreements do not interfere with the mitigation development
+process, the disclosing party must provide explicit permission to
+participate to any response team members affected by a non-disclosure
+agreement.  Disclosing parties must resolve requests to do so in a
+timely manner.
 
 Memorandum of Understanding
 ---
_


[PATCH 1/4] Documentation/process: Volunteer as the ambassador for Intel

2019-09-10 Thread Dave Hansen


From: Tony Luck 

Cc: Jonathan Corbet 
Cc: Greg Kroah-Hartman 
Cc: Sasha Levin 
Cc: Ben Hutchings 
Cc: Thomas Gleixner 
Cc: Laura Abbott 
Cc: Andrew Cooper 
Cc: Trilok Soni 
Cc: Kees Cook 
Cc: Tony Luck 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: Dan Williams 
Signed-off-by: Tony Luck 
Signed-off-by: Dave Hansen 
---

 b/Documentation/process/embargoed-hardware-issues.rst |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN Documentation/process/embargoed-hardware-issues.rst~Documentation-process-Volunteer-as-the-ambassador-for-Intel Documentation/process/embargoed-hardware-issues.rst
--- a/Documentation/process/embargoed-hardware-issues.rst~Documentation-process-Volunteer-as-the-ambassador-for-Intel	2019-09-10 08:39:05.971488123 -0700
+++ b/Documentation/process/embargoed-hardware-issues.rst	2019-09-10 08:39:05.974488123 -0700
@@ -222,7 +222,7 @@ an involved disclosed party. The current
   ARM
   AMD
   IBM
-  Intel
+  Intel Tony Luck 
   Qualcomm Trilok Soni 
 
   MicrosoftSasha Levin 
_


[PATCH 3/4] Documentation/process: soften language around conference talk dates

2019-09-10 Thread Dave Hansen


From: Dave Hansen 

Both hardware companies and the kernel community prefer coordinated
disclosure to the alternatives.  It is also obvious that sitting on
ready-to-go mitigations for months is not so nice for kernel
maintainers.

I want to ensure that the patched text can not be read as "the kernel
does not wait for conference dates".  I'm also fairly sure that, so
far, we *have* waited for a number of conference dates.

Change the text to make it clear that waiting for conference dates
is possible, but keep the grumbling about it being a burden.

While I think this is good for everyone, this patch represents my
personal opinion and not that of my employer.

Cc: Jonathan Corbet 
Cc: Greg Kroah-Hartman 
Cc: Sasha Levin 
Cc: Ben Hutchings 
Cc: Thomas Gleixner 
Cc: Laura Abbott 
Cc: Andrew Cooper 
Cc: Trilok Soni 
Cc: Kees Cook 
Cc: Tony Luck 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Acked-by: Dan Williams 
Signed-off-by: Dave Hansen 
---

 b/Documentation/process/embargoed-hardware-issues.rst |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff -puN Documentation/process/embargoed-hardware-issues.rst~hw-sec-1 Documentation/process/embargoed-hardware-issues.rst
--- a/Documentation/process/embargoed-hardware-issues.rst~hw-sec-1	2019-09-10 08:39:03.879488129 -0700
+++ b/Documentation/process/embargoed-hardware-issues.rst	2019-09-10 08:39:03.883488129 -0700
@@ -197,10 +197,9 @@ While we understand that hardware securi
 time, the embargo time should be constrained to the minimum time which is
 required for all involved parties to develop, test and prepare the
 mitigations. Extending embargo time artificially to meet conference talk
-dates or other non-technical reasons is creating more work and burden for
-the involved developers and response teams as the patches need to be kept
-up to date in order to follow the ongoing upstream kernel development,
-which might create conflicting changes.
+dates or other non-technical reasons is possible, but not preferred. These
+artificial extensions burden the response team with constant maintenance
+updating mitigations to follow upstream kernel development.
 
 CVE assignment
 """"""""""""""
_


[PATCH 0/4] Documentation/process: embargoed hardware issues additions

2019-09-10 Thread Dave Hansen
Intel will adhere to this process for future hardware embargoed
issues.  This series contains a patch from Tony Luck with him
volunteering to be Intel's ambassador for this process.

These are some minor improvements to the process document.  I've
had the pleasure of seeing some of the problems with the various
"processes" that led to the need for this document and I think these
tweaks will help.  This part of the series is much more of an RFC than
the first patch, obviously.

Cc: Jonathan Corbet 
Cc: Greg Kroah-Hartman 
Cc: Sasha Levin 
Cc: Ben Hutchings 
Cc: Thomas Gleixner 
Cc: Laura Abbott 
Cc: Andrew Cooper 
Cc: Trilok Soni 
Cc: Kees Cook 
Cc: Tony Luck 
Cc: Dan Williams 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org


Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW

2019-08-14 Thread Dave Hansen
On 8/14/19 9:42 AM, Yu-cheng Yu wrote:
> On Tue, 2019-08-13 at 16:02 -0700, Dave Hansen wrote:
> [...]
>> Please also reconcile the supervisor XSAVE portion of your patches with
>> the ones that Fenghua has been sending around.  I've given quite a bit
>> of feedback to improve those.  Please consolidate and agree on a common
>> set of patches with him.
> XSAVES supervisor is now a six-patch set.  Maybe we can make it a separate
> series?  I will consolidate and send it out.

A separate series would be great.

Please also make sure it's in a (temporary) git tree somewhere so that
it's easy to base other sets on top of it.


Re: [PATCH v8 15/27] mm: Handle shadow stack page fault

2019-08-14 Thread Dave Hansen
On 8/14/19 9:27 AM, Yu-cheng Yu wrote:
> On Tue, 2019-08-13 at 15:55 -0700, Andy Lutomirski wrote:
>> On Tue, Aug 13, 2019 at 2:02 PM Yu-cheng Yu  wrote:
>>> When a task does fork(), its shadow stack (SHSTK) must be duplicated
>>> for the child.  This patch implements a flow similar to copy-on-write
>>> of an anonymous page, but for SHSTK.
>>>
>>> A SHSTK PTE must be RO and dirty.  This dirty bit requirement is used
>>> to effect the copying.  In copy_one_pte(), clear the dirty bit from a
>>> SHSTK PTE to cause a page fault upon the next SHSTK access.  At that
>>> time, fix the PTE and copy/re-use the page.
>> Is using VM_SHSTK and special-casing all of this really better than
>> using a special mapping or other pseudo-file-backed VMA and putting
>> all the magic in the vm_operations?
> A special mapping is cleaner.  However, we also need to exclude normal [RO +
> dirty] pages from shadow stack.

I don't understand what you are saying.

Are you saying that we need this VM_SHSTK flag in order to exclude
RO+HW-Dirty pages from being created in non-shadow-stack VMAs?


Re: [PATCH v8 11/27] x86/mm: Introduce _PAGE_DIRTY_SW

2019-08-13 Thread Dave Hansen
> +#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
> +static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
> +{
> + if (pte_flags(pte) & from)
> + pte = pte_set_flags(pte_clear_flags(pte, from), to);
> + return pte;

Why is this conditional on the compile option and not a runtime check?

> +}
> +#else
> +static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
> +{
> + return pte;
> +}
> +#endif

Why do we need this function?  It's not mentioned in the changelog or
commented.

>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> - return pte_clear_flags(pte, _PAGE_DIRTY);
> + return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -322,6 +336,7 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> + pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
>   return pte_clear_flags(pte, _PAGE_RW);
>  }

Please comment what this is doing and why.

> @@ -332,9 +347,24 @@ static inline pte_t pte_mkexec(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> + pteval_t dirty = (!IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) ||
> +pte_write(pte)) ? _PAGE_DIRTY_HW:_PAGE_DIRTY_SW;

This is *really* hard for me to read and parse.  How about:

pteval_t dirty = _PAGE_DIRTY_HW;

/*
 * When Shadow Stacks are enabled, read-only PTEs can
 * not have the hardware dirty bit set and must use
 * the software bit.
 */
if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) &&
!pte_write(pte))
dirty = _PAGE_DIRTY_SW;


> + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_SHSTK
> +static inline pte_t pte_mkdirty_shstk(pte_t pte)
> +{
> + pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
>   return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
>  }

Why does the _PAGE_DIRTY_SW *HAVE* to be cleared on shstk pages?

> +static inline bool pte_dirty_hw(pte_t pte)
> +{
> + return pte_flags(pte) & _PAGE_DIRTY_HW;
> +}
> +#endif

Why are these #ifdef'd?

>  static inline pte_t pte_mkyoung(pte_t pte)
>  {
>   return pte_set_flags(pte, _PAGE_ACCESSED);
> @@ -342,6 +372,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
>  
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
> + pte = pte_move_flags(pte, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
>   return pte_set_flags(pte, _PAGE_RW);
>  }

It also isn't clear to me why this *must* move bits here.  It's doubly
unclear why you would need to do this on systems where shadow stacks are
compiled in but disabled.



Same comments for pmds and puds.

> -
>  /* mprotect needs to preserve PAT bits when updating vm_page_prot */
>  #define pgprot_modify pgprot_modify
>  static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
> @@ -1178,6 +1254,19 @@ static inline int pmd_write(pmd_t pmd)
>   return pmd_flags(pmd) & _PAGE_RW;
>  }
>  
> +static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> +{
> + pmdval_t val = pmd_val(pmd), oldval = val;
> +
> + val &= _HPAGE_CHG_MASK;
> + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> + val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
> + if ((pmd_write(pmd) && !(pgprot_val(newprot) & _PAGE_RW)))
> + return pmd_move_flags(__pmd(val), _PAGE_DIRTY_HW,
> +   _PAGE_DIRTY_SW);
> + return __pmd(val);
> +}

Why was this function moved?  This makes it really hard to review what
you changed.

I'm going to stop reading this code now.  It needs a lot more care and
feeding to make it reviewable.  Please go back, double-check your
changelogs and flesh them out, then please try to make the code more
readable and understandable by commenting it.

Please take all of the compile-time checks and ask yourself whether they
need to be or *can* be runtime checks.  Consider what the overhead is of
non-shadowstack systems running shadowstack-required code.

Please also reconcile the supervisor XSAVE portion of your patches with
the ones that Fenghua has been sending around.  I've given quite a bit
of feedback to improve those.  Please consolidate and agree on a common
set of patches with him.


Re: [PATCH v8 09/27] mm/mmap: Prevent Shadow Stack VMA merges

2019-08-13 Thread Dave Hansen
On 8/13/19 1:52 PM, Yu-cheng Yu wrote:
> To prevent function call/return spills into the next shadow stack
> area, do not merge shadow stack areas.

How does this prevent call/return spills?


Re: [PATCH v7 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst

2019-08-09 Thread Dave Hansen
On 8/8/19 10:27 AM, Catalin Marinas wrote:
> On Wed, Aug 07, 2019 at 01:38:16PM -0700, Dave Hansen wrote:
> Extending the interface is still possible even with the current
> proposal, by changing arg2 etc. We also don't seem to be consistent in
> sys_prctl().

We are not consistent because it took a long time to learn this lesson,
but I think this is a best-practice that we follow for new ones.

>> Also, shouldn't this be converted over to an arch_prctl()?
> 
> What do you mean by arch_prctl()? We don't have such thing, apart from
> maybe arch_prctl_spec_ctrl_*(). We achieve the same thing with the
> {SET,GET}_TAGGED_ADDR_CTRL macros. They could be renamed to
> arch_prctl_tagged_addr_{set,get} or something but I don't see much
> point.

Silly me.  We have an x86-specific:

SYSCALL_DEFINE2(arch_prctl, int , option, unsigned long , arg2)

I guess there's no ARM equivalent so you're stuck with the generic one.

> What would be better (for a separate patch series) is to clean up
> sys_prctl() and move the arch-specific options into separate
> arch_prctl() under arch/*/kernel/. But it's not really for this series.

I think it does make sense for truly arch-specific features to stay out
of the arch-generic prctl().  Yes, I know I've personally violated this
in the past. :)

>> What is the scope of these prctl()'s?  Are they thread-scoped or
>> process-scoped?  Can two threads in the same process run with different
>> tagging ABI modes?
> 
> Good point. They are thread-scoped and this should be made clear in the
> doc. Two threads can have different modes.
> 
> The expectation is that this is invoked early during process start (by
> the dynamic loader or libc init) while in single-thread mode and
> subsequent threads will inherit the same mode. However, other uses are
> possible.

If that's the expectation, it would be really nice to codify it.
Basically, you can't enable the feature if another thread has already
been forked off.

> That said, do we have a precedent for changing user ABI from the kernel
> cmd line? 'noexec32', 'vsyscall' I think come close. With the prctl()
> for opt-in, controlling this from the cmd line is not too bad (though my
> preference is still for the sysctl).

There are certainly user-visible things like being able to select
various CPU features.

>>> +When a process has successfully enabled the new ABI by invoking
>>> +prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE), the following
>>> +behaviours are guaranteed:
>>> +
>>> +- Every currently available syscall, except the cases mentioned in section
>>> +  3, can accept any valid tagged pointer. The same rule is applicable to
>>> +  any syscall introduced in the future.
>>> +
>>> +- The syscall behaviour is undefined for non valid tagged pointers.
>>
>> Do you really mean "undefined"?  I mean, a bad pointer is a bad pointer.
>>  Why should it matter if it's a tagged bad pointer or an untagged bad
>> pointer?
> 
> Szabolcs already replied here. We may have tagged pointers that can be
> dereferenced just fine but being passed to the kernel may not be well
> defined (e.g. some driver doing a find_vma() that fails unless it
> explicitly untags the address). It's as undefined as the current
> behaviour (without these patches) guarantees.

It might just be nicer to point out what this features *changes* about
invalid pointer handling, which is nothing. :)  Maybe:

The syscall behaviour for invalid pointers is the same for both
tagged and untagged pointers.

>>> +- prctl(PR_SET_MM, ``*``, ...) other than arg2 PR_SET_MM_MAP and
>>> +  PR_SET_MM_MAP_SIZE.
>>> +
>>> +- prctl(PR_SET_MM, PR_SET_MM_MAP{,_SIZE}, ...) struct prctl_mm_map fields.
>>> +
>>> +Any attempt to use non-zero tagged pointers will lead to undefined
>>> +behaviour.
>>
>> I wonder if you want to generalize this a bit.  I think you're saying
>> that parts of the ABI that modify the *layout* of the address space
>> never accept tagged pointers.
> 
> I guess our difficulty in specifying this may have been caused by
> over-generalising. For example, madvise/mprotect came under the same
> category but there is a use-case for malloc'ed pointers (and tagged) to
> the kernel (e.g. MADV_DONTNEED). If we can restrict the meaning to
> address space *layout* manipulation, we'd have mmap/mremap/munmap,
> brk/sbrk, prctl(PR_SET_MM). Did I miss anything?. Other related syscalls
> like mprotect/madvise preserve the layout while only changing permissions,
> backing store, so the would be allowed to accept tags.

shmat() comes to mind.

Re: [PATCH v7 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst

2019-08-07 Thread Dave Hansen
On 8/7/19 8:53 AM, Catalin Marinas wrote:
> +- mmap() done by the process itself (or its parent), where either:
> +
> +  - flags have the **MAP_ANONYMOUS** bit set
> +  - the file descriptor refers to a regular file (including those returned
> +by memfd_create()) or **/dev/zero**

What's a "regular file"? ;)

> +- brk() system call done by the process itself (i.e. the heap area between
> +  the initial location of the program break at process creation and its
> +  current location).
> +
> +- any memory mapped by the kernel in the address space of the process
> +  during creation and with the same restrictions as for mmap() above (e.g.
> +  data, bss, stack).
> +
> +The AArch64 Tagged Address ABI is an opt-in feature and an application can
> +control it via **prctl()** as follows:
> +
> +- **PR_SET_TAGGED_ADDR_CTRL**: enable or disable the AArch64 Tagged Address
> +  ABI for the calling process.
> +
> +  The (unsigned int) arg2 argument is a bit mask describing the control mode
> +  used:
> +
> +  - **PR_TAGGED_ADDR_ENABLE**: enable AArch64 Tagged Address ABI. Default
> +status is disabled.
> +
> +  The arguments arg3, arg4, and arg5 are ignored.

For previous prctl()'s, we've found that it's best to require that the
unused arguments be 0.  Without that, apps are free to put garbage
there, which makes extending the prctl to use other arguments impossible
in the future.
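As an illustration, the call shape with the unused arguments explicitly zeroed looks like this (the prctl constants are the ones proposed by the patch and may not exist in shipped headers, hence the fallback defines; on kernels or architectures without the feature the call simply fails):

```c
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL	55
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#endif

/* Pass explicit zeros for arg3..arg5 so the kernel is free to assign
 * them a meaning later; garbage there would make extension impossible. */
int enable_tagged_addr_abi(void)
{
	return prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0);
}
```
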

Also, shouldn't this be converted over to an arch_prctl()?

> +The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the
> +AArch64 Tagged Address ABI is not available
> +(CONFIG_ARM64_TAGGED_ADDR_ABI disabled or sysctl abi.tagged_addr=0).
> +
> +The ABI properties set by the mechanism described above are inherited by
> +threads of the same application and fork()'ed children but cleared by
> +execve().

What is the scope of these prctl()'s?  Are they thread-scoped or
process-scoped?  Can two threads in the same process run with different
tagging ABI modes?

> +Opting in (the prctl() option described above only) to or out of the
> +AArch64 Tagged Address ABI can be disabled globally at runtime using the
> +sysctl interface:
> +
> +- **abi.tagged_addr**: a new sysctl interface that can be used to prevent
> +  applications from enabling or disabling the relaxed ABI. The sysctl
> +  supports the following configuration options:
> +
> +  - **0**: disable the prctl(PR_SET_TAGGED_ADDR_CTRL) option to
> +enable/disable the AArch64 Tagged Address ABI globally
> +
> +  - **1** (Default): enable the prctl(PR_SET_TAGGED_ADDR_CTRL) option to
> +enable/disable the AArch64 Tagged Address ABI globally
> +
> +  Note that this sysctl does not affect the status of the AArch64 Tagged
> +  Address ABI of the running processes.

Shouldn't the name be "abi.tagged_addr_control" or something?  It
actually has *zero* direct effect on tagged addresses in the ABI.

What's the reason for allowing it to be toggled at runtime like this?
Wouldn't it make more sense to just have it be a boot option so you
*know* what the state of individual processes is?

> +When a process has successfully enabled the new ABI by invoking
> +prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE), the following
> +behaviours are guaranteed:
> +
> +- Every currently available syscall, except the cases mentioned in section
> +  3, can accept any valid tagged pointer. The same rule is applicable to
> +  any syscall introduced in the future.
> +
> +- The syscall behaviour is undefined for non valid tagged pointers.

Do you really mean "undefined"?  I mean, a bad pointer is a bad pointer.
 Why should it matter if it's a tagged bad pointer or an untagged bad
pointer?

...
> +A definition of the meaning of tagged pointers on AArch64 can be found in:
> +Documentation/arm64/tagged-pointers.txt.
> +
> +3. AArch64 Tagged Address ABI Exceptions
> +-
> +
> +The behaviour described in section 2, with particular reference to the
> +acceptance by the syscalls of any valid tagged pointer, is not applicable
> +to the following cases:

This is saying things in a pretty roundabout manner.  Can't it just say:
 "The following cases do not accept tagged pointers:"

> +- mmap() addr parameter.
> +
> +- mremap() new_address parameter.

Is munmap() missing?  Or was there a reason for leaving it out?

> +- prctl(PR_SET_MM, ``*``, ...) other than arg2 PR_SET_MM_MAP and
> +  PR_SET_MM_MAP_SIZE.
> +
> +- prctl(PR_SET_MM, PR_SET_MM_MAP{,_SIZE}, ...) struct prctl_mm_map fields.
> +
> +Any attempt to use non-zero tagged pointers will lead to undefined
> +behaviour.

I wonder if you want to generalize this a bit.  I think you're saying
that parts of the ABI that modify the *layout* of the address space
never accept tagged pointers.

> +4. Example of correct usage
> +---
> +.. code-block:: c
> +
> +   void main(void)
> +   {
> +   static int tbi_enabled = 0;
> +   unsigned long tag = 0;
> +
> +   char *ptr = mmap(NUL

Re: [PATCH v6 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst

2019-07-31 Thread Dave Hansen
On 7/25/19 6:50 AM, Vincenzo Frascino wrote:
> With the relaxed ABI proposed through this document, it is now possible
> to pass tagged pointers to the syscalls, when these pointers are in
> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().

I don't see a lot of description of why this restriction is necessary.
What's the problem with supporting MAP_SHARED?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-14 Thread Dave Hansen
On 6/14/19 2:34 PM, Yu-cheng Yu wrote:
> On Fri, 2019-06-14 at 13:57 -0700, Dave Hansen wrote:
>>> I have a related question:
>>>
>>> Do we allow the application to read the bitmap, or any fault from the
>>> application on bitmap pages?
>>
>> We have to allow apps to read it.  Otherwise they can't execute
>> instructions.
> 
> What I meant was, if an app executes some legacy code that results in bitmap
> lookup, but the bitmap page is not yet populated, and if we then populate that
> page with all-zero, a #CP should follow.  So do we even populate that zero
> page at all?
> 
> I think we should; a #CP is more obvious to the user at least.

Please make an effort to un-Intel-ificate your messages as much as
possible.  I'd really prefer that folks say "missing end branch fault"
rather than #CP.  I had to Google "#CP".

I *think* you are saying that:  The *only* lookups to this bitmap are on
"missing end branch" conditions.  Normal, proper-functioning code
execution that has ENDBR instructions in it will never even look at the
bitmap.  The only case when we reference the bitmap locations is when
the processor is about do do a "missing end branch fault" so that it can
be suppressed.  Any population with the zero page would be done when
code had already encountered a "missing end branch" condition, and
populating with a zero-filled page will guarantee that a "missing end
branch fault" will result.  You're arguing that we should just figure
this out at fault time and not ever reach the "missing end branch fault"
at all.

Is that right?

If so, that's an architecture subtlety that I missed until now and which
went entirely unmentioned in the changelog and discussion up to this
point.  Let's make sure that nobody else has to walk that path by
improving our changelog, please.

In any case, I don't think this is worth special-casing our zero-fill
code, FWIW.  It's not performance critical and not worth the complexity.
 If apps want to handle the signals and abuse this to fill space up with
boring page table contents, they're welcome to.  There are much easier
ways to consume a lot of memory.

>> We don't have to allow them to (populating) fault on it.  But, if we
>> don't, we need some kind of kernel interface to avoid the faults.
> 
> The plan is:
> 
> * Move STACK_TOP (and vdso) down to give space to the bitmap.

Even for apps with 57-bit address spaces?

> * Reserve the bitmap space from (mm->start_stack + PAGE_SIZE) to cover a code
> size of TASK_SIZE_LOW, which is (TASK_SIZE_LOW / PAGE_SIZE / 8).

The bitmap size is determined by CR4.LA57, not the app.  If you place
the bitmap here, won't references to it for high addresses go into the
high address space?

Specifically, on a CR4.LA57=0 system, we have 48 bits of address space,
so 128TB for apps.  You are proposing sticking the bitmap above the
stack which is near the top of that 128TB address space.  But on a
5-level paging system with CR4.LA57=1, there could be valid data at
129GB.  Is there something keeping that data from being mistaken for
being part of the bitmap?

Also, if you're limiting it to TASK_SIZE_LOW, please don't forget that
this is yet another thing that probably won't work with the vsyscall
page.  Please make sure you consider it and mention it in your next post.

> * Mmap the space only when the app issues the first mark-legacy prctl.  This
> avoids the core-dump issue for most apps and the accounting problem that
> MAP_NORESERVE probably won't solve completely.

What is this accounting problem?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-14 Thread Dave Hansen
On 6/14/19 10:13 AM, Yu-cheng Yu wrote:
> On Fri, 2019-06-14 at 09:13 -0700, Dave Hansen wrote:
>> On 6/14/19 8:25 AM, Yu-cheng Yu wrote:
>>> The bitmap is very big.
>>
>> Really?  It's actually, what, 8*4096=32k, so 1/32,768th of the size of
>> the legacy libraries you load?  Do our crash dumps really not
>> know how to represent or deal with sparse mappings?
> 
> Ok, even the core dump is not physically big, its size still looks odd, right?

Hell if I know.

Could you please go try this in practice so that we're designing this
thing fixing real actual problems instead of phantoms that we're
anticipating?

> Could this also affect how much time for GDB to load it.

I don't know.  Can you go find out for sure, please?

> I have a related question:
> 
> Do we allow the application to read the bitmap, or any fault from the
> application on bitmap pages?

We have to allow apps to read it.  Otherwise they can't execute
instructions.

We don't have to allow them to (populating) fault on it.  But, if we
don't, we need some kind of kernel interface to avoid the faults.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-14 Thread Dave Hansen
On 6/14/19 8:25 AM, Yu-cheng Yu wrote:
> On Mon, 2019-06-10 at 15:59 -0700, Dave Hansen wrote:
>> On 6/10/19 3:40 PM, Yu-cheng Yu wrote:
>>> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and
>>> VM_DONTDUMP.  The bitmap will cover only 48-bit address space.
>>
>> Could you make sure to discuss the downsides of only doing a 48-bit
>> address space?
> 
> The downside is that we cannot load legacy lib's above 48-bit address
> space, but currently ld-linux does not do that.  Should ld-linux do that
> in the future, dlopen() fails.  Considering CRIU migration, we probably
> need to do this anyway?

Again, I was thinking about JITs.  Please remember that not all code in
the system is from files on the disk.  Please.  We need to be really,
really sure that we don't addle this implementation by being narrow
minded about this.

Please don't forget about JITs.

>> What are the reasons behind and implications of VM_DONTDUMP?
> 
> The bitmap is very big.

Really?  It's actually, what, 8*4096=32k, so 1/32,768th of the size of
the legacy libraries you load?  Do our crash dumps really not
know how to represent or deal with sparse mappings?

> In GDB, it should be easy to tell why a control-protection fault occurred
> without the bitmap.

How about why one didn't happen?


Re: [PATCH v7 25/27] mm/mmap: Add Shadow stack pages to memory accounting

2019-06-11 Thread Dave Hansen
On 6/6/19 1:06 PM, Yu-cheng Yu wrote:
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1703,6 +1703,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
>   if (file && is_file_hugepages(file))
>   return 0;
>  
> + if (arch_copy_pte_mapping(vm_flags))
> + return 1;
> +
>   return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
>  }
>  
> @@ -3319,6 +3322,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>   mm->stack_vm += npages;
>   else if (is_data_mapping(flags))
>   mm->data_vm += npages;
> + else if (arch_copy_pte_mapping(flags))
> + mm->data_vm += npages;
>  }

This classifies shadow stack as data instead of stack.  That seems a wee
bit counterintuitive.  Why did you make this choice?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 4:54 PM, Andy Lutomirski wrote:
> Another benefit of kernel management: we could plausibly auto-clear
> the bits corresponding to munmapped regions. Is this worth it?

I did it for MPX.  I think I even went to the trouble of zapping the
whole pages that got unused.

But, MPX tables took 80% of the address space, worst-case.  This takes
0.003% :)  The only case it would really matter would be a task was
long-running, used legacy executables/JITs, and was mapping/unmapping
text all over the address space.  That seems rather unlikely.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 4:20 PM, H.J. Lu wrote:
>>> Perhaps we still let the app fill the bitmap?
>> I think I'd want to see some performance data on it first.
> Updating legacy bitmap in user space from kernel requires
> 
> long q;
> 
> get_user(q, ...);
> q |= mask;
> put_user(q, ...);
> 
> instead of
> 
> *p |= mask;
> 
> get_user + put_user was quite slow when we tried before.

Numbers, please.

There are *lots* of ways to speed something like that up if you have
actual issues with it.  For instance, you can skip the get_user() for
whole bytes.  You can write bits with 0's for unallocated address space.
 You can do user_access_begin/end() to avoid bunches of STAC/CLACs...

The list goes on and on. :)
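As a rough illustration of the "skip the get_user() for whole bytes" idea, here is a userspace sketch (not kernel code): the interior bytes of the range are written wholesale with no read at all, so only the two partial bytes at the edges would need the read-modify-write cycle that get_user()/put_user() implies.

```c
#include <stdint.h>
#include <string.h>

/* Mark bits [start, end) in a legacy-code bitmap.  Interior bytes are
 * set with a single memset() (no prior read needed); only the partial
 * bytes at the two edges need a read-modify-write. */
static void mark_legacy_bits(uint8_t *bitmap, uint64_t start, uint64_t end)
{
	uint64_t first_full = (start + 7) / 8;
	uint64_t last_full = end / 8;

	if (first_full > last_full) {		/* range within one byte */
		for (uint64_t b = start; b < end; b++)
			bitmap[b / 8] |= 1u << (b % 8);
		return;
	}
	for (uint64_t b = start; b < first_full * 8; b++)
		bitmap[b / 8] |= 1u << (b % 8);	/* leading partial byte */
	memset(bitmap + first_full, 0xff, last_full - first_full);
	for (uint64_t b = last_full * 8; b < end; b++)
		bitmap[b / 8] |= 1u << (b % 8);	/* trailing partial byte */
}
```
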


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 3:40 PM, Yu-cheng Yu wrote:
> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and
> VM_DONTDUMP.  The bitmap will cover only 48-bit address space.

Could you make sure to discuss the downsides of only doing a 48-bit
address space?

What are the reasons behind and implications of VM_DONTDUMP?

> We then create PR_MARK_CODE_AS_LEGACY.  The kernel will set the bitmap, but it
> is going to be slow.

Slow compared to what?  We're effectively adding one (quick) system call
to a path that, today, has at *least* half a dozen syscalls and probably
a bunch of page faults.  Heck, we can probably avoid the actual page
fault to populate the bitmap if we're careful.  That alone would put a
syscall on equal footing with any other approach.  If the bit setting
crossed a page boundary it would probably win.

> Perhaps we still let the app fill the bitmap?

I think I'd want to see some performance data on it first.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 1:58 PM, Yu-cheng Yu wrote:
>>> On each memory request, the kernel then must consider a percentage of
>>> allocated space in its calculation, and on systems with less memory
>>> this quickly becomes a problem.
>> I'm not sure what you're referring to here?  Are you referring to our
>> overcommit limits?
> Yes.

My assumption has always been that these large, potentially sparse
hardware tables *must* be mmap()'d with MAP_NORESERVE specified.  That
should keep them from being problematic with respect to overcommit.



Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 1:27 PM, Yu-cheng Yu wrote:
>>> If the loader cannot allocate a big bitmap to cover all 5-level
>>> address space (the bitmap will be large), it can put all legacy lib's
>>> at lower address.  We cannot do these easily in the kernel.
>> This is actually an argument to do it in the kernel.  The kernel can
>> always allocate the virtual space however it wants, no matter how large.
>>  If we hide the bitmap behind a kernel API then we can put it at high
>> 5-level user addresses because we also don't have to worry about the
>> high bits confusing userspace.
> We actually tried this.  The kernel needs to reserve the bitmap space in the
> beginning for every CET-enabled app, regardless of actual needs. 

I don't think this is a problem.  In fact, I think reserving the space
is actually the only sane behavior.  If you don't reserve it, you
fundamentally limit where future legacy instructions can go.

One idea is that we always size the bitmap for the 48-bit addressing
systems.  Legacy code probably doesn't _need_ to go in the new address
space, and if we do this we don't have to worry about the gigantic
57-bit address space bitmap.
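For scale, a quick back-of-envelope: one bitmap bit covers one 4 KiB code page, so the bitmap's virtual size is the address space divided by 4096 * 8.

```c
#include <stdint.h>

/* One bitmap bit per 4 KiB page: size = address_space / 4096 / 8,
 * i.e. address_space >> 15. */
static uint64_t ibt_bitmap_bytes(unsigned int va_bits)
{
	return (1ULL << va_bits) >> 15;
}
```

That works out to 8 GiB of virtual space for a 48-bit address space versus 4 TiB for 57 bits, which is why sizing for the 48-bit case sidesteps the gigantic bitmap.
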

> On each memory request, the kernel then must consider a percentage of
> allocated space in its calculation, and on systems with less memory
> this quickly becomes a problem.

I'm not sure what you're referring to here?  Are you referring to our
overcommit limits?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 12:38 PM, Yu-cheng Yu wrote:
>>> When an application starts, its highest stack address is determined.
>>> It uses that as the maximum the bitmap needs to cover.
>> Huh, I didn't think we ran code from the stack. ;)
>>
>> Especially given the way that we implemented the new 5-level-paging
>> address space, I don't think that expecting code to be below the stack
>> is a good universal expectation.
> Yes, you make a good point.  However, allowing the application to manage
> the bitmap is the most efficient and flexible.  If the loader finds a
> legacy lib is beyond what the bitmap can cover, it can deal with the
> problem by moving the lib to a lower address; or re-allocate the bitmap.

How could the loader reallocate the bitmap and coordinate with other
users of the bitmap?

> If the loader cannot allocate a big bitmap to cover all 5-level
> address space (the bitmap will be large), it can put all legacy lib's
> at lower address.  We cannot do these easily in the kernel.

This is actually an argument to do it in the kernel.  The kernel can
always allocate the virtual space however it wants, no matter how large.
 If we hide the bitmap behind a kernel API then we can put it at high
5-level user addresses because we also don't have to worry about the
high bits confusing userspace.



Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 8:22 AM, Yu-cheng Yu wrote:
>> How does glibc know the linear address space size?  We don’t want LA64 to
>> break old binaries because the address calculation changed.
> When an application starts, its highest stack address is determined.
> It uses that as the maximum the bitmap needs to cover.

Huh, I didn't think we ran code from the stack. ;)

Especially given the way that we implemented the new 5-level-paging
address space, I don't think that expecting code to be below the stack
is a good universal expectation.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-10 Thread Dave Hansen
On 6/10/19 9:05 AM, Yu-cheng Yu wrote:
> On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote:
>> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>>>> Huh, how does glibc know about all possible past and future legacy code
>>>> in the application?
>>> When dlopen() gets a legacy binary and the policy allows that, it will
>>> manage
>>> the bitmap:
>>>
>>>   If a bitmap has not been created, create one.
>>>   Set bits for the legacy code being loaded.
>> I was thinking about code that doesn't go through GLIBC like JITs.
> If JIT manages the bitmap, it knows where it is.
> It can always read the bitmap again, right?

Let's just be clear:

The design proposed here is that all code mappers (anybody wanting to
get legacy non-CET code into the address space):

1. Know about CET
2. Know where the bitmap is, and identify the part that needs to be
   changed
3. Be able to mprotect() the bitmap to be writable (undoing glibc's
   PROT_READ)
4. Set the bits in the bitmap for the legacy code
5. mprotect() the bitmap back to PROT_READ

Do the non-glibc code mappers have glibc interfaces for this?
Otherwise, how could a bunch of JITs in a big multi-threaded application
possibly coordinate the mprotect()s?  Won't they race with each other?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>> Huh, how does glibc know about all possible past and future legacy code
>> in the application?
> When dlopen() gets a legacy binary and the policy allows that, it will manage
> the bitmap:
> 
>   If a bitmap has not been created, create one.
>   Set bits for the legacy code being loaded.

I was thinking about code that doesn't go through GLIBC like JITs.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 1:40 PM, Andy Lutomirski wrote:
>>> Hmm.  Can we be creative and skip populating it with zeros?  The
>>> CPU should only ever touch a page if we miss an ENDBR on it, so, in
>>> normal operation, we don’t need anything to be there.  We could try
>>> to prevent anyone from *reading* it outside of ENDBR tracking if we
>>> want to avoid people accidentally wasting lots of memory by forcing
>>> it to be fully populated when they read it.
>> 
>> Won't reads on a big, contiguous private mapping get the huge zero
>> page anyway?
> 
> The zero pages may be free, but the page tables could be decently
> large.  Does the core mm code use huge, immense, etc huge zero pages?
> Or can it synthesize them by reusing page table pages that map zeros?

IIRC, we only ever fill single PMDs, even though we could gang a pmd
page up and do it for 1GB areas too.

I guess the page table consumption could really suck if we had code all
over the 57-bit address space and that code moved around and the process
ran for a long long time.  Pathologically, we need a ulong/pmd_t for
each 2MB of address space which is 8 * 2^(57-21) = 512GB per process.  Yikes.
 Right now, we'd at least detect the memory consumption and OOM-kill the
process(es) eventually.  But, that's not really _this_ patch's problem.
 It's a general problem, and doesn't even require the zero page to be
mapped all over.
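(Checking that worst-case arithmetic in a userspace sketch: one 8-byte pmd_t per 2 MiB, i.e. 2^21 bytes, slice of the address space:)

```c
#include <stdint.h>

/* One 8-byte pmd_t per 2 MiB (1 << 21) slice of a va_bits-wide
 * address space. */
static uint64_t pmd_bytes(unsigned int va_bits)
{
	return 8ULL << (va_bits - 21);
}
```

That gives 512 GiB of page-table overhead for the full 57-bit space, versus 1 GiB for 48 bits.
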

Longer-term, I'd much rather see us add some page table reclaim
mechanism that knew how to go after things like excessive page tables in
MAP_NORESERVE areas.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 12:49 PM, Yu-cheng Yu wrote:
>>
>> This also gives us an excellent opportunity to make it read-only as seen from
>> userspace to prevent exploits from just poking it full of ones before
>> redirecting execution.
> GLIBC sets bits only for legacy code, and then makes the bitmap read-only.
> That avoids most issues:
> 
>   To populate bitmap pages, mprotect() is required.
>   Reading zero bitmap pages would not waste more physical memory, right?

Huh, how does glibc know about all possible past and future legacy code
in the application?


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/6/19 1:09 PM, Yu-cheng Yu wrote:
> + modify_fpu_regs_begin();
> + rdmsrl(MSR_IA32_U_CET, r);
> + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> + wrmsrl(MSR_IA32_U_CET, r);
> + modify_fpu_regs_end();

Isn't there a bunch of other stuff in this MSR?  It seems like the
bitmap value would allow overwriting lots of bits in the MSR that have
nothing to do with the bitmap... in a prctl() that's supposed to only be
dealing with the bitmap.



Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 11:29 AM, Andy Lutomirski wrote:
...
>> I think this new MSR probably needs to get included in oops output when
>> CET is enabled.
> 
> This shouldn’t be able to OOPS because it only happens at CPL 3,
> right?  We should put it into core dumps, though.

Good point.

Yu-cheng, can you just confirm that the bitmap can't be referenced in
ring-0, no matter what?  We should also make sure that no funny business
happens if we put an address in the bitmap that faults, or is
non-canonical.  Do we have any self-tests for that?

Let's say userspace gets a fault on this.  Do they have the
introspection capability to figure out why they faulted, say in their
signal handler?

>> Why don't we require that a VMA be in place for the entire bitmap?
>> Don't we need a "get" prctl function too in case something like a JIT is
>> running and needs to find the location of this bitmap to set bits itself?
>>
>> Or, do we just go whole-hog and have the kernel manage the bitmap
>> itself. Our interface here could be:
>>
>>prctl(PR_MARK_CODE_AS_LEGACY, start, size);
>>
>> and then have the kernel allocate and set the bitmap for those code
>> locations.
> 
> Given that the format depends on the VA size, this might be a good
> idea.

Yeah, making userspace know how large the address space is or could be
is rather nasty, especially if we ever get any fancy CPU features that
eat up address bits (a la ARM top-byte-ignore or SPARC ADI).

> Hmm.  Can we be creative and skip populating it with zeros?  The CPU
> should only ever touch a page if we miss an ENDBR on it, so, in normal
> operation, we don’t need anything to be there.  We could try to prevent
> anyone from *reading* it outside of ENDBR tracking if we want to avoid
> people accidentally wasting lots of memory by forcing it to be fully
> populated when they read it.

Won't reads on a big, contiguous private mapping get the huge zero page
anyway?

> The one downside is this forces it to be per-mm, but that seems like
> a generally reasonable model anyway.

Yeah, practically, you could only make it shared if you shared the
layout of all code in the address space.  I'm sure the big database(s)
do that cross-process, but I bet nobody else does.  User ASLR
practically guarantees that nobody can do this.

> This also gives us an excellent opportunity to make it read-only as
> seen from userspace to prevent exploits from just poking it full of
> ones before redirecting execution.

That would be fun.



Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 10:43 AM, Peter Zijlstra wrote:
> I've no idea what the kernel should do; since you failed to answer the
> question what happens when you point this to garbage.
> 
> Does it then fault or what?

Yeah, I think you'll fault with a rather mysterious CR2 value since
you'll go look at the instruction that faulted and not see any
references to the CR2 value.

I think this new MSR probably needs to get included in oops output when
CET is enabled.

Why don't we require that a VMA be in place for the entire bitmap?
Don't we need a "get" prctl function too in case something like a JIT is
running and needs to find the location of this bitmap to set bits itself?

Or, do we just go whole-hog and have the kernel manage the bitmap
itself. Our interface here could be:

prctl(PR_MARK_CODE_AS_LEGACY, start, size);

and then have the kernel allocate and set the bitmap for those code
locations.


Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function

2019-06-07 Thread Dave Hansen
On 6/7/19 9:35 AM, Andy Lutomirski wrote:
> One might reasonably wonder why this state is privileged in the first
> place and, given that, why we’re allowing it to be written like
> this.

I think it's generally a good architectural practice to make things like
this privileged.  They're infrequent so can survive the cost of a trip
in/out of the kernel and are a great choke point to make sure the OS is
involved.  I wish we had the same for MPX or pkeys per-task "setup".


Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states

2019-06-06 Thread Dave Hansen



On 6/6/19 3:04 PM, Andy Lutomirski wrote:
>> But, that seems broken.  If we have supervisor state, we can't 
>> always defer the load until return to userspace, so we'll never?? 
>> have TIF_NEED_FPU_LOAD.  That would certainly be true for 
>> cet_kernel_state.
> 
> Ugh. I was sort of imagining that we would treat supervisor state
> completely separately from user state.  But can you maybe give
> examples of exactly what you mean?
> 
>> It seems like we actually need three classes of XSAVE states: 1. 
>> User state
> 
> This is FPU, XMM, etc, right?

Yep.

>> 2. Supervisor state that affects user mode
> 
> User CET?

Yep.

>> 3. Supervisor state that affects kernel mode
> 
> Like supervisor CET?  If we start doing supervisor shadow stack, the 
> context switches will be real fun.  We may need to handle this in 
> asm.

Yeah, that's what I was thinking.

I have the feeling Yu-cheng's patches don't comprehend this since
Sebastian's patches went in after he started working on shadow stacks.

> Where does PKRU fit in?  Maybe we can treat it as #3?

I thought Sebastian added specific PKRU handling to make it always
eager.  It's actually user state that affects kernel mode. :)


Re: [PATCH v7 04/27] x86/fpu/xstate: Introduce XSAVES system states

2019-06-06 Thread Dave Hansen
> +/*
> + * Helpers for changing XSAVES system states.
> + */
> +static inline void modify_fpu_regs_begin(void)
> +{
> + fpregs_lock();
> + if (test_thread_flag(TIF_NEED_FPU_LOAD))
> + __fpregs_load_activate();
> +}
> +
> +static inline void modify_fpu_regs_end(void)
> +{
> + fpregs_unlock();
> +}

These are massively under-commented and under-changelogged.  This looks
like it's intended to ensure that we have supervisor FPU state for this
task loaded before we go and run the MSRs that might be modifying it.

But, that seems broken.  If we have supervisor state, we can't always
defer the load until return to userspace, so we'll never?? have
TIF_NEED_FPU_LOAD.  That would certainly be true for cet_kernel_state.

It seems like we actually need three classes of XSAVE states:
1. User state
2. Supervisor state that affects user mode
3. Supervisor state that affects kernel mode

We can delay the load of 1 and 2, but not 3.

But I don't see any infrastructure for this.


Re: [PATCH v10 07/12] fs, arm64: untag user pointers in fs/userfaultfd.c

2019-03-01 Thread Dave Hansen
On 3/1/19 8:59 AM, Catalin Marinas wrote:
>>> So, we have to patch all these sites before the tagged values get to the
>>> point of hitting the vma lookup functions.  Dumb question: Why don't we
>>> just patch the vma lookup functions themselves instead of all of these
>>> callers?
>> That might be a working approach as well. We'll still need to fix up
>> places where the vma fields are accessed directly. Catalin, what do
>> you think?
> Most callers of find_vma*() always follow it by a check of
> vma->vma_start against some tagged address ('end' in the
> userfaultfd_(un)register()) case. So it's not sufficient to untag it in
> find_vma().

If that's truly the common case, sounds like we should have a find_vma()
that does the vma_end checking as well.  Then at least the common case
would not have to worry about tagging.
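A userspace mock of what such a combined helper could look like — the kernel's find_vma_intersection() is essentially this shape (names and list layout here are illustrative, not the kernel's rbtree/maple tree):

```c
#include <stddef.h>

/* Toy vma list, sorted by address, no overlaps. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *next;
};

/* First vma whose vm_end is above addr (may start above addr too). */
static struct vma *find_vma(struct vma *head, unsigned long addr)
{
	for (struct vma *v = head; v; v = v->next)
		if (addr < v->vm_end)
			return v;
	return NULL;
}

/* The combined lookup callers could use instead of open-coding the
 * vm_start check after every find_vma(): NULL if [start, end) falls
 * entirely in a hole. */
static struct vma *find_vma_intersection(struct vma *head,
					 unsigned long start,
					 unsigned long end)
{
	struct vma *v = find_vma(head, start);

	if (v && end <= v->vm_start)
		v = NULL;	/* range sits wholly below v */
	return v;
}
```
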


Re: [PATCH 04/14] x86 topology: Add CPUID.1F multi-die/package support

2019-02-26 Thread Dave Hansen
On 2/25/19 10:20 PM, Len Brown wrote:
> -/* leaf 0xb sub-leaf types */
> +/* extended topology sub-leaf types */
>  #define INVALID_TYPE 0
>  #define SMT_TYPE 1
>  #define CORE_TYPE 2
> +#define DIE_TYPE 5

Looking in the SDM, Vol. 3A "8.9.1 Hierarchical Mapping of Shared
Resources", there are a _couple_ of new levels: Die, Tile and Module.
But, this patch only covers Dies.

Was there a reason for that?

I wonder if we'll end up with different (better) infrastructure if we do
these all at once instead of hacking them in one at a time.


Re: [PATCH v10 00/12] arm64: untag user pointers passed to the kernel

2019-02-26 Thread Dave Hansen
On 2/26/19 9:18 AM, Andrey Konovalov wrote:
>> This seems like something
>> where we would ideally add an __tagged annotation (or something) to the
>> source tree and then have sparse rules that can look for missed untags.
> This has been suggested before, search for __untagged here [1].
> However there are many places in the kernel where a __user pointer is
> casted into unsigned long and passed further. I'm not sure if it's
> possible to apply a __tagged/__untagged kind of attribute to non-pointer
> types, is it?
> 
> [1] https://patchwork.kernel.org/patch/10581535/

I believe we have sparse checking __GFP_* flags.  We also have a gfp_t
for them and I'm unsure whether the sparse support is tied to _that_ or
whether it's just by tagging the type itself as being part of a discrete
address space.


Re: [PATCH v10 04/12] mm, arm64: untag user pointers passed to memory syscalls

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote:
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -578,6 +578,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
>   unsigned long, prot)
>  {
> + start = untagged_addr(start);
>   return do_mprotect_pkey(start, len, prot, -1);
>  }
>  
> @@ -586,6 +587,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
>  SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
>   unsigned long, prot, int, pkey)
>  {
> + start = untagged_addr(start);
>   return do_mprotect_pkey(start, len, prot, pkey);
>  }

This seems to have taken the approach of going as close as possible to
the syscall boundary and untagging the pointer there.  I guess that's
OK, but it does lead to more churn than necessary.  For instance, why
not just do the untagging in do_mprotect_pkey()?

I think that's an overall design question.  I kinda asked the same thing
about patching call sites vs. VMA lookup functions.


Re: [PATCH v10 07/12] fs, arm64: untag user pointers in fs/userfaultfd.c

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote:
> userfaultfd_register() and userfaultfd_unregister() use provided user
> pointers for vma lookups, which can only be done with untagged pointers.

So, we have to patch all these sites before the tagged values get to the
point of hitting the vma lookup functions.  Dumb question: Why don't we
just patch the vma lookup functions themselves instead of all of these
callers?


Re: [PATCH v10 06/12] fs, arm64: untag user pointers in copy_mount_options

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote:
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2730,7 +2730,7 @@ void *copy_mount_options(const void __user * data)
>* the remainder of the page.
>*/
>   /* copy_from_user cannot cross TASK_SIZE ! */
> - size = TASK_SIZE - (unsigned long)data;
> + size = TASK_SIZE - (unsigned long)untagged_addr(data);
>   if (size > PAGE_SIZE)
>   size = PAGE_SIZE;

I would have thought that copy_from_user() *is* entirely capable of
detecting and returning an error in the case that its arguments cross
TASK_SIZE.  It will fail and return an error, but that's what it's
supposed to do.

I'd question why this code needs to be doing its own checking in the
first place.  Is there something subtle going on?


Re: [PATCH v10 00/12] arm64: untag user pointers passed to the kernel

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote:
> The following testing approaches have been taken to find potential issues
> with user pointer untagging:
> 
> 1. Static testing (with sparse [3] and separately with a custom static
>analyzer based on Clang) to track casts of __user pointers to integer
>types to find places where untagging needs to be done.

First of all, it's really cool that you took this approach.  Sounds like
there was a lot of systematic work to fix up the sites in the existing
codebase.

But, isn't this a _bit_ fragile going forward?  Folks can't just "make
sparse" to find issues with missing untags.  This seems like something
where we would ideally add a __tagged annotation (or something) to the
source tree and then have sparse rules that can look for missed untags.


Re: [PATCH] drop_caches: Allow unmapping pages

2019-01-07 Thread Dave Hansen
On 1/7/19 6:15 AM, Matthew Wilcox wrote:
> You're going to get data corruption doing this.  try_to_unmap_one()
> does:
> 
>  /* Move the dirty bit to the page. Now the pte is gone. */
>  if (pte_dirty(pteval))
>  set_page_dirty(page);
> 
> so PageDirty() can be false above, but made true by calling
> try_to_unmap().

I don't think that PageDirty() check is _required_ for correctness.  You
can always safely try_to_unmap() no matter the state of the PTE.  We
can't lock out the hardware from setting the Dirty bit, ever.

It's also just fine to unmap PageDirty() pages, as long as when the PTE
is created, we move the dirty bit from the PTE into the 'struct page'
(which try_to_unmap() does, as you noticed).

> I also think the way you've done this is expedient at the cost of
> efficiency and layering violations.  I think you should first tear
> down the mappings of userspace processes (which will reclaim a lot
> of pages allocated to page tables), then you won't need to touch the
> invalidate_inode_pages paths at all.

By "tear down the mappings", do you mean something analogous to munmap()
where the VMA goes away?  Or madvise(MADV_DONTNEED) where the PTE is
destroyed but the VMA remains?

Last time I checked, we only did free_pgtables() when tearing down VMAs,
but not for pure unmappings like reclaim or MADV_DONTNEED.  I've thought
it might be fun to make a shrinker that scanned page tables looking for
zero'd pages, but I've never run into a system where empty page table
pages were actually causing a noticeable problem.


Re: [PATCH v11 00/13] Intel SGX1 support

2018-12-11 Thread Dave Hansen
On 12/10/18 3:12 PM, Josh Triplett wrote:
>> Or maybe even python/shell scripts? It looked to me like virtual
>> memory will be "interesting" for enclaves.
> Memory management doesn't seem that hard to deal with.

The problems are:

1. SGX enclave memory (EPC) is statically allocated at boot and can't
   grow or shrink
2. EPC is much smaller than regular RAM
3. The core VM has no comprehension of EPC use, thus can not help
   with its algorithms, like the LRU
4. The SGX driver implements its own VM which is substantially simpler
   than the core VM, but less feature-rich, fast, or scalable


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-09 Thread Dave Hansen
On 11/9/18 9:20 AM, Dave Hansen wrote:
> On 11/9/18 9:17 AM, Matthew Wilcox wrote:
>>> But, later versions of the hardware have instructions that don't have
>>> static offsets for the state components (when the XSAVES/XSAVEC
>>> instructions are used).  So, for those, the structure embedding isn't
>>> used at *all* since some state might not be present.
>> But *when present*, this structure is always aligned on an 8-byte
>> boundary, right?

Practically, though, I think it ends up always being aligned on an
8-byte boundary.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-09 Thread Dave Hansen
On 11/9/18 9:17 AM, Matthew Wilcox wrote:
>> But, later versions of the hardware have instructions that don't have
>> static offsets for the state components (when the XSAVES/XSAVEC
>> instructions are used).  So, for those, the structure embedding isn't
>> used at *all* since some state might not be present.
> But *when present*, this structure is always aligned on an 8-byte
> boundary, right?

There's no guarantee of that.

There is an "aligned" attribute for each XSAVE state component, but I do
not believe it is set for anything yet.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-09 Thread Dave Hansen
On 11/8/18 4:32 PM, Matthew Wilcox wrote:
>> Now, looking at Yu-cheng's specific example, it doesn't matter.  We've
>> got 64-bit types and natural 64-bit alignment.  Without __packed, we
>> need to look out for natural alignment screwing us up.  With __packed,
>> it just does what it *looks* like it does.
> The question is whether Yu-cheng's struct is ever embedded in another
> struct.  And if so, what does the hardware do?

It's not really.

+struct cet_user_state {
+   u64 u_cet;  /* user control flow settings */
+   u64 user_ssp;   /* user shadow stack pointer */
+} __packed;

This ends up embedded in 'struct fpu'.  The hardware tells us what the
sum of all the sizes of all the state components are, and also tells us
the offsets inside the larger buffer.

We double-check that the structure sizes exactly match the sizes that
the hardware tells us that the buffer pieces are via XCHECK_SZ().

But, later versions of the hardware have instructions that don't have
static offsets for the state components (when the XSAVES/XSAVEC
instructions are used).  So, for those, the structure embedding isn't
used at *all* since some state might not be present.



Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Dave Hansen
On 11/8/18 2:00 PM, Matthew Wilcox wrote:
> struct a {
>   char c;
>   struct b b;
> };
> 
> we want struct b to start at offset 8, but with __packed, it will start
> at offset 1.

You're talking about how we want the struct laid out in memory if we
have control over the layout.  I'm talking about what happens if
something *else* tells us the layout, like a hardware specification
which is what is in play with the XSAVE instruction dictated layout
that's in question here.

What I'm concerned about is a structure like this:

struct foo {
u32 i1;
u64 i2;
};

If we leave that to natural alignment, we end up with a 16-byte
structure laid out like this:

0-3 i1
3-8 alignment gap
8-15i2

Which isn't what we want.  We want a 12-byte structure, laid out like this:

0-3 i1
4-11i2

Which we get with:


struct foo {
u32 i1;
u64 i2;
} __packed;

Now, looking at Yu-cheng's specific example, it doesn't matter.  We've
got 64-bit types and natural 64-bit alignment.  Without __packed, we
need to look out for natural alignment screwing us up.  With __packed,
it just does what it *looks* like it does.


Re: [PATCH v5 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

2018-11-08 Thread Dave Hansen
On 11/8/18 1:22 PM, Andy Lutomirski wrote:
>> +struct cet_kernel_state {
>> +   u64 kernel_ssp; /* kernel shadow stack */
>> +   u64 pl1_ssp;/* ring-1 shadow stack */
>> +   u64 pl2_ssp;/* ring-2 shadow stack */
>> +} __packed;
>> +
> Why are these __packed?  It seems like it'll generate bad code for no
> obvious purpose.

It's a hardware-defined in-memory structure.  Granted, we'd need a
really wonky compiler to make that anything *other* than a nicely-packed
24-byte structure, but the __packed makes it explicit.

It is probably a really useful long-term thing to stop using __packed
and start using "__hw_defined" or something that #defines down to __packed.


Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-07 Thread Dave Hansen
On 11/7/18 8:30 AM, Jarkko Sakkinen wrote:
>> Does this code run when I type "make kselftest"?  If not, I think we
>> should rectify that.
> No, it doesn't. It is just my backup for the non-SDK user space code
> that I've made that I will use to fork my user space SGX projects in
> the future.
> 
> I can work-out a selftest (and provide a new patch in the series) but
> I'm still wondering what the enclave should do. I would suggest that
> we start with an enclave that does just EEXIT and nothing else.

Yeah, that's a start.  But, a good selftest would include things like
faults and error conditions.


Re: [PATCH v5 21/27] x86/cet/shstk: Introduce WRUSS instruction

2018-11-06 Thread Dave Hansen
On 10/11/18 8:15 AM, Yu-cheng Yu wrote:
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1305,6 +1305,15 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
>   error_code |= X86_PF_USER;
>   flags |= FAULT_FLAG_USER;
>   } else {
> + /*
> +  * WRUSS is a kernel instruction and but writes
> +  * to user shadow stack.  When a fault occurs,
> +  * both X86_PF_USER and X86_PF_SHSTK are set.
> +  * Clear X86_PF_USER here.
> +  */
> + if ((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> + (X86_PF_USER | X86_PF_SHSTK))
> + error_code &= ~X86_PF_USER;

This hunk of code basically points out that the architecture of WRUSS is
broken for Linux.  The setting of X86_PF_USER for a ring-0 instruction
really is a mis-feature of the architecture for us and we *undo* it in
software which is unfortunate.  Wish I would have caught this earlier.

Andy, note that this is another case where hw_error_code and
sw_error_code will diverge, unfortunately.

Anyway, this is going to necessitate some comment updates in the page
fault code.  Yu-cheng, you are going to collide with some recent changes
I made to the page fault code.  Please be careful with the context when
you do the merge and make sure that all the new comments stay correct.


Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-06 Thread Dave Hansen
On 11/5/18 9:49 PM, Jarkko Sakkinen wrote:
> On Mon, Nov 05, 2018 at 12:27:11PM -0800, Dave Hansen wrote:
>> The ABI seems entirely undocumented and rather lightly designed, which
>> seems like something we should fix before this is merged.
> 
> ABI is documented in arch/x86/include/uapi/asm/sgx.h, from which the
> documentation is included into intel_sgx.rst. I'm not saying that there is
> no room to refine it, but it is not undocumented either.

I specifically mean the instruction flow around asynchronous exits or
explicit enclave exit calls via EEXIT.  Signals are part of the ABI but
go unmentioned in the documentation.

It's also worth noting that EENTER *can* act (from the kernel's
perspective) like an instruction that both jumps and sets a bunch of
registers (including %rsp).  It's certainly abnormal in that regard.

In fact, in the docs:

> +Enclave can only execute code inside the ELRANGE. Instructions that may cause
> +VMEXIT, IO instructions and instructions that require a privilege change are
> +prohibited inside the enclave. Interrupts and exceptions always cause enclave
> +to exit and jump to an address outside the enclave given when the enclave is
> +entered by using the leaf instruction ENCLS(EENTER).

it's probably a really good idea to explain that the address outside of
the enclave is enclave-provided, and is not, for instance, just the next
instruction after EENTER.

>> Also, for a feature as massive and complicated as this one, it seems
>> irresponsible to not have a selftest.  Is that not feasible for some reason?
> 
> I do have the in-kernel launch enclave stuff backed up here:
> 
> https://github.com/jsakkine-intel/sgx-le-host
> https://github.com/jsakkine-intel/sgx-le
> 
> This is about as simple as it gets without any type of run-time.

Does this code run when I type "make kselftest"?  If not, I think we
should rectify that.


Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-05 Thread Dave Hansen
On 11/2/18 4:11 PM, Jarkko Sakkinen wrote:
>  Documentation/index.rst |   1 +
>  Documentation/x86/intel_sgx.rst | 185 
>  2 files changed, 186 insertions(+)
>  create mode 100644 Documentation/x86/intel_sgx.rst

This patch set establishes an ABI.  It basically sets in stone a bunch
of behaviors that the enclave, the kernel, and the out-of-enclave code
must follow.

There are a bunch of things that the enclave is capable and/or permitted
to do to %rsp or %rip, for instance.

The ABI seems entirely undocumented and rather lightly designed, which
seems like something we should fix before this is merged.

Also, for a feature as massive and complicated as this one, it seems
irresponsible to not have a selftest.  Is that not feasible for some reason?


Re: [PATCH 10/17] prmem: documentation

2018-10-30 Thread Dave Hansen
On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> Does this satisfy Igor's requirements?  We wouldn't be able to
> copy_to/from_user() while rare_mm was active.  I think that's a feature
> though!  It certainly satisfies my interests (kernel code be able to
> mark things as dynamically-allocated-and-read-only-after-initialisation)

It has to be more than copy_to/from_user(), though, I think.

rare_modify(q) either has to preempt_disable(), or we need to teach the
context-switching code when and how to switch in/out of the rare_mm.
preempt_disable() would also keep us from sleeping.


Re: [PATCH v14 19/19] x86/sgx: Driver documentation

2018-10-17 Thread Dave Hansen
On 10/15/2018 01:54 PM, Pavel Machek wrote:
>> +Intel(R) SGX is a set of CPU instructions that can be used by applications to
>> +set aside private regions of code and data. The code outside the enclave is
>> +disallowed to access the memory inside the enclave by the CPU access control.
>> +In a way you can think that SGX provides inverted sandbox. It protects the
>> +application from a malicious host.
> Well, recently hardware had some problems keeping its
> promises. So... what about rowhammer, meltdown and spectre?

There's a ton of documentation out there about what kinds of protections
SGX provides.  I don't think this is an appropriate place to have an
exhaustive discussion about it.  But, there's extensive discussion of it
on Intel's security site:

https://software.intel.com/security-software-guidance/

There's documentation on how L1TF affects SGX here:

https://software.intel.com/security-software-guidance/software-guidance/l1-terminal-fault

Or Spectre v2 here:

https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass

> Which ones apply, which ones do not, and on what cpu generations?

The CVEs list this in pretty exhaustive detail.  The L1TF/SGX one, for
example:

https://nvd.nist.gov/vuln/detail/CVE-2018-3615

Lists a bunch of processor models.



Re: [PATCH v5 07/27] mm/mmap: Create a guard area between VMAs

2018-10-11 Thread Dave Hansen
On 10/11/2018 08:15 AM, Yu-cheng Yu wrote:
> Create a guard area between VMAs to detect memory corruption.

This is a pretty major change that has a bunch of end-user implications.
It's not dependent on any debugging options and can't be turned on/off
by individual apps, at runtime, or even at boot.

Its connection to this series is also tenuous and not spelled out in the
exceptionally terse changelog.


Re: [PATCH v5 00/27] Control Flow Enforcement: Shadow Stack

2018-10-11 Thread Dave Hansen
On 10/11/2018 08:14 AM, Yu-cheng Yu wrote:
> The previous version of CET Shadow Stack patches is at the following
> link:
> 
>   https://lkml.org/lkml/2018/9/21/776

Why are you posting these?  Do you want more review?  Do you simply want
the series applied?


Re: [RFC PATCH v4 09/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW

2018-10-03 Thread Dave Hansen
On 10/03/2018 06:38 AM, Matthew Wilcox wrote:
> On Fri, Sep 21, 2018 at 08:03:33AM -0700, Yu-cheng Yu wrote:
>> We are going to create _PAGE_DIRTY_SW for non-hardware, memory
>> management purposes.  Rename _PAGE_DIRTY to _PAGE_DIRTY_HW and
>> _PAGE_BIT_DIRTY to _PAGE_BIT_DIRTY_HW to make these PTE dirty
>> bits more clear.  There are no functional changes in this
>> patch.
> I would like there to be some documentation in this patchset which
> explains the difference between PAGE_SOFT_DIRTY and PAGE_DIRTY_SW.
> 
> Also, is it really necessary to rename PAGE_DIRTY?  It feels like a
> lot of churn.

This is a lot of churn?  Are we looking at the same patch? :)

 arch/x86/include/asm/pgtable.h   |  6 +++---
 arch/x86/include/asm/pgtable_types.h | 17 +
 arch/x86/kernel/relocate_kernel_64.S |  2 +-
 arch/x86/kvm/vmx.c   |  2 +-
 4 files changed, 14 insertions(+), 13 deletions(-)

But, yeah, I think we need to.  While it will take a little adjustment
in the brains of us old-timers and a bit of pain when switching from old
kernels to new, this makes it a lot more clear what is going on.


Re: [RFC PATCH v4 02/27] x86/fpu/xstate: Change some names to separate XSAVES system and user states

2018-10-02 Thread Dave Hansen
On 10/02/2018 09:37 AM, Borislav Petkov wrote:
> This patch's commit message is not even close. So I'd very much
> appreciate a more verbose explanation, even if it repeats itself at
> places.

Yep, totally agree.


Re: [RFC PATCH v4 02/27] x86/fpu/xstate: Change some names to separate XSAVES system and user states

2018-10-02 Thread Dave Hansen
On 10/02/2018 09:21 AM, Yu-cheng Yu wrote:
> On Tue, 2018-10-02 at 17:29 +0200, Borislav Petkov wrote:
>> On Fri, Sep 21, 2018 at 08:03:26AM -0700, Yu-cheng Yu wrote:
>>> To support XSAVES system states, change some names to distinguish
>>> user and system states.
>> I don't understand what the logic here is. SDM says:
>>
>> XSAVES—Save Processor Extended States Supervisor
>>
>> the stress being on "Supervisor" - why does it need to be renamed to
>> "system" now?
>>
> Good point.  However, "system" is more indicative; CET states are per-task and
> not "Supervisor".  Do we want to go back to "Supervisor" or add comments?

This is one of those things where the SDM language does not match what
we use in the kernel.  I think it's fine to call them "system" or
"kernel" states to make it consistent with our existing in-kernel
nomenclature.

I say add comments to clarify what the SDM calls it vs. what we do.


Re: [RFC PATCH v4 00/27] Control Flow Enforcement: Shadow Stack

2018-09-21 Thread Dave Hansen
On 09/21/2018 08:03 AM, Yu-cheng Yu wrote:
> The previous version of CET patches can be found in the following
> link:
> 
>   https://lkml.org/lkml/2018/8/30/608

So, this is an RFC, but there's no mention of what you want comments *on*. :)

What do you want folks to review?  What needs to get settled before this
is merged?


Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()

2018-03-28 Thread Dave Hansen
On 03/28/2018 01:47 PM, Thiago Jung Bauermann wrote:
>>> if (flags)
>>> -   assert(rdpkey_reg() > orig_pkey_reg);
>>> +   assert(rdpkey_reg() < orig_pkey_reg);
>>>  }
>>>
>>>  void pkey_write_allow(int pkey)
>> This seems so horribly wrong that I wonder how it worked in the first
>> place.  Any idea?
> The code simply wasn't used. pkey_disable_clear() is called by
> pkey_write_allow() and pkey_access_allow(), but before this patch series
> nothing called either of these functions.
> 

Ahh, that explains it.  Can that get stuck in the changelog, please?
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v12 22/22] selftests/vm: Fix deadlock in protection_keys.c

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> From: Thiago Jung Bauermann 
> 
> The sig_chld() handler calls dprintf2() taking care of setting
> dprint_in_signal so that sigsafe_printf() won't call printf().
> Unfortunately, this precaution is negated by dprintf_level(), which
> has a call to fflush().
> 
> This function acquires a lock, which means that if the signal interrupts an
> ongoing fflush() the process will deadlock. At least on powerpc this is
> easy to trigger, resulting in the following backtrace when attaching to the
> frozen process:

Ugh, yeah, I've run into this too.

Acked-by: Dave Hansen 


Re: [PATCH v12 21/22] selftests/vm: sub-page allocator

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
...
> @@ -888,6 +917,7 @@ void setup_hugetlbfs(void)
>  void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
>  
>   malloc_pkey_with_mprotect,
> + malloc_pkey_with_mprotect_subpage,
>   malloc_pkey_anon_huge,
>   malloc_pkey_hugetlb
>  /* can not do direct with the pkey_mprotect() API:


I think I'd rather have an #ifdef on the array entries than have the
malloc entry do nothing on x86.  Maybe have a ppc-specific section at
the end?


Re: [PATCH v12 20/22] selftests/vm: testcases must restore pkey-permissions

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> Generally the signal handler restores the state of the pkey register
> before returning. However there are times when the read/write operation
> can legitimately fail without invoking the signal handler.  Eg: A
> sys_read() operation to a write-protected page should be disallowed.  In
> such a case the state of the pkey register is not restored to its
> original state.  The test case is responsible for restoring the key
> register state to its original value.

Oh, that's a good point.

Could we just do this in a common place, though?  Like reset the
register after each test?  Seems more foolproof.


Re: [PATCH v12 19/22] selftests/vm: detect write violation on a mapped access-denied-key page

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> detect write-violation on a page to which access-disabled
> key is associated much after the page is mapped.

Acked-by: Dave Hansen 



Re: [PATCH v12 18/22] selftests/vm: associate key on a mapped page and detect write violation

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> detect write-violation on a page to which write-disabled
> key is associated much after the page is mapped.

The more tests the merrier.

Acked-by: Dave Hansen 



Re: [PATCH v12 17/22] selftests/vm: associate key on a mapped page and detect access violation

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> detect access-violation on a page to which access-disabled
> key is associated much after the page is mapped.

Looks fine to me.  Did this actually find a bug for you?

Acked-by: Dave Hansen 



Re: [PATCH v12 16/22] selftests/vm: fix an assertion in test_pkey_alloc_exhaust()

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> +static inline int arch_reserved_keys(void)
> +{
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
> + return NR_RESERVED_PKEYS;
> +#elif __powerpc64__ /* arch */
> + if (sysconf(_SC_PAGESIZE) == 4096)
> + return NR_RESERVED_PKEYS_4K;
> + else
> + return NR_RESERVED_PKEYS_64K;
> +#else /* arch */
> + NOT SUPPORTED
> +#endif /* arch */
> +}

Yeah, this is hideous.

Please either do it in one header:

#ifdef x86..
static inline int arch_reserved_keys(void)
{
}
...
#elif ppc
static inline int arch_reserved_keys(void)
{
}
...
#else
#error
#endif

Or in multiple:

#ifdef x86..
#include 
#elif ppc
#include 
#else
#error
#endif


Re: [PATCH v12 15/22] selftests/vm: powerpc implementation to check support for pkey

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
>  #define PAGE_SIZE (0x1UL << 16)
> -static inline int cpu_has_pku(void)
> +static inline bool is_pkey_supported(void)
>  {
> - return 1;
> + /*
> +  * No simple way to determine this.
> +  * lets try allocating a key and see if it succeeds.
> +  */
> + int ret = sys_pkey_alloc(0, 0);
> +
> + if (ret > 0) {
> + sys_pkey_free(ret);
> + return true;
> + }
> + return false;
>  }

The point of doing this was to have a test for the CPU that way separate
from the syscalls.

Can you leave cpu_has_pkeys() in place?




Re: [PATCH v12 14/22] selftests/vm: clear the bits in shadow reg when a pkey is freed.

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index c4c73e6..e82bd88 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -586,7 +586,8 @@ int sys_pkey_free(unsigned long pkey)
>   int ret = syscall(SYS_pkey_free, pkey);
>  
>   if (!ret)
> - shadow_pkey_reg &= reset_bits(pkey, PKEY_DISABLE_ACCESS);
> + shadow_pkey_reg &= reset_bits(pkey,
> + PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
>   dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
>   return ret;
>  }

What about your EXEC bit?


Re: [PATCH v12 13/22] selftests/vm: powerpc implementation for generic abstraction

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
>  static inline u32 pkey_to_shift(int pkey)
>  {
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
>   return pkey * PKEY_BITS_PER_PKEY;
> +#elif __powerpc64__ /* arch */
> + return (NR_PKEYS - pkey - 1) * PKEY_BITS_PER_PKEY;
> +#endif /* arch */
>  }

I really detest the #if #else style.  Can't we just have a pkey_ppc.h
and a pkey_x86.h or something?


Re: [PATCH v12 12/22] selftests/vm: generic cleanup

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> cleanup the code to satisfy coding styles.
> 
> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |   81 ++
>  1 files changed, 43 insertions(+), 38 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index 6054093..6fdd8f5 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -4,7 +4,7 @@
>   *
>   * There are examples in here of:
>   *  * how to set protection keys on memory
> - *  * how to set/clear bits in pkey registers (the rights register)
> + *  * how to set/clear bits in Protection Key registers (the rights register)

I don't think CodingStyle says to do this. :)

>   *  * how to handle SEGV_PKUERR signals and extract pkey-relevant
>   *information from the siginfo
>   *
> @@ -13,13 +13,18 @@
>   *   prefault pages in at malloc, or not
>   *   protect MPX bounds tables with protection keys?
>   *   make sure VMA splitting/merging is working correctly
> - *   OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
> - *   look for pkey "leaks" where it is still set on a VMA but "freed" back to the kernel
> - *   do a plain mprotect() to a mprotect_pkey() area and make sure the pkey sticks
> + *   OOMs can destroy mm->mmap (see exit_mmap()),
> + *   so make sure it is immune to pkeys
> + *   look for pkey "leaks" where it is still set on a VMA
> + *but "freed" back to the kernel
> + *   do a plain mprotect() to a mprotect_pkey() area and make
> + *sure the pkey sticks

Ram, I'm not sure where this came from, but this looks horrid.  Please
don't do this to the file.

>   * Compile like this:
> - *   gcc  -o protection_keys-O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
> - *   gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
> + *   gcc  -o protection_keys-O2 -g -std=gnu99
> + *-pthread -Wall protection_keys.c -lrt -ldl -lm
> + *   gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99
> + *-pthread -Wall protection_keys.c -lrt -ldl -lm
>   */

Please just leave this, or remove it from the file.  It was a long line
so it could be copied and pasted; this wrapping ruins that.



>  #define _GNU_SOURCE
>  #include 
> @@ -251,26 +256,11 @@ void signal_handler(int signum, siginfo_t *si, void *vucontext)
>   dprintf1("signal pkey_reg from  pkey_reg: %016lx\n", __rdpkey_reg());
>   dprintf1("pkey from siginfo: %jx\n", siginfo_pkey);
>   *(u64 *)pkey_reg_ptr = 0x;
> - dprintf1("WARNING: set PRKU=0 to allow faulting instruction to continue\n");
> + dprintf1("WARNING: set PKEY_REG=0 to allow faulting instruction "
> + "to continue\n");
>   pkey_faults++;
>   dprintf1("<<<<==\n");
>   return;
> - if (trapno == 14) {
> - fprintf(stderr,
> - "ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
> - trapno, ip);
> - fprintf(stderr, "si_addr %p\n", si->si_addr);
> - fprintf(stderr, "REG_ERR: %lx\n",
> - (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
> - exit(1);
> - } else {
> - fprintf(stderr, "unexpected trap %d! at 0x%lx\n", trapno, ip);
> - fprintf(stderr, "si_addr %p\n", si->si_addr);
> - fprintf(stderr, "REG_ERR: %lx\n",
> - (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
> - exit(2);
> - }
> - dprint_in_signal = 0;
>  }

I think this is just randomly removing code now.

I think you should probably just drop this patch.  It's not really
bringing anything useful.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v12 11/22] selftests/vm: pkey register should match shadow pkey

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> expected_pkey_fault() is comparing the contents of pkey
> register with 0. This may not be true all the time. There
> could be bits set by default by the architecture
> which can never be changed. Hence compare the value against
> shadow pkey register, which is supposed to track the bits
> accurately all throughout
> 
> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index 254b66d..6054093 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -926,10 +926,10 @@ void expected_pkey_fault(int pkey)
>   pkey_assert(last_pkey_faults + 1 == pkey_faults);
>   pkey_assert(last_si_pkey == pkey);
>   /*
> -  * The signal handler shold have cleared out PKEY register to let the
> +  * The signal handler shold have cleared out pkey-register to let the

Heh, you randomly changed the formatting and didn't bother with my awful
typo. :)

>* test program continue.  We now have to restore it.
>*/
> - if (__rdpkey_reg() != 0)
> + if (__rdpkey_reg() != shadow_pkey_reg)
>   pkey_assert(0);
>  
>   __wrpkey_reg(shadow_pkey_reg);
> 

I don't think this should be "shadow_pkey_reg".  This was just trying to
double-check that the signal handler messed around with PKRU the way we
expected.

We could also just check that the disable bits for 'pkey' are clear at
this point.  That would be almost as good.


Re: [PATCH v12 10/22] selftests/vm: introduce two arch independent abstraction

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> open_hugepage_file() <- opens the huge page file
> get_start_key() <--  provides the first non-reserved key.
> 

Looks reasonable.

Reviewed-by: Dave Hansen 



Re: [PATCH v12 09/22] selftests/vm: fix alloc_random_pkey() to make it really random

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> alloc_random_pkey() was allocating the same pkey every time.
> Not all pkeys were geting tested. fixed it.
...
> @@ -602,13 +603,15 @@ int alloc_random_pkey(void)
>   int alloced_pkeys[NR_PKEYS];
>   int nr_alloced = 0;
>   int random_index;
> +
>   memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
> + srand((unsigned int)time(NULL));
>  
>   /* allocate every possible key and make a note of which ones we got */
>   max_nr_pkey_allocs = NR_PKEYS;
> - max_nr_pkey_allocs = 1;
>   for (i = 0; i < max_nr_pkey_allocs; i++) {
>   int new_pkey = alloc_pkey();

The srand() is probably useful, but won't this always just do a single
alloc_pkey() now?  That seems like it will mean we always use the first
one the kernel gives us, which isn't random.

> - dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n", __func__,
> - __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
> + dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%016lx\n",
> + __func__, __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
>   return ret;
>  }

This belonged in the pkey_reg_t patch, I think.



Re: [PATCH v12 08/22] selftests/vm: clear the bits in shadow reg when a pkey is freed.

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> When a key is freed, the  key  is  no  more  effective.
> Clear the bits corresponding to the pkey in the shadow
> register. Otherwise  it  will carry some spurious bits
> which can trigger false-positive asserts.
...
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index ca54a95..aaf9f09 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -582,6 +582,9 @@ int alloc_pkey(void)
>  int sys_pkey_free(unsigned long pkey)
>  {
>   int ret = syscall(SYS_pkey_free, pkey);
> +
> + if (!ret)
> + shadow_pkey_reg &= reset_bits(pkey, PKEY_DISABLE_ACCESS);
>   dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
>   return ret;
>  }

Did this cause problems for you in practice?

On x86, sys_pkey_free() does not affect PKRU, so this isn't quite right.
 I'd much rather have the actual tests explicitly clear the PKRU bits
and also in the process clear the shadow bits.


Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -461,7 +461,7 @@ void pkey_disable_clear(int pkey, int flags)
>   pkey, pkey, pkey_rights);
>   pkey_assert(pkey_rights >= 0);
>  
> - pkey_rights |= flags;
> + pkey_rights &= ~flags;
>  
>   ret = pkey_set(pkey, pkey_rights, 0);
>   /* pkey_reg and flags have the same format */
> @@ -475,7 +475,7 @@ void pkey_disable_clear(int pkey, int flags)
>   dprintf1("%s(%d) pkey_reg: 0x%016lx\n", __func__,
>   pkey, rdpkey_reg());
>   if (flags)
> - assert(rdpkey_reg() > orig_pkey_reg);
> + assert(rdpkey_reg() < orig_pkey_reg);
>  }
>  
>  void pkey_write_allow(int pkey)

This seems so horribly wrong that I wonder how it worked in the first
place.  Any idea?


Re: [PATCH v12 06/22] selftests/vm: fix the wrong assert in pkey_disable_set()

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> If the flag is 0, no bits will be set. Hence we cant expect
> the resulting bitmap to have a higher value than what it
> was earlier.
> 
> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/protection_keys.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index 83216c5..0109388 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -443,7 +443,7 @@ void pkey_disable_set(int pkey, int flags)
>   dprintf1("%s(%d) pkey_reg: 0x%lx\n",
>   __func__, pkey, rdpkey_reg());
>   if (flags)
> - pkey_assert(rdpkey_reg() > orig_pkey_reg);
> + pkey_assert(rdpkey_reg() >= orig_pkey_reg);
>   dprintf1("END<---%s(%d, 0x%x)\n", __func__,
>   pkey, flags);
>  }

I'm not sure about this one.  Did this cause a problem for you?

Why would you call this and ask no bits to be set?



Re: [PATCH v12 05/22] selftests/vm: generic function to handle shadow key register

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> +static inline u32 pkey_to_shift(int pkey)
> +{
> + return pkey * PKEY_BITS_PER_PKEY;
> +}

pkey_bit_position(), perhaps?

> +static inline pkey_reg_t reset_bits(int pkey, pkey_reg_t bits)
> +{
> + u32 shift = pkey_to_shift(pkey);
> +
> + return ~(bits << shift);
> +}

I'd prefer clear_pkey_flags() or maybe clear_pkey_bits().  "reset" can
mean "reset to 0" or "reset to 1".

Also, why the u32 here?  Isn't an int more appropriate?

> +static inline pkey_reg_t left_shift_bits(int pkey, pkey_reg_t bits)
> +{
> + u32 shift = pkey_to_shift(pkey);
> +
> + return (bits << shift);
> +}
> +
> +static inline pkey_reg_t right_shift_bits(int pkey, pkey_reg_t bits)
> +{
> + u32 shift = pkey_to_shift(pkey);
> +
> + return (bits >> shift);
> +}

Some comments on these would be handy.  Basically that this takes a
per-key flags value and puts it at the right position so it can be
shoved in the register.


Re: [PATCH v12 04/22] selftests/vm: typecast the pkey register

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
> -static inline unsigned int _rdpkey_reg(int line)
> +static inline pkey_reg_t _rdpkey_reg(int line)
>  {
> - unsigned int pkey_reg = __rdpkey_reg();
> + pkey_reg_t pkey_reg = __rdpkey_reg();
>  
> - dprintf4("rdpkey_reg(line=%d) pkey_reg: %x shadow: %x\n",
> + dprintf4("rdpkey_reg(line=%d) pkey_reg: %016lx shadow: %016lx\n",
>   line, pkey_reg, shadow_pkey_reg);
>   assert(pkey_reg == shadow_pkey_reg);

Hmm.  So we're using %lx for an int?  Doesn't the compiler complain
about this?


Re: [PATCH v12 02/22] selftests/vm: rename all references to pkru to a generic name

2018-03-16 Thread Dave Hansen
On 02/21/2018 05:55 PM, Ram Pai wrote:
>  int pkey_set(int pkey, unsigned long rights, unsigned long flags)
>  {
>   u32 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
> - u32 old_pkru = __rdpkru();
> - u32 new_pkru;
> + u32 old_pkey_reg = __rdpkey_reg();
> + u32 new_pkey_reg;

If we're not using the _actual_ instruction names ("rdpkru"), I think
I'd rather this be something more readable, like: __read_pkey_reg().

But, it's OK-ish the way it is.

Reviewed-by: Dave Hansen 


[PATCH] [v2] docs: clarify security-bugs disclosure policy

2018-03-07 Thread Dave Hansen

From: Dave Hansen 

I think we need to soften the language a bit.  It might scare folks
off, especially the:

 We prefer to fully disclose the bug as soon as possible.

which is not really the case.  Linus says:

It's not full disclosure, it's not coordinated disclosure,
and it's not "no disclosure".  It's more like just "timely
open fixes".

I changed a bit of the wording in here, but mostly to remove the word
"disclosure" since it seems to mean very specific things to people
that we do not mean here.

Signed-off-by: Dave Hansen 
Reviewed-by: Dan Williams 
Cc: Thomas Gleixner 
Cc: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Alan Cox 
Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Cc: Tim Chen 
Cc: Alexander Viro 
Cc: Andrew Morton 
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet 
Cc: Mark Rutland 
---

 b/Documentation/admin-guide/security-bugs.rst |   24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff -puN Documentation/admin-guide/security-bugs.rst~embargo2 Documentation/admin-guide/security-bugs.rst
--- a/Documentation/admin-guide/security-bugs.rst~embargo2  2018-03-07 13:23:49.390228208 -0800
+++ b/Documentation/admin-guide/security-bugs.rst   2018-03-07 13:42:37.618225395 -0800
@@ -29,18 +29,20 @@ made public.
 Disclosure
 --
 
-The goal of the Linux kernel security team is to work with the
-bug submitter to bug resolution as well as disclosure.  We prefer
-to fully disclose the bug as soon as possible.  It is reasonable to
-delay disclosure when the bug or the fix is not yet fully understood,
-the solution is not well-tested or for vendor coordination.  However, we
-expect these delays to be short, measurable in days, not weeks or months.
-A disclosure date is negotiated by the security team working with the
-bug submitter as well as vendors.  However, the kernel security team
-holds the final say when setting a disclosure date.  The timeframe for
-disclosure is from immediate (esp. if it's already publicly known)
+The goal of the Linux kernel security team is to work with the bug
+submitter to understand and fix the bug.  We prefer to publish the fix as
+soon as possible, but try to avoid public discussion of the bug itself
+and leave that to others.
+
+Publishing the fix may be delayed when the bug or the fix is not yet
+fully understood, the solution is not well-tested or for vendor
+coordination.  However, we expect these delays to be short, measurable in
+days, not weeks or months.  A release date is negotiated by the security
+team working with the bug submitter as well as vendors.  However, the
+kernel security team holds the final say when setting a timeframe.  The
+timeframe varies from immediate (esp. if it's already publicly known bug)
 to a few weeks.  As a basic default policy, we expect report date to
-disclosure date to be on the order of 7 days.
+release date to be on the order of 7 days.
 
 Coordination
 
_


[PATCH] docs: clarify security-bugs disclosure policy

2018-03-06 Thread Dave Hansen

From: Dave Hansen 

I think we need to soften the language a bit.  It might scare folks
off, especially the:

 We prefer to fully disclose the bug as soon as possible.

which is not really the case.  As Greg mentioned in private mail, we
really do not prefer to disclose things until *after* a fix.  The
whole "we have the final say" is also a bit harsh.

Signed-off-by: Dave Hansen 
Cc: Thomas Gleixner 
Cc: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Alan Cox 
Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Cc: Tim Chen 
Cc: Dan Williams 
Cc: Alexander Viro 
Cc: Andrew Morton 
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet 
Cc: Mark Rutland 
---

 b/Documentation/admin-guide/security-bugs.rst |   26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff -puN Documentation/admin-guide/security-bugs.rst~embargo Documentation/admin-guide/security-bugs.rst
--- a/Documentation/admin-guide/security-bugs.rst~embargo   2018-03-06 14:47:04.519431230 -0800
+++ b/Documentation/admin-guide/security-bugs.rst   2018-03-06 14:57:46.410429629 -0800
@@ -29,18 +29,22 @@ made public.
 Disclosure
 --
 
-The goal of the Linux kernel security team is to work with the
-bug submitter to bug resolution as well as disclosure.  We prefer
-to fully disclose the bug as soon as possible.  It is reasonable to
-delay disclosure when the bug or the fix is not yet fully understood,
-the solution is not well-tested or for vendor coordination.  However, we
-expect these delays to be short, measurable in days, not weeks or months.
+The goal of the Linux kernel security team is to work with the bug
+submitter to bug resolution as well as disclosure.  We prefer to fully
+disclose the bug as soon as possible after a fix is available.  It is
+customary to delay disclosure when the bug or the fix is not yet fully
+understood, the solution is not well-tested or for vendor coordination.
+However, we expect these delays to typically be short, measurable in
+days, not weeks or months.
+
 A disclosure date is negotiated by the security team working with the
-bug submitter as well as vendors.  However, the kernel security team
-holds the final say when setting a disclosure date.  The timeframe for
-disclosure is from immediate (esp. if it's already publicly known)
-to a few weeks.  As a basic default policy, we expect report date to
-disclosure date to be on the order of 7 days.
+bug submitter as well as affected vendors.  The security team prefers
+coordinated disclosure and will consider pre-existing, reasonable
+disclosure dates.
+
+The timeframe for disclosure ranges from immediate (esp. if it's
+already publicly known) to a few weeks.  As a basic default policy, we
+expect report date to disclosure date to be on the order of 7 days.
 
 Coordination
 
_


Re: [PATCH v12 10/11] sparc64: Add support for ADI (Application Data Integrity)

2018-03-05 Thread Dave Hansen
On 03/05/2018 01:37 PM, Khalid Aziz wrote:
>> How big can this storage get, btw?  Superficially it seems like it might
>> be able to be gigantic for a large, sparse VMA.
>>
> Tags are stored only for the pages being swapped out, not for the pages
> in entire vma. Each tag storage page can hold tags for 128 pages (each
> page has 128 4-bit tags, hence 64 bytes are needed to store tags for an
> entire page allowing each page to store tags for 128 pages). Sparse VMA
> does not cause any problems since holes do not have corresponding pages
> that will be swapped out. Tag storage pages are freed once all the pages
> they store tags for have been swapped back in, except for a small number
> of pages (maximum of 8) marked for emergency tag storage.

With a linear scan holding a process-wide spinlock?  If you have a fast
swap device, does this become the bottleneck when swapping ADI-tagged
memory?

FWIW, this tag storage is complex and subtle enough code that it
deserves to be in its own well-documented patch, not buried in a
thousand-line patch.

> +tag_storage_desc_t *find_tag_store(struct mm_struct *mm,
> +struct vm_area_struct *vma,
> +unsigned long addr)
> +{
> + tag_storage_desc_t *tag_desc = NULL;
> + unsigned long i, max_desc, flags;
> +
> + /* Check if this vma already has tag storage descriptor
> +  * allocated for it.
> +  */
> + max_desc = PAGE_SIZE/sizeof(tag_storage_desc_t);
> + if (mm->context.tag_store) {
> + tag_desc = mm->context.tag_store;
> + spin_lock_irqsave(&mm->context.tag_lock, flags);
> + for (i = 0; i < max_desc; i++) {
> + if ((addr >= tag_desc->start) &&
> + ((addr + PAGE_SIZE - 1) <= tag_desc->end))
> + break;
> + tag_desc++;
> + }
> + spin_unlock_irqrestore(&mm->context.tag_lock, flags);



Re: [PATCH v12 10/11] sparc64: Add support for ADI (Application Data Integrity)

2018-03-05 Thread Dave Hansen
On 03/05/2018 01:14 PM, Khalid Aziz wrote:
> Are you suggesting that vma returned by find_vma() could be split or
> merged underneath me if I do not hold mmap_sem and thus make the flag
> check invalid? If so, that is a good point.

This part does make me think that this code hasn't been tested very
thoroughly.  Could you describe the testing that you have done?  For MPX
and protection keys, I added something to tools/testing/selftests/x86,
for instance.


Re: [PATCH v12 10/11] sparc64: Add support for ADI (Application Data Integrity)

2018-03-05 Thread Dave Hansen
On 03/05/2018 01:14 PM, Khalid Aziz wrote:
> On 03/05/2018 12:22 PM, Dave Hansen wrote:
>> On 02/21/2018 09:15 AM, Khalid Aziz wrote:
>>> +#define arch_validate_prot(prot, addr) sparc_validate_prot(prot, addr)
>>> +static inline int sparc_validate_prot(unsigned long prot, unsigned
>>> long addr)
>>> +{
>>> +    if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM |
>>> PROT_ADI))
>>> +    return 0;
>>> +    if (prot & PROT_ADI) {
>>> +    if (!adi_capable())
>>> +    return 0;
>>> +
>>> +    if (addr) {
>>> +    struct vm_area_struct *vma;
>>> +
>>> +    vma = find_vma(current->mm, addr);
>>> +    if (vma) {
>>> +    /* ADI can not be enabled on PFN
>>> + * mapped pages
>>> + */
>>> +    if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>>> +    return 0;
>>
>> You don't hold mmap_sem here.  How can this work?
>>
> Are you suggesting that vma returned by find_vma() could be split or
> merged underneath me if I do not hold mmap_sem and thus make the flag
> check invalid? If so, that is a good point.

Um, yes.  You can't walk the vma tree without holding mmap_sem.


Re: [PATCH v12 10/11] sparc64: Add support for ADI (Application Data Integrity)

2018-03-05 Thread Dave Hansen
On 02/21/2018 09:15 AM, Khalid Aziz wrote:
> +tag_storage_desc_t *alloc_tag_store(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long addr)
...
> + tags = kzalloc(size, GFP_NOWAIT|__GFP_NOWARN);
> + if (tags == NULL) {
> + tag_desc->tag_users = 0;
> + tag_desc = NULL;
> + goto out;
> + }
> + tag_desc->start = addr;
> + tag_desc->tags = tags;
> + tag_desc->end = end_addr;
> +
> +out:
> + spin_unlock_irqrestore(&mm->context.tag_lock, flags);
> + return tag_desc;
> +}

OK, sorry, I missed this.  I do see that you now have per-ADI-block tag
storage and it is not per-page.

How big can this storage get, btw?  Superficially it seems like it might
be able to be gigantic for a large, sparse VMA.


Re: [PATCH v12 10/11] sparc64: Add support for ADI (Application Data Integrity)

2018-03-05 Thread Dave Hansen
On 02/21/2018 09:15 AM, Khalid Aziz wrote:
> +#define arch_validate_prot(prot, addr) sparc_validate_prot(prot, addr)
> +static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
> +{
> + if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
> + return 0;
> + if (prot & PROT_ADI) {
> + if (!adi_capable())
> + return 0;
> +
> + if (addr) {
> + struct vm_area_struct *vma;
> +
> + vma = find_vma(current->mm, addr);
> + if (vma) {
> + /* ADI can not be enabled on PFN
> +  * mapped pages
> +  */
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + return 0;

You don't hold mmap_sem here.  How can this work?


Re: [PATCH v9 29/51] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-12-18 Thread Dave Hansen
On 12/18/2017 02:18 PM, Ram Pai wrote:
> b) minimum number of keys available to the application.
>   if libraries consumes a few, they could provide a library
>   interface to the application informing the number available to
>   the application.  The library interface can leverage (b) to
>   provide the information.

OK, let's see a real user of this including a few libraries.  Then we'll
put it in the kernel.

> c) types of disable-rights supported by keys.
>   Helps the application to determine the types of disable-features
>   available. This is helpful, otherwise the app has to 
>   make pkey_alloc() call with the corresponding parameter set
>   and see if it suceeds or fails. Painful from an application
>   point of view, in my opinion.

Again, let's see a real-world use of this.  How does it look?  How does
an app "fall back" if it can't set a restriction the way it wants to?

Are we *sure* that such an interface makes sense?  For instance, will it
be possible for some keys to be execute-disable while other are only
write-disable?

> I think on x86 you look for some hardware registers to determine which
> hardware features are enabled by the kernel.

No, we use CPUID.  It's a part of the ISA that's designed for
enumerating CPU and (sometimes) OS support for CPU features.

> We do not have generic support for something like that on ppc.
> The kernel looks at the device tree to determine what hardware features
> are available. But does not have mechanism to tell the hardware to track
> which of its features are currently enabled/used by the kernel; atleast
> not for the memory-key feature.

Bummer.  You're missing out.

But, you could still do this with a syscall.  "Hey, kernel, do you
support this feature?"


Re: [PATCH v9 29/51] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-12-18 Thread Dave Hansen
On 11/06/2017 12:57 AM, Ram Pai wrote:
> Expose useful information for programs using memory protection keys.
> Provide implementation for powerpc and x86.
> 
> On a powerpc system with pkeys support, here is what is shown:
> 
> $ head /sys/kernel/mm/protection_keys/*
> ==> /sys/kernel/mm/protection_keys/disable_access_supported <==
> true

This is cute, but I don't think it should be part of the ABI.  Put it in
debugfs if you want it for cute tests.  The stuff that this tells you
can and should come from pkey_alloc() for the ABI.

http://man7.org/linux/man-pages/man7/pkeys.7.html

>Any application wanting to use protection keys needs to be able to
>function without them.  They might be unavailable because the
>hardware that the application runs on does not support them, the
>kernel code does not contain support, the kernel support has been
>disabled, or because the keys have all been allocated, perhaps by a
>library the application is using.  It is recommended that
>applications wanting to use protection keys should simply call
>pkey_alloc(2) and test whether the call succeeds, instead of
>attempting to detect support for the feature in any other way.

Do you really not have a standard way on ppc to say whether hardware
features are supported by the kernel?  For instance, how do you know if
a given set of registers are known to and are being context-switched by
the kernel?


  1   2   >