Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-07 Thread Kees Cook
On Thu, May 06, 2021 at 11:47:47AM -0700, James Bottomley wrote:
> On Thu, 2021-05-06 at 10:33 -0700, Kees Cook wrote:
> > On Thu, May 06, 2021 at 08:26:41AM -0700, James Bottomley wrote:
> [...]
> > > > I think that a very complete description of the threats which
> > > > this feature addresses would be helpful.  
> > > 
> > > It's designed to protect against three different threats:
> > > 
> > >1. Detection of user secret memory mismanagement
> > 
> > I would say "cross-process secret userspace memory exposures" (via a
> > number of common interfaces by blocking it at the GUP level).
> > 
> > >2. significant protection against privilege escalation
> > 
> > I don't see how this series protects against privilege escalation.
> > (It protects against exfiltration.) Maybe you mean include this in
> > the first bullet point (i.e. "cross-process secret userspace memory
> > exposures, even in the face of privileged processes")?
> 
> It doesn't prevent privilege escalation from happening in the first
> place, but once the escalation has happened it protects against
> exfiltration by the newly minted root attacker.

So, after thinking a bit more about this, I don't think there is
protection here against privileged execution. This feature kind of helps
against cross-process read/write attempts, but it doesn't help with
sufficiently privileged (i.e. ptraced) execution, since we can just ask
the process itself to do the reading:

$ gdb ./memfd_secret
...
ready: 0x77ffb000
Breakpoint 1, ...
(gdb) compile code unsigned long addr = 0x77ffb000UL; printf("%016lx\n", *((unsigned long *)addr));
5

And since process_vm_readv() requires PTRACE_ATTACH, there's very little
difference in effort between process_vm_readv() and the above.

So, what other paths through GUP exist that aren't covered by
PTRACE_ATTACH? And if none, then should this actually just be done by
setting the process undumpable? (This is already what things like gnupg
do.)

So, the user-space side of this doesn't seem to really help. The kernel
side protection is interesting for kernel read/write flaws, though, in
the sense that the process is likely not being attacked from "current",
so a kernel-side attack would need to either walk the page tables and
create new ones, or spawn a new userspace process to do the ptracing.

So, while I like the idea of this stuff, and I see how it provides
certain coverages, I'm curious to learn more about the threat model to
make sure it's actually providing meaningful hurdles to attacks.

-- 
Kees Cook
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-06 Thread Kees Cook
ng finding gadgets
> really hard, usually the attacker gets one or at most two gadgets to
> string together.  Not having any in-kernel primitive for accessing
> secret memory means the one gadget ROP attack can't work.  Since the
> only way to access secret memory is to reconstruct the missing mapping
> entry, the attacker has to recover the physical page and insert a PTE
> pointing to it in the kernel and then retrieve the contents.  That
> takes at least three gadgets which is a level of difficulty beyond most
> standard attacks.

As for protecting against exploited kernel flaws I also see benefits
here. While the kernel is already blocked from directly reading contents
from userspace virtual addresses (i.e. SMAP), this feature does help by
blocking the kernel from directly reading contents via the direct map
alias. (i.e. this feature is a specialized version of XPFO[1], which
tried to do this for ALL user memory.) So in that regard, yes, this has
value in the sense that to perform exfiltration, an attacker would need
a significant level of control over kernel execution or over page table
contents.

Sufficient control over PTE allocation and positioning is possible
without kernel execution control[3], and "only" having an arbitrary
write primitive can lead to direct PTE control. Because of this, it
would be nice to have page tables strongly protected[2] in the kernel.
They remain a viable "data only" attack given a sufficiently "capable"
write flaw.

I would argue that page table entries are a more important asset to
protect than userspace secrets, but given the difficulties with XPFO
and the not-yet-available PKS I can understand starting here. It does,
absolutely, narrow the ways exploits must be written to exfiltrate secret
contents. (We are starting to now constrict[4] many attack methods
into attacking the page table itself, which is good in the sense that
protecting page tables will be a big win, and bad in the sense that
focusing attack research on page tables means we're going to see some
very powerful attacks.)

> > I think that a very complete description of the threats which this
> > feature addresses would be helpful.  
> 
> It's designed to protect against three different threats:
> 
>1. Detection of user secret memory mismanagement

I would say "cross-process secret userspace memory exposures" (via a
number of common interfaces by blocking it at the GUP level).

>2. significant protection against privilege escalation

I don't see how this series protects against privilege escalation. (It
protects against exfiltration.) Maybe you mean include this in the first
bullet point (i.e. "cross-process secret userspace memory exposures,
even in the face of privileged processes")?

>3. enhanced protection (in conjunction with all the other in-kernel
>   attack prevention systems) against ROP attacks.

Same here, I don't see it preventing ROP, but I see it making "simple"
ROP insufficient to perform exfiltration.

-Kees

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/yama/yama_lsm.c?h=v5.12#n410
[1] 
https://lore.kernel.org/linux-mm/cover.1554248001.git.khalid.a...@oracle.com/
[2] 
https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgeco...@intel.com/
[3] 
https://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
[4] https://git.kernel.org/linus/cf68fffb66d60d96209446bfc4a15291dc5a5d41

-- 
Kees Cook


Re: [PATCH 08/14] taint: add taint for direct hardware access

2021-02-08 Thread Kees Cook
zation plugin
> > + 18  _/H  262144  kernel has allowed vendor shenanigans
> >  ===  ===  ==  
> >
> >  Note: The character ``_`` is representing a blank in this table to make reading
> > @@ -175,3 +176,6 @@ More detailed explanation for tainting
> >   produce extremely unusual kernel structure layouts (even performance
> >   pathological ones), which is important to know when debugging. Set at
> >   build time.
> > +
> > + 18) ``H`` Kernel has allowed direct access to hardware and can no longer make
> > + any guarantees about the stability of the device or driver.
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index f7902d8c1048..bc95486f817e 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -443,7 +443,8 @@ extern enum system_states {
> >  #define TAINT_LIVEPATCH15
> >  #define TAINT_AUX  16
> >  #define TAINT_RANDSTRUCT   17
> > -#define TAINT_FLAGS_COUNT  18
> > +#define TAINT_RAW_PASSTHROUGH  18
> > +#define TAINT_FLAGS_COUNT  19
> >  #define TAINT_FLAGS_MAX    ((1UL << TAINT_FLAGS_COUNT) - 1)
> >
> >  struct taint_flag {
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 332736a72a58..dff22bd80eaf 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -386,6 +386,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> > [ TAINT_LIVEPATCH ] = { 'K', ' ', true },
> > [ TAINT_AUX ]   = { 'X', ' ', true },
> > [ TAINT_RANDSTRUCT ]= { 'T', ' ', true },
> > +   [ TAINT_RAW_PASSTHROUGH ]   = { 'H', ' ', true },
> >  };
> >
> >  /**
> > --
> > 2.30.0
> >

-- 
Kees Cook


Re: [PATCH RFC V2 00/17] PKS: Add Protection Keys Supervisor (PKS) support

2020-07-24 Thread Kees Cook
On Fri, Jul 17, 2020 at 12:20:39AM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Changes from RFC:
>   Clean up commit messages based on Peter Zijlstra's and Dave Hansen's
>   feedback
>   Fix static branch anti-pattern
>   New patch:
>   (memremap: Convert devmap static branch to {inc,dec})
>   This was the code I used as a model for my static branch which
>   I believe is wrong now.
>   New Patch:
>   (x86/entry: Preserve PKRS MSR through exceptions)
>   This attempts to preserve the per-logical-processor MSR, and
> reference counting during exceptions.  I'd really like feedback
> on this because I _think_ it should work but I'm afraid
>   I'm missing something as my testing has shown a lot of spotty
>   crashes which don't make sense to me.
> 
> This patch set introduces a new page protection mechanism for supervisor pages,
> Protection Key Supervisor (PKS) and an initial user of them, persistent memory,
> PMEM.
> 
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to those pages beyond the normal paging protections.  They work in
> a similar fashion to user space pkeys.  Like User page pkeys (PKU), supervisor
> pkeys are checked in addition to normal paging protections and Access or Writes
> can be disabled via a MSR update without TLB flushes when permissions change.
> A page mapping is assigned to a domain by setting a pkey in the page table
> entry.
> 
> Unlike User pkeys no new instructions are added; rather WRMSR/RDMSR are used to
> update the PKRS register.
> 
> XSAVE is not supported for the PKRS MSR.  To reduce software complexity the
> implementation saves/restores the MSR across context switches but not during
> irqs.  This is a compromise which results in a hardening of unwanted access
> without absolute restriction.
> 
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections on mappings with the default pkey value of 0.
> 
> Other keys, (1-15) are allocated by an allocator which prepares us for key
> contention from day one.  Kernel users should be prepared for the allocator to
> fail either because of key exhaustion or due to PKS not being supported on the
> arch and/or CPU instance.
> 
> Protecting against stray writes is particularly important for PMEM because,
> unlike writes to anonymous memory, writes to PMEM persist across a reboot.
> Thus data corruption could result in permanent loss of data.
> 
> The following attributes of PKS make it perfect as a mechanism to protect PMEM
> from stray access within the kernel:
> 
>1) Fast switching of permissions
>2) Prevents access without page table manipulations
>3) Works on a per thread basis
>4) No TLB flushes required

Cool! This seems like it'd be very handy to make other types of kernel
data "read-only at rest" (as was long ago proposed via X86_CR0_WP[1],
which only provided two protection levels, not 15). For example, I think
at least a few other kinds of areas stand out to me that are in need
of PKS markings (i.e. only things that actually manipulate these areas
should gain temporary PK access):
- Page Tables themselves
- Identity mapping
- The "read-only at rest" stuff, though it'll need special plumbing to
  make it work with the slab allocator, etc (more like the later "static
  allocation" work[2]).

[1] 
https://lore.kernel.org/lkml/1490811363-93944-1-git-send-email-keesc...@chromium.org/
[2] https://lore.kernel.org/lkml/cover.1550097697.git.igor.sto...@huawei.com/

-- 
Kees Cook


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> On Fri, May 17, 2019 at 8:57 AM Kees Cook  wrote:
> >
> > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > > As far as I can see it's mostly check_heap_object() that is the
> > > problem, so I'm open to finding a way to just bypass that sub-routine.
> > > However, as far as I can see none of the other block / filesystem user
> > > copy implementations submit to the hardened checks, like
> > > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> > > either those need to grow additional checks, or the hardened copy
> > > implementation is targeting single object copy use cases, not
> > > necessarily block-I/O. Yes, Kees, please advise.
> >
> > The intention is mainly for copies that haven't had explicit bounds
> > checking already performed on them, yes. Is there something getting
> > checked out of the slab, or is it literally just the overhead of doing
> > the "is this slab?" check that you're seeing?
> 
> It's literally the overhead of "is this slab?" since it needs to go
> retrieve the struct page and read that potentially cold cacheline. In
> the case where that page is on memory media that is higher latency
> than DRAM we get the ~37% performance loss that Jeff measured.

Ah-ha! Okay, I understand now; thanks!

> The path is via the filesystem ->write_iter() file operation. In the
> DAX case the filesystem traps that path early, before submitting block
> I/O, and routes it to the dax_iomap_actor() routine. That routine
> validates that the logical file offset is within bounds of the file,
> then it does a sector-to-pfn translation which validates that the
> physical mapping is within bounds of the block device.
> 
> It seems dax_iomap_actor() is not a path where we'd be worried about
> needing hardened user copy checks.

I would agree: I think the proposed patch makes sense. :)

-- 
Kees Cook
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 04:14:03PM +, David Laight wrote:
> From: Kees Cook
> > Sent: 17 May 2019 16:54
> ...
> > > I've changed some of our code to use __get_user() to avoid
> > > these stupid overheads.
> > 
> > __get_user() skips even access_ok() checking too, so that doesn't seem
> > like a good idea. Did you run access_ok() checks separately? (This
> > generally isn't recommended.)
> 
> Of course, I'm not THAT stupid :-)

Right, yes, I know. :) I just wanted to double-check since accidents
can happen. The number of underscores on these function is not really
a great way to indicate what they're doing. ;)

-- 
Kees Cook


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> As far as I can see it's mostly check_heap_object() that is the
> problem, so I'm open to finding a way to just bypass that sub-routine.
> However, as far as I can see none of the other block / filesystem user
> copy implementations submit to the hardened checks, like
> bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> either those need to grow additional checks, or the hardened copy
> implementation is targeting single object copy use cases, not
> necessarily block-I/O. Yes, Kees, please advise.

The intention is mainly for copies that haven't had explicit bounds
checking already performed on them, yes. Is there something getting
checked out of the slab, or is it literally just the overhead of doing
the "is this slab?" check that you're seeing?

-- 
Kees Cook


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 09:06:26AM +, David Laight wrote:
> From: Jan Kara
> > Sent: 17 May 2019 09:48
> ...
> > So this last paragraph is not obvious to me as check_copy_size() does a lot
> > of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of
> > those checks don't make sense for PMEM pages but I'd rather handle that by
> > refining check_copy_size() and check_object_size() functions to detect and
> > appropriately handle pmem pages rather that generally skip all the checks
> > in pmem_copy_from/to_iter(). And yes, every check in such hot path is going
> > to cost performance but that's what user asked for with
> > CONFIG_HARDENED_USERCOPY... Kees?
> 
> Except that all the distros enable it by default.
> So you get the performance cost whether you (as a user) want it or not.

Note that it can be disabled on the kernel command line, but that seems
like a last resort. :)

> 
> I've changed some of our code to use __get_user() to avoid
> these stupid overheads.

__get_user() skips even access_ok() checking too, so that doesn't seem
like a good idea. Did you run access_ok() checks separately? (This
generally isn't recommended.)

The usercopy hardening is intended to only kick in when the copy size
isn't a builtin constant -- it's attempting to do a bounds check on
the size, with the pointer used to figure out what bounds checking is
possible (basically "is this stack? check stack location/frame size"
or "is this kmem cache? check the allocation size") and then do bounds
checks from there. It tries to bail out early to avoid needless checking,
so if there is some additional logic to be added to check_object_size()
that is globally applicable, sure, let's do it.

I'm still not clear from this thread about the case that is getting
tripped/slowed? If you're already doing bounds checks somewhere else
and there isn't a chance for the pointer or size to change, then yeah,
it seems safe to skip the usercopy size checks. But what's the actual
code that is having a problem?

-- 
Kees Cook


Re: [PATCH v2 12/17] kunit: tool: add Python wrappers for running KUnit tests

2019-05-06 Thread Kees Cook
On Sun, May 5, 2019 at 5:19 PM Frank Rowand  wrote:
> You can see the full version 14 document in the submitter's repo:
>
>   $ git clone https://github.com/isaacs/testanything.github.io.git
>   $ cd testanything.github.io
>   $ git checkout tap14
>   $ ls tap-version-14-specification.md
>
> My understanding is that the version 14 specification is not trying to
> add new features, but instead capture what is already implemented in
> the wild.

Oh! I didn't know about the work on TAP 14. I'll go read through this.

> > ## Here is what I propose for this patchset:
> >
> >  - Print out test number range at the beginning of each test suite.
> >  - Print out log lines as soon as they happen as diagnostics.
> >  - Print out the lines that state whether a test passes or fails as a
> > ok/not ok line.
> >
> > This would be technically conforming with TAP13 and is consistent with
> > what some kselftests have done.

This is what I fixed kselftest to actually do (it wasn't doing correct
TAP13), and Shuah is testing the series now:
https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/log/?h=ksft-tap-refactor
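For readers who haven't seen the format, output following the three rules quoted above (plan line up front, log lines as diagnostics, one ok/not ok line per test) looks roughly like this — the suite and test names are made up:

```
TAP version 13
1..2
# running suite: example_suite
ok 1 - test_alloc
# expected 2 but got 3
not ok 2 - test_free
```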

I'll go read TAP 14 now...

-- 
Kees Cook


Re: nvdimm crash at boot

2019-01-08 Thread Kees Cook
On Tue, Jan 8, 2019 at 3:54 PM Dan Williams  wrote:
>
> On Tue, Jan 8, 2019 at 3:34 PM Kees Cook  wrote:
> >
> > On Tue, Jan 8, 2019 at 3:28 PM Dan Williams  
> > wrote:
> > > Ah, thanks for the report! The key difference is that you don't define
> > > a "label area", so the driver bails out early and never initializes
> > > the security state.
> > >
> > > This should fix it up.
> > >
> > > diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> > > index 4890310df874..636cdb06ee17 100644
> > > --- a/drivers/nvdimm/dimm_devs.c
> > > +++ b/drivers/nvdimm/dimm_devs.c
> > > @@ -514,7 +514,7 @@ static umode_t nvdimm_visible(struct kobject *kobj, struct attribute *a, int n)
> > >
> > > if (a != &dev_attr_security.attr)
> > > return a->mode;
> > > -   if (nvdimm->sec.state < 0)
> > > +   if (!nvdimm->sec.ops || nvdimm->sec.state < 0)
> > > return 0;
> > > /* Are there any state mutation ops? */
> > > if (nvdimm->sec.ops->freeze || nvdimm->sec.ops->disable
> >
> > Okay, cool. I wasn't sure if that test needed a deeper check. :)
> >
> > Fixes: 37833fb7989a9 ("acpi/nfit, libnvdimm: Add freeze security
> > support to Intel nvdimm")
> > Tested-by: Kees Cook 
> >
>
> Actually, looking closer this should have been avoided by the fact
> that __nvdimm_create() initializes the security state early and that
> nvdimm->sec.state should have saved us.
>
> I'll dig a bit deeper with your qemu config.

Maybe something goes weird with pstore stealing the region?

-- 
Kees Cook


Re: nvdimm crash at boot

2019-01-08 Thread Kees Cook
On Tue, Jan 8, 2019 at 3:28 PM Dan Williams  wrote:
> Ah, thanks for the report! The key difference is that you don't define
> a "label area", so the driver bails out early and never initializes
> the security state.
>
> This should fix it up.
>
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 4890310df874..636cdb06ee17 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -514,7 +514,7 @@ static umode_t nvdimm_visible(struct kobject *kobj, struct attribute *a, int n)
>
> if (a != &dev_attr_security.attr)
> return a->mode;
> -   if (nvdimm->sec.state < 0)
> +   if (!nvdimm->sec.ops || nvdimm->sec.state < 0)
> return 0;
> /* Are there any state mutation ops? */
> if (nvdimm->sec.ops->freeze || nvdimm->sec.ops->disable

Okay, cool. I wasn't sure if that test needed a deeper check. :)

Fixes: 37833fb7989a9 ("acpi/nfit, libnvdimm: Add freeze security
support to Intel nvdimm")
Tested-by: Kees Cook 

Thanks!

-- 
Kees Cook


nvdimm crash at boot

2019-01-08 Thread Kees Cook
This is a warn that I added to fail more gracefully (sorry for
whitespace damage):

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 4890310df874..1161b994b1ec 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -516,6 +516,8 @@ static umode_t nvdimm_visible(struct kobject *kobj, struct attribute *a, int n)
return a->mode;
if (nvdimm->sec.state < 0)
return 0;
+   if (WARN_ON_ONCE(!nvdimm->sec.ops))
+   return 0;
/* Are there any state mutation ops? */
if (nvdimm->sec.ops->freeze || nvdimm->sec.ops->disable
|| nvdimm->sec.ops->change_key

Without it, I would crash at boot due to the sec.ops dereference. It's
not clear to me if there is a better solution than just the sec.ops
NULL test (i.e. should it ever be NULL?)

[1.393599] WARNING: CPU: 3 PID: 484 at
drivers/nvdimm/dimm_devs.c:519 nvdimm_visible+0x79/0x80
[1.393858] Modules linked in:
[1.393858] CPU: 3 PID: 484 Comm: kworker/u8:3 Not tainted 5.0.0-rc1+ #926
[1.393858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1ubuntu1 04/01/2014
[1.396781] Workqueue: events_unbound async_run_entry_fn
[1.396781] RIP: 0010:nvdimm_visible+0x79/0x80
[1.396781] Code: e8 4c fc ff ff eb c7 48 83 78 20 00 75 e6 48 83
78 10 00 75 df 48 83 78 28 00 75 d8 48 83 78 30 00 75 d1 b8 24 01 00
00 eb b1 <0f> 0b eb ad 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56
41 55
[1.396781] RSP: :b911803abd00 EFLAGS: 00010246
[1.396781] RAX:  RBX: 98cf5a80 RCX: 01a4
[1.396781] RDX: 0004 RSI: 98cf5a80 RDI: 94e7ed088028
[1.396781] RBP: b911803abd10 R08:  R09: 0001
[1.396781] R10: b911803abaf8 R11:  R12: 94e7ed088028
[1.396781] R13: 94e7ed088028 R14: 98cf5a60 R15: 
[1.396781] FS:  () GS:94e7efb8()
knlGS:
[1.396781] CS:  0010 DS:  ES:  CR0: 80050033
[1.396781] CR2:  CR3: 000150822001 CR4: 001606e0
[1.396781] Call Trace:
[1.396781]  internal_create_group+0xf4/0x380
[1.396781]  sysfs_create_groups+0x46/0xb0
[1.396781]  device_add+0x331/0x680
[1.396781]  nd_async_device_register+0x15/0x60
[1.396781]  async_run_entry_fn+0x38/0x100
[1.396781]  process_one_work+0x22b/0x5a0
[1.396781]  worker_thread+0x3f/0x3b0
[1.396781]  kthread+0x12b/0x150
[1.396781]  ? process_one_work+0x5a0/0x5a0
[1.396781]  ? kthread_park+0xa0/0xa0
[1.396781]  ret_from_fork+0x24/0x30
[1.396781] irq event stamp: 952
[1.396781] hardirqs last  enabled at (951): []
__slab_alloc.constprop.79+0x44/0x70
[1.396781] hardirqs last disabled at (952): []
trace_hardirqs_off_thunk+0x1a/0x1c
[1.396781] softirqs last  enabled at (0): []
copy_process.part.55+0x413/0x1f10
[1.396781] softirqs last disabled at (0): [<>]
  (null)
[1.396781] ---[ end trace 5608ce056f09564f ]---

I assume this crash is due to be using nvdimm without any special
markings (i.e. I'm using it crudely with pstore), in KVM:

RAM_SIZE=16384
NVDIMM_SIZE=128
MAX_SIZE=$(( RAM_SIZE + NVDIMM_SIZE ))

sudo qemu-system-x86_64 \
  ...
-machine pc,nvdimm \
-m ${RAM_SIZE}M,slots=2,maxmem=${MAX_SIZE}M \
-object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.img,size=${NVDIMM_SIZE}M,align=128M \
-device nvdimm,id=nvdimm1,memdev=mem1 \
...
-append '... ramoops.mem_size=1048576 ramoops.ecc=1
ramoops.mem_address=0x44000 ramoops.console_size=16384
ramoops.ftrace_size=16384 ramoops.pmsg_size=16384
ramoops.record_size=32768'

I assume 37833fb7989a9 ("acpi/nfit, libnvdimm: Add freeze security
support to Intel nvdimm") was where it started, but I didn't actually
bisect.

-- 
Kees Cook


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Kees Cook
On Mon, Apr 10, 2017 at 11:22 AM, Jeff Moyer <jmo...@redhat.com> wrote:
> Kees Cook <keesc...@chromium.org> writes:
>
>> On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer <jmo...@redhat.com> wrote:
>>> Kees Cook <keesc...@chromium.org> writes:
>>>
>>>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer <jmo...@redhat.com> wrote:
>>>>> Hi,
>>>>>
>>>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>>>> regions") causes some of my systems with persistent memory (whether real
>>>>> or emulated) to fail to boot with a couple of different crash
>>>>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>>>>> cpu, which causes much difficulty in extracting useful information from
>>>>> the console.  The second variant is an invalid paging request, listed
>>>>> below.
>>>>
>>>> Just to rule out some of the stuff in the boot path, does booting with
>>>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>>>> of the rearrangements done that are exposed under that commit, or if
>>>> it is genuinely the randomization that is killing the systems...)
>>>
>>> Adding "nokaslr" to the boot line does indeed make the problem go away.
>>
>> Are you booting with a memmap= flag?
>
> From my first email:
>
> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc5+
> root=/dev/mapper/rhel_intel--lizardhead--04-root ro memmap=192G!1024G
> crashkernel=auto rd.lvm.lv=rhel_intel-lizardhead-04/root
> rd.lvm.lv=rhel_intel-lizardhead-04/swap console=ttyS0,115200n81
> LANG=en_US.UTF-8
>
> Did you not receive the attachments?

I see it now, thanks!

The memmap parsing was added in -rc1 (f28442497b5ca), so I'd expect
that to be handled. Hmmm.

-Kees

-- 
Kees Cook
Pixel Security


Re: KASLR causes intermittent boot failures on some systems

2017-04-10 Thread Kees Cook
On Mon, Apr 10, 2017 at 8:49 AM, Jeff Moyer <jmo...@redhat.com> wrote:
> Kees Cook <keesc...@chromium.org> writes:
>
>> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer <jmo...@redhat.com> wrote:
>>> Hi,
>>>
>>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>>> regions") causes some of my systems with persistent memory (whether real
>>> or emulated) to fail to boot with a couple of different crash
>>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>>> cpu, which causes much difficulty in extracting useful information from
>>> the console.  The second variant is an invalid paging request, listed
>>> below.
>>
>> Just to rule out some of the stuff in the boot path, does booting with
>> "nokaslr" solve this? (i.e. I want to figure out if this is from some
>> of the rearrangements done that are exposed under that commit, or if
>> it is genuinely the randomization that is killing the systems...)
>
> Adding "nokaslr" to the boot line does indeed make the problem go away.

Are you booting with a memmap= flag?

-Kees

-- 
Kees Cook
Pixel Security


Re: KASLR causes intermittent boot failures on some systems

2017-04-07 Thread Kees Cook
On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer <jmo...@redhat.com> wrote:
> Hi,
>
> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> regions") causes some of my systems with persistent memory (whether real
> or emulated) to fail to boot with a couple of different crash
> signatures.  The first signature is a NMI watchdog lockup of all but 1
> cpu, which causes much difficulty in extracting useful information from
> the console.  The second variant is an invalid paging request, listed
> below.

Just to rule out some of the stuff in the boot path, does booting with
"nokaslr" solve this? (i.e. I want to figure out if this is from some
of the rearrangements done that are exposed under that commit, or if
it is genuinely the randomization that is killing the systems...)

> On some systems, I haven't hit this problem at all.  Other systems
> experience a failed boot maybe 20-30% of the time.  To reproduce it,
> configure some emulated pmem on your system.  You can find directions
> for that here: https://nvdimm.wiki.kernel.org/
>
> Install ndctl (https://github.com/pmem/ndctl).
> Configure the namespace:
> # ndctl create-namespace -f -e namespace0.0 -m memory
>
> Then just reboot several times (5 should be enough), and hopefully
> you'll hit the issue.
>
> I've attached both my .config and the dmesg output from a successful
> boot at the end of this mail.

Thanks! Considering I know nothing about pmem (yet), I bet there is
some oversight in what's happening with how KASLR scans for available
memory areas. I'll carve out some time next week to look into this.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v6] x86: fix kaslr and memmap collision

2017-01-10 Thread Kees Cook
On Tue, Jan 10, 2017 at 9:56 AM, Dave Jiang <dave.ji...@intel.com> wrote:
> CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
> However it does not take into account the memmap= parameter passed in from
> the kernel cmdline. This results in the kernel sometimes being put in
> the middle of a memmap region. Teach kaslr not to insert the kernel in
> memmap-defined regions. We will support up to 4 memmap regions; any
> additional regions will cause kaslr to be disabled. The mem_avoid set has
> been augmented with up to 4 unusable memmap regions provided by the
> user, excluding those regions from the set of valid address ranges for
> placing the uncompressed kernel image. The nn@ss ranges are skipped by
> the mem_avoid set since they indicate usable memory.
>
> Signed-off-by: Dave Jiang <dave.ji...@intel.com>

Thanks for the updates!

Acked-by: Kees Cook <keesc...@chromium.org>

-Kees

> ---
>  arch/x86/boot/boot.h             |    3 +
>  arch/x86/boot/compressed/kaslr.c |  132 ++
>  arch/x86/boot/string.c           |   38 +++
>  3 files changed, 172 insertions(+), 1 deletion(-)
>
>  v2:
>  Addressing comments from Ingo.
>  - Handle entire list of memmaps
>  v3:
>  Fix 32bit build issue
>  v4:
>  Addressing comments from Baoquan
>  - Not exclude nn@ss ranges
>  v5:
>  Addressing additional comments from Baoquan
>  - Update commit header and various coding style changes
>  v6:
>  Address comments from Kees
>  - Only fail for physical address randomization
>
> diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
> index e5612f3..59c2075 100644
> --- a/arch/x86/boot/boot.h
> +++ b/arch/x86/boot/boot.h
> @@ -332,7 +332,10 @@ int strncmp(const char *cs, const char *ct, size_t count);
>  size_t strnlen(const char *s, size_t maxlen);
>  unsigned int atou(const char *s);
> unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
> +long simple_strtol(const char *cp, char **endp, unsigned int base);
>  size_t strlen(const char *s);
> +char *strchr(const char *s, int c);
>
>  /* tty.c */
>  void puts(const char *);
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index a66854d..e919b70 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -11,6 +11,7 @@
>   */
>  #include "misc.h"
>  #include "error.h"
> +#include "../boot.h"
>
>  #include 
>  #include 
> @@ -56,11 +57,18 @@ struct mem_vector {
> unsigned long size;
>  };
>
> +/* only supporting at most 4 unusable memmap regions with kaslr */
> +#define MAX_MEMMAP_REGIONS 4
> +
> +static bool memmap_too_large;
> +
>  enum mem_avoid_index {
> MEM_AVOID_ZO_RANGE = 0,
> MEM_AVOID_INITRD,
> MEM_AVOID_CMDLINE,
> MEM_AVOID_BOOTPARAMS,
> +   MEM_AVOID_MEMMAP_BEGIN,
> +   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
> MEM_AVOID_MAX,
>  };
>
> @@ -77,6 +85,119 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
> return true;
>  }
>
> +/**
> + * _memparse - parse a string with mem suffixes into a number
> + * @ptr: Where parse begins
> + * @retptr: (output) Optional pointer to next char after parse completes
> + *
> + * Parses a string into a number.  The number stored at @ptr is
> + * potentially suffixed with K, M, G, T, P, E.
> + */
> +static unsigned long long _memparse(const char *ptr, char **retptr)
> +{
> +   char *endptr;   /* local pointer to end of parsed string */
> +
> +   unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
> +
> +   switch (*endptr) {
> +   case 'E':
> +   case 'e':
> +   ret <<= 10;
> +   case 'P':
> +   case 'p':
> +   ret <<= 10;
> +   case 'T':
> +   case 't':
> +   ret <<= 10;
> +   case 'G':
> +   case 'g':
> +   ret <<= 10;
> +   case 'M':
> +   case 'm':
> +   ret <<= 10;
> +   case 'K':
> +   case 'k':
> +   ret <<= 10;
> +   endptr++;
> +   default:
> +   break;
> +   }
> +
> +   if (retptr)
> +   *retptr = endptr;
> +
> +   return ret;
> +}
> +
> +static int
> +parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
> +{
> +   char *oldp;
> +

Re: [PATCH v5] x86: fix kaslr and memmap collision

2017-01-06 Thread Kees Cook
...
   /* more than 4 memmaps, fail kaslr */
   if ((i >= MAX_MEMMAP_REGIONS) && str) {
   memmap_too_large = true;
   rc = -EINVAL;
   }
...
}
...
static unsigned long find_random_phys_addr(unsigned long minimum,
   unsigned long image_size)
{
int i;
unsigned long addr;

/* Check if we had too many memmaps. */
if (memmap_too_large) {
debug_putstr("Aborted e820 scan (more than 4 memmap= arguments)!\n");
return 0;
}

/* Make sure minimum is aligned. */
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
...


And we should likely adjust this warning:

if (!random_addr) {
warn("KASLR disabled: could not find suitable E820 region!");

to something like:

if (!random_addr) {
warn("Physical KASLR disabled: no suitable memory region!");


-Kees

>>
>> > +
>> > boot_params->hdr.loadflags |= KASLR_FLAG;
>> >
>> > /* Prepare to add new identity pagetables on demand. */
>> > diff --git a/arch/x86/boot/string.c b/arch/x86/boot/string.c
>> > index cc3bd58..0464aaa 100644
>> > --- a/arch/x86/boot/string.c
>> > +++ b/arch/x86/boot/string.c
>> > @@ -122,6 +122,31 @@ unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base)
>> >  }
>> >
>> >  /**
>> > + * simple_strtoul - convert a string to an unsigned long
>> > + * @cp: The start of the string
>> > + * @endp: A pointer to the end of the parsed string will be placed here
>> > + * @base: The number base to use
>> > + */
>> > +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base)
>> > +{
>> > +   return simple_strtoull(cp, endp, base);
>> > +}
>> > +
>> > +/**
>> > + * simple_strtol - convert a string to a signed long
>> > + * @cp: The start of the string
>> > + * @endp: A pointer to the end of the parsed string will be placed here
>> > + * @base: The number base to use
>> > + */
>> > +long simple_strtol(const char *cp, char **endp, unsigned int base)
>> > +{
>> > +   if (*cp == '-')
>> > +   return -simple_strtoul(cp + 1, endp, base);
>> > +
>> > +   return simple_strtoul(cp, endp, base);
>> > +}
>> > +
>> > +/**
>> >   * strlen - Find the length of a string
>> >   * @s: The string to be sized
>> >   */
>> > @@ -155,3 +180,16 @@ char *strstr(const char *s1, const char *s2)
>> > }
>> > return NULL;
>> >  }
>> > +
>> > +/**
>> > + * strchr - Find the first occurrence of the character c in the string s.
>> > + * @s: the string to be searched
>> > + * @c: the character to search for
>> > + */
>> > +char *strchr(const char *s, int c)
>> > +{
>> > +   while (*s != (char)c)
>> > +   if (*s++ == '\0')
>> > +   return NULL;
>> > +   return (char *)s;
>> > +}
>> >



-- 
Kees Cook
Nexus Security