On Thu, Aug 15, 2024 at 02:18:15PM +0100, Joey Gouly wrote:
> Hi Catalin,
> 
> On Wed, Aug 14, 2024 at 04:03:47PM +0100, Catalin Marinas wrote:
> > Hi Joey,
> > 
> > On Tue, Aug 06, 2024 at 03:31:03PM +0100, Joey Gouly wrote:
> > > diff --git arch/arm64/kernel/signal.c arch/arm64/kernel/signal.c
> > > index 561986947530..ca7d4e0be275 100644
> > > --- arch/arm64/kernel/signal.c
> > > +++ arch/arm64/kernel/signal.c
> > > @@ -1024,7 +1025,10 @@ static int setup_sigframe_layout(struct 
> > > rt_sigframe_user_layout *user,
> > >                         return err;
> > >         }
> > >  
> > > -       if (system_supports_poe()) {
> > > +       if (system_supports_poe() &&
> > > +                       (add_all ||
> > > +                        mm_pkey_allocation_map(current->mm) != 0x1 ||
> > > +                        read_sysreg_s(SYS_POR_EL0) != POR_EL0_INIT)) {
> > >                 err = sigframe_alloc(user, &user->poe_offset,
> > >                                      sizeof(struct poe_context));
> > >                 if (err)
> > > 
> > > 
> > > That is, we only save the POR_EL0 value if any pkeys have been allocated 
> > > (other
> > > than pkey 0) *or* if POR_EL0 is a non-default value.
> > 
> > I had a chat with Dave as well on this and, in principle, we don't want
> > to add stuff to the signal frame unnecessarily, especially for old
> > binaries that have no clue of pkeys. OTOH, it looks like too complicated
> > for just 16 bytes. Also POR_EL0 all RWX is a valid combination, I don't
> > think we should exclude it.

Unfortunately, this is always going to be the obviously simpler and
more robust option for dealing with any new register state.

In effect, the policy will be to push back on unconditional additions
to the signal frame, except for 100% of proposed additions...


I'm coming round to the view that trying to provide absolute guarantees
about the signal frame size is unsustainable.  x86 didn't, and got away
with it for some time...  Maybe we should just get rid of the relevant
comments in headers, and water down guarantees in the SVE/SME
documentation to recommendations with no promise attached?

I can propose a patch for that.

> > 
> > If no pkey has been allocated, I guess we could skip this and it also
> > matches the x86 description of the PKRU being guaranteed to be preserved
> > only for the allocated keys. Do we reserve pkey 0 for arm64? I thought
> > that's only an x86 thing to emulate execute-only mappings.

It's not clear whether pkey 0 is reserved in the sense of being
permanently allocated, or in the sense of being unavailable for
allocation.

Since userspace gets pages with pkey 0 by default and can fiddle with
the permissions on POR_EL0 and set this pkey onto pages using
pkey_mprotect(), I'd say pkey 0 counts as always allocated; and the
value of the POR_EL0 bits for pkey 0 needs to be maintained.

> 
> To make it less complicated, I could drop the POR_EL0 check and just do:
> 
> -       if (system_supports_poe()) {
> +       if (system_supports_poe() &&
> +                       (add_all ||
> +                        mm_pkey_allocation_map(current->mm) != 0x1) {
> 
> This wouldn't preserve the value of POR_EL0 if no pkeys had been allocated, 
> but
> that is fine, as you said / the man pages say.
> 
> We don't preserve pkey 0, but it is the default for mappings and defaults to
> RWX. So changing it probably will lead to unexpected things.
> 
> > 
> > Another corner case would be the signal handler doing a pkey_alloc() and
> > willing to populate POR_EL0 on sigreturn. It will have to find room in
> > the signal handler, though I don't think that's a problem.
> 
> pkey_alloc() doesn't appear in the signal safety man page, but that might just
> be an omission due to permission keys being newer, than actually saying
> pkey_alloc() can't be used.

In practice this is likely to be a thin syscall wrapper; those are
async-signal-safe in practice on Linux (but documentation tends to take
a while to catch up).  (Exceptions exists where "safe" calls are used
in ways that interfere with the internal operation of libc... but those
cases are mostly at least semi-obvious and rarely documented.)

Using pkey_alloc() in a signal handler doesn't seem a great idea for
more straightforward reasons, though:  pkeys are a scarce, per-process
resource, and allocating them asynchronously in the presence of other
threads etc., doesn't seem like a recipe for success.

I haven't looked, but I'd be surprised if there's any code doing this!

Generally, it's too late to allocate any non-trivial kind of resource
one you're in a signal handler... you need to plan ahead.

> 
> If POR_EL0 isn't in the sig context, I think the signal handler could just
> write the POR_EL0 system register directly? The kernel wouldn't restore 
> POR_EL0
> in that case, so the value set in the signal handler would just be preserved.
> 
> The reason that trying to preserve the value of POR_EL0 without any pkeys
> allocated (like in the patch in my previous e-mail had) doesn't really make
> sense, is that when you do pkey_alloc() you have to pass an initial value for
> the pkey, so that will overwite what you may have manually written into
> POR_EL0. Also you can't pass an unallocated pkey value to pkey_mprotect().

My argument here was that from the signal handler's point of view, the
POR_EL0 value of the interrupted context lives in the sigframe if it's
there (and will then be restored from there), and directly in POR_EL0
otherwise.  Parsing the sigframe determine where the image of POR_EL0 is.

I see two potential problems.

1) (probably not a big deal in practice)

If the signal handler wants to withdraw a permission from pkey 0 for
the interrupted context, and the interrupted context had no permission
on any other pkey (so POR_EL0 is not dumped and the handler must update
POR_EL0 directly before returning).

In this scenario, the interrupted code would explode on return unless
it can cope with globally execute-only or execute-read-only permissions.
(no-execute is obviously dead on arrival).

If a signal handler really really wanted to do this, it could return
through an asm trampoline that is able to cope with the reduced
permissions.  This seems like a highly contrived scenario though, and I
can't see how it could be useful...

2) (possibly a bigger deal) pkeys(7) does say explicitly (well, sort of)
that the PKRU bits are restored on sigreturn.

Since there are generic APIs to manipulate pkeys, it might cause
problems if sigreturn restores the pkey permissions on some arches
but not on others.  Some non-x86-specific software might already be 
relying on the restoration of the permissions.


> That's a lot of words to say, or ask, do you agree with the approach of only
> saving POR_EL0 in the signal frame if num_allocated_pkeys() > 1?
> 
> Thanks,
> Joey

...So..., given all the above, it is perhaps best to go back to
dumping POR_EL0 unconditionally after all, unless we have a mechanism
to determine whether pkeys are in use at all.

If we initially trapped POR_EL0, and set a flag the first time it is
accessed or one of the pkeys syscalls is used, then we could dump
POR_EL0 conditionally based on that: once the flag is set, we always
dump it for that process.  If the first POR_EL0 access or pkeys API
interaction is in a signal handler, and that handler modifies POR_EL0,
then it wouldn't get restored (or at least, not automatically).  Not
sure if this would ever matter in practice.

It's potentially fiddly to make it work 100% consistently though (does
a sigreturn count as a write?  What if the first access to POR_EL0 is
through ptrace, etc.?)

Cheers
---Dave

Reply via email to