Re: [PATCH V7 05/18] x86/pks: Add PKS setup code

2021-11-25 Thread taoyi.ty

On 11/25/21 11:15 PM, Thomas Gleixner wrote:

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:

+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void setup_pks(void);


pks_setup()


+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+static DEFINE_PER_CPU(u32, pkrs_cache);
+u32 __read_mostly pkrs_init_value;
+
+/*
+ * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
+ * be checked quickly.
+ *
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU.  So to quote from the WRPKRU text:
+ *
+ * WRPKRU will never execute transiently. Memory accesses
+ * affected by PKRU register will not execute (even transiently)
+ * until all prior executions of WRPKRU have completed execution
+ * and updated the PKRU register.
+ */
+void write_pkrs(u32 new_pkrs)


pkrs_write()


+{
+   u32 *pkrs;
+
+   if (!static_cpu_has(X86_FEATURE_PKS))
+   return;


   cpu_feature_enabled() if at all. Why is this function even invoked
   when PKS is off?


+
+   pkrs = get_cpu_ptr(&pkrs_cache);


As far as I've seen this is mostly called from non-preemptible
regions. So that get/put pair is pointless. Stick a lockdep assert into
the code and disable preemption at the maybe one callsite which needs
it.


+   if (*pkrs != new_pkrs) {
+   *pkrs = new_pkrs;
+   wrmsrl(MSR_IA32_PKRS, new_pkrs);
+   }
+   put_cpu_ptr(pkrs);
+}
+
+/*
+ * Build a default PKRS value from the array specified by consumers
+ */
+static int __init create_initial_pkrs_value(void)
+{
+   /* All users get Access Disabled unless changed below */
+   u8 consumer_defaults[PKS_NUM_PKEYS] = {
+   [0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT
+   };
+   int i;
+
+   consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
+
+   /* Ensure the number of consumers is less than the number of keys */
+   BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
+
+   pkrs_init_value = 0;


This needs to be cleared because the BSS might be non zero?


+   /* Fill the defaults for the consumers */
+   for (i = 0; i < PKS_NUM_PKEYS; i++)
+   pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);


Also PKR_RW_BIT is a horrible invention:


+#define PKR_RW_BIT 0x0
  #define PKR_AD_BIT 0x1
  #define PKR_WD_BIT 0x2
  #define PKR_BITS_PER_PKEY 2


This makes my brain spin. How do you fit 3 bits into 2 bits per key?
That's really non-intuitive.

PKR_RW_ENABLE   0x0
PKR_ACCESS_DISABLE  0x1
PKR_WRITE_DISABLE   0x2

makes it obvious what this is about, no?

Aside of that, the function which sets up the init value is really
bogus. As you explained in the cover letter a kernel user has to:

1) Claim an index in the enum
2) Add a default value to the array in that function

Seriously? How is that any better than doing:

#define PKS_KEY0_DEFAULT    PKR_RW_ENABLE
#define PKS_KEY1_DEFAULT    PKR_ACCESS_DISABLE
...
#define PKS_KEY15_DEFAULT   PKR_ACCESS_DISABLE

and let the compiler construct pkrs_init_value?

TBH, it's not and this function has to be ripped out in case that you
need a dynamic allocation of keys some day. So what is this buying us
aside of horrible to read and utterly pointless code?


+   return 0;
+}
+early_initcall(create_initial_pkrs_value);
+
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+void setup_pks(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(pkrs_init_value);


Is the init value set up _before_ this function is invoked for the first
time?

Thanks,

 tglx

Setting up for cpu0 happens before create_initial_pkrs_value(), therefore the
PKRS value of cpu0 won't be set correctly.


[root@AliYun ~]# rdmsr -a 0x06E1
0
5554
5554
5554
5554
5554
5554
5554
5554
5554

Here are my test results after applying the patches.

Thanks.



Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros

2021-11-25 Thread Thomas Gleixner
On Thu, Nov 25 2021 at 15:25, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
>> @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
>>   */
>>  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>>  {
>> -int pkey_shift = pkey * PKR_BITS_PER_PKEY;
>> -
>>  /*  Mask out old bit values */
>> -pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
>> +pk_reg &= ~PKR_PKEY_MASK(pkey);
>>  
>>  /*  Or in new values */
>>  if (flags & PKEY_DISABLE_ACCESS)
>> -pk_reg |= PKR_AD_BIT << pkey_shift;
>> +pk_reg |= PKR_AD_KEY(pkey);
>>  if (flags & PKEY_DISABLE_WRITE)
>> -pk_reg |= PKR_WD_BIT << pkey_shift;
>> +pk_reg |= PKR_WD_KEY(pkey);
>
> I'm not seeing how this is improving that code. Quite the contrary.

Aside of that why are you ordering it the wrong way around, i.e.

   1) implement turd
   2) polish turd

instead of implementing the required helpers first if they are really
providing value.

Thanks,

tglx





Re: [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> @@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>   /* Load the Intel cache allocation PQR MSR. */
>   resctrl_sched_in();
>  
> + pkrs_write_current();

This is invoked from switch_to() and does this extra get/put_cpu_ptr()
dance in the write function for no reason. Can you please stop sticking
overhead into the hotpath just because?

Thanks,

tglx




Re: [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access()

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * Replace disable bits for @pkey with values from @flags
> + *
> + * Kernel users use the same flags as user space:
> + * PKEY_DISABLE_ACCESS
> + * PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)

pkey_ please.

> +{
> + int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> + /*  Mask out old bit values */
> + pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> + /*  Or in new values */
> + if (flags & PKEY_DISABLE_ACCESS)
> + pk_reg |= PKR_AD_BIT << pkey_shift;
> + if (flags & PKEY_DISABLE_WRITE)
> + pk_reg |= PKR_WD_BIT << pkey_shift;
> +
> + return pk_reg;

Also this code is silly.

#define PKEY_ACCESS_MASK    (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)

u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
{
	int shift = pkey * PKR_BITS_PER_PKEY;

	if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
		accessbits &= PKEY_ACCESS_MASK;

	pkval &= ~(PKEY_ACCESS_MASK << shift);
	return pkval | accessbits << shift;
}

See? It does not even need comments because it's self-explanatory and
uses sensible argument names matching the function name.





Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-11-25 Thread Thomas Gleixner
Ira,

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:   C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* add space for extended_pt_regs */
> +	subq	$EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> + .if \annotate_retpoline_safe == 1
> + ANNOTATE_RETPOLINE_SAFE
> + .endif

This annotation is new and nowhere mentioned why it is part of this
patch.

Can you please do _ONE_ functional change per patch and not an
unreviewable pile of changes in one go? Same applies for the ASM and the
C code changes. The ASM change has to go first and then the C code can
build upon it.

> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* remove space for extended_pt_regs */
> +	addq	$EXTENDED_PT_REGS_SIZE, %rsp
> +#endif

I really have to ask the question whether this #ifdeffery has any value
at all. 8 bytes extra stack usage is not going to be the end of the
world and distro kernels will enable that config anyway.

If we really want to save the space then certainly not by sprinkling
CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
extra sized ptregs in the pkrs header.

You are changing generic architecture code so you better think about
making such a change generic and extensible. Can folks please start
thinking beyond the brim of their teacup and not pretend that the
feature they are working on is the unicorn which requires unique magic
brandnamed after the unicorn of the day.

If the next feature comes around which needs to save something in that
extended area then we are going to change the world again, right?
Certainly not.

This wants to go into asm/ptrace.h:

struct pt_regs_aux {
u32 pkrs;
};

struct pt_regs_extended {
struct pt_regs_aux  aux;
struct pt_regs  regs __attribute__((aligned(8)));
};

and then have in asm-offset:

   DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));

which does the right thing whatever the size of pt_regs_aux is. So for
the above it will have:

 #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

Even, if you do

struct pt_regs_aux {
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
u32 pkrs;
#endif
};

and the config switch is disabled. It's still correct:

 #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

See? No magic hardcoded constant, no build time error checking for that
constant. Nothing, it just works.

That's one part, but let me come back to this:

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* add space for extended_pt_regs */
> +	subq	$EXTENDED_PT_REGS_SIZE, %rsp

What guarantees that RSP points to pt_regs at this point?  Nothing at
all. It's just pure luck and a question of time until this explodes in
hard to diagnose ways.

Because between

movq	%rsp, %rdi
and
call

can legitimately be other code which causes the stack pointer to
change. It's not the case today, but nothing prevents this in the
future.

The correct thing to do is:

movq	%rsp, %rdi
RSP_MAKE_PT_REGS_AUX_SPACE
call	...
RSP_REMOVE_PT_REGS_AUX_SPACE

The few extra macro lines in the actual code are way better as they make
it completely obvious what's going on and any misuse can be spotted
easily.

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)

That's a misnomer as this is invoked for _any_ exception not just
interrupts.

>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>   inhcall = get_and_clear_inhcall();
>   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> + /* Normally called by irqentry_exit, restore pkrs here */
> + pkrs_restore_irq(regs);
>   irqentry_exit_cond_resched();

Sigh. Consistency is overrated

> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);

So we have pkrs_write_current() and write_pkrs() now. Can you please
stick to a common naming scheme?

Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-11-25 Thread Thomas Gleixner
On Fri, Nov 12 2021 at 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny 
>> 
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>> 
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
>
> Andy,
>
> I'm preparing to send V8 of this PKS work.  But I have not seen any feedback
> since I originally implemented this in V4[1].
>
> Does this meet your expectations?  Are there any issues you can see with this
> code?
>
> I would appreciate your feedback.

Not Andy here, but I'll respond to the patch in a minute.

Thanks,

tglx